12 efficient NumPy and Pandas functions you shouldn't miss

Posted by Caps on Sun, 26 Dec 2021 12:47:55 +0100

In this article, we share 12 efficient NumPy and Pandas functions that will make data analysis easier and more convenient. A Jupyter Notebook with all the code used in this article can be found in the GitHub project below.

Project address: https://github.com/kunaldhariwal/12-Amazing-Pandas-NumPy-Functions

Six efficient NumPy functions

Let's start with NumPy. NumPy is a Python extension package for scientific computing. It provides a powerful N-dimensional array object, sophisticated functions, tools for integrating C/C++ and Fortran code, and useful linear algebra, Fourier transform and random number generation capabilities.

Beyond these obvious uses, NumPy can also serve as an efficient multidimensional container for generic data: arbitrary data types can be defined, which lets NumPy integrate seamlessly and quickly with a wide variety of databases.
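As a quick illustration of that "arbitrary data types" point, here is a minimal sketch (the field names and values are purely illustrative) of a structured NumPy dtype acting as a container for heterogeneous records:

```python
import numpy as np

# A structured dtype: each element holds a name (string), an age (int) and a score (float)
person = np.dtype([("name", "U10"), ("age", "i4"), ("score", "f8")])

people = np.array([("Ana", 34, 91.5), ("Bob", 28, 78.0)], dtype=person)

# Fields can be accessed like columns
print(people["name"])        # ['Ana' 'Bob']
print(people["age"].mean())  # 31.0
```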

Next, let's analyze six NumPy functions one by one.

1. argpartition()

With the help of argpartition(), NumPy can find the indices of the N largest values and output those indices. We can then sort the values as needed.

import numpy as np

x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])
index_val = np.argpartition(x, -4)[-4:]
index_val
array([1, 8, 2, 0], dtype=int64)
np.sort(x[index_val])
array([10, 12, 12, 16])

2. allclose()

allclose() compares two arrays element-wise and returns a Boolean result. If the two arrays are not equal within a given tolerance, allclose() returns False. The function is useful for checking whether two arrays are approximately equal.

array1 = np.array([0.12,0.17,0.24,0.29])
array2 = np.array([0.13,0.19,0.26,0.31])
# with a tolerance of 0.1, it should return False:
np.allclose(array1,array2,0.1)
False
# with a tolerance of 0.2, it should return True:
np.allclose(array1,array2,0.2)
True

3. clip()

clip() keeps the values of an array within an interval. Sometimes we need to ensure that values stay between an upper and a lower bound, and NumPy's clip() does exactly that: given an interval, values outside it are clipped to the interval's edges.

x = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])
np.clip(x,2,5)
array([3, 5, 5, 5, 2, 2, 5, 5, 2, 2, 5, 2])

4. extract()

As the name suggests, extract() extracts the elements of an array that satisfy a given condition. The condition can also combine operators such as and and or.

# Random integers
array = np.random.randint(20, size=12)
array
array([ 0,  1,  8, 19, 16, 18, 10, 11,  2, 13, 14,  3])
# Divide by 2 and check if remainder is 1
cond = np.mod(array, 2)==1
cond
array([False,  True, False,  True, False, False, False,  True, False,  True, False,  True])
# Use extract to get the values
np.extract(cond, array)
array([ 1, 19, 11, 13,  3])
# Apply condition on extract directly
np.extract(((array < 3) | (array > 15)), array)
array([ 0,  1, 19, 16, 18,  2])

5. where()

where() returns elements from an array that satisfy a given condition: called with just a condition, it returns the index positions of the matching values. where() is similar to the WHERE clause in SQL, as the following example shows:

y = np.array([1,5,6,8,1,7,3,6,9])
# Where y is greater than 5, returns index position
np.where(y>5)
(array([2, 3, 5, 7, 8], dtype=int64),)
# First will replace the values that match the condition,
# second will replace the values that do not
np.where(y>5, "Hit", "Miss")
array(['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Hit', 'Miss', 'Hit', 'Hit'], dtype='<U4')

6. percentile()

percentile() computes the nth percentile of the array elements along a given axis.

a = np.array([1,5,6,8,1,7,3,6,9])
print("50th Percentile of a, axis = 0 : ", np.percentile(a, 50, axis=0))
50th Percentile of a, axis = 0 :  6.0

b = np.array([[10, 7, 4], [3, 2, 1]])
print("30th Percentile of b, axis = 0 : ", np.percentile(b, 30, axis=0))
30th Percentile of b, axis = 0 :  [5.1 3.5 1.9]
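The axis argument also works the other way around: a small self-contained sketch (redefining the same 2-D array) showing that axis=1 computes one percentile per row instead of per column:

```python
import numpy as np

b = np.array([[10, 7, 4], [3, 2, 1]])

# axis=1 computes the 30th percentile along each row
row_pct = np.percentile(b, 30, axis=1)
print(row_pct)  # [5.8 1.6]
```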

Those are six efficient functions from the NumPy package that I believe will help you. Next, let's look at six functions from the Pandas data analysis library.

Six efficient Pandas functions

Pandas is also a Python package. It provides fast, flexible and highly expressive data structures, and aims to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data simple and intuitive.

Pandas is applicable to the following types of data:

  • Tabular data with heterogeneously typed columns, such as SQL tables or Excel spreadsheets;
  • Ordered and unordered (not necessarily fixed-frequency) time series data;
  • Arbitrary matrix data (homogeneous or heterogeneous) with row/column labels;
  • Any other form of observational or statistical data set. In fact, data need not be labeled at all to be placed into a Pandas structure.
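The first bullet above can be illustrated with a tiny sketch (the columns and values are made up) of a table whose columns hold different types, exactly the kind of data Pandas handles natively:

```python
import pandas as pd

# A small table with heterogeneously typed columns
df = pd.DataFrame({
    "city": ["Delhi", "Boston", "Shanghai"],   # strings
    "population_m": [32.9, 0.65, 24.9],        # floats
    "capital": [True, False, False],           # booleans
})
print(df.dtypes)  # each column keeps its own dtype
```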

1. read_csv(nrows=n)

One mistake many people make is reading a .csv file in full when they don't need all of it. If an unfamiliar .csv file is 10 GB, reading the whole thing would be very unwise: it not only occupies a lot of memory, but also takes a long time. All we need to do is import a few rows of the .csv file, and then continue importing more as needed.

import io
import pandas as pd
import requests

# I am using this online data set just to make things easier for you guys
url = "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/AirPassengers.csv"
s = requests.get(url).content
# read only first 10 rows
df = pd.read_csv(io.StringIO(s.decode('utf-8')), nrows=10, index_col=0)
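To "continue importing as needed", a common follow-up is the chunksize parameter of read_csv, which returns an iterator of DataFrames instead of loading everything at once. Here is a self-contained sketch using a small in-memory CSV (the data and the chunk size of 2 are arbitrary stand-ins for a large file on disk):

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a large file on disk
csv_data = "time,value\n1,100\n2,110\n3,120\n4,130\n5,140\n"

# chunksize yields DataFrames of at most 2 rows each, keeping memory use bounded
totals = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2):
    totals += chunk["value"].sum()

print(totals)  # 600
```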

2. map()

The map() function maps the values of a Series according to a corresponding input: it replaces each value in the Series with another value, which may be derived from a function, a dict or a Series.

# create a dataframe
dframe = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                      index=['India', 'USA', 'China', 'Russia'])
# compute a formatted string from each floating point value in frame
changefn = lambda x: '%.2f' % x
# Make changes element-wise
dframe['d'].map(changefn)

3. apply()

apply() lets the user pass a function and apply it along an axis of a Pandas DataFrame (by default, to each column).

# max minus min lambda fn
fn = lambda x: x.max() - x.min()
# Apply this on dframe that we've just created above
dframe.apply(fn)

4. isin()

isin() is used to filter DataFrames: it helps select rows whose value in a particular column belongs to a given set of values.

# Using the dataframe we created for read_csv
filter1 = df["value"].isin([112])
filter2 = df["time"].isin([1949.000000])
df[filter1 & filter2]

5. copy()

The copy() function copies a Pandas object. When a DataFrame is simply assigned to another variable, changing one also changes the other, because both names refer to the same underlying object. To prevent this, use copy().

# creating sample series
data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])
# Assigning issue that we face
data1 = data
# Change a value
data1[0] = 'USA'
# Also changes value in old series
data

# To prevent that, we use
# creating copy of series
new = data.copy()
# assigning new values
new[1] = 'Changed value'
# printing data
print(new)
print(data)

6. select_dtypes()

select_dtypes() returns a subset of the DataFrame's columns based on their dtypes. Its parameters can be set to include all columns of a particular data type, or to exclude columns of a particular data type.

# We'll use the same dataframe that we used for read_csv
framex = df.select_dtypes(include="float64")
# Returns only the time column
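The exclude parameter works the same way in reverse. A self-contained sketch (the columns here are made up for illustration) showing both directions:

```python
import pandas as pd

df2 = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5], "rank": [1, 2]})

# Keep only float64 columns
floats_only = df2.select_dtypes(include="float64")
# Keep everything except float64 columns
no_floats = df2.select_dtypes(exclude="float64")

print(list(floats_only.columns))  # ['score']
print(list(no_floats.columns))    # ['name', 'rank']
```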

Finally, pivot_table() is another very useful Pandas function. If you know anything about using pivot tables in Excel, it's very easy to get started with.

# Create a sample dataframe
school = pd.DataFrame({'A': ['Jay', 'Usher', 'Nicky', 'Romero', 'Will'],
                       'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                       'C': [26, 22, 20, 23, 24]})
# Let's create a pivot table to segregate students based on age and course
table = pd.pivot_table(school, values='A', index=['B', 'C'],
                       columns=['B'], aggfunc=np.sum, fill_value="Not Available")

table
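pivot_table() is more commonly used to aggregate a numeric column; here is a sketch of that pattern (with an illustrative mean aggregation over made-up course/age data):

```python
import pandas as pd

school2 = pd.DataFrame({"course": ["Masters", "Graduate", "Graduate", "Masters", "Graduate"],
                        "age": [26, 22, 20, 23, 24]})

# Mean age per course, one row per course label
table2 = pd.pivot_table(school2, values="age", index="course", aggfunc="mean")
print(table2)
```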

That's everything I wanted to share; I hope it helps you!

If you liked this article, please like and follow! Your support is a driving force that keeps me moving forward.

Topics: Python numpy pandas