Chapter 4: Data analysis with Python

Posted by sun373 on Mon, 17 Feb 2020 10:52:15 +0100

This chapter focuses on the basics of NumPy.
NumPy provides an easy-to-use C API, so data can be passed to external libraries written in a lower-level language, and those libraries can return their results as NumPy arrays. This makes it possible for Python to wrap existing C/C++/Fortran code bases and give them a dynamic, easy-to-use interface.
NumPy is very efficient at handling large arrays of data, for the following reasons:

  1. NumPy stores data internally on contiguous blocks of memory, unlike other Python built-in data structures.
  2. NumPy's algorithm library is written in C, so it can operate on the data memory directly, without type checking or other management overhead.
  3. NumPy arrays also use less memory than other Python built-in sequences.
  4. NumPy can perform complex calculations on a full array without writing Python loops.
    The calculation efficiency is compared as follows:
import numpy as np
my_arr = np.arange(1000000)
my_list = list(range(1000000))
%time for i in range(10) : my_arr2 = my_arr * 2
Wall time: 26 ms
%time for i in range(10) : my_list2 = [x * 2 for x in my_list]
Wall time: 998 ms
In this run the NumPy version is roughly 40 times faster. In general, NumPy-based code is 10 to 100 times faster than the pure Python equivalent and uses less memory.
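The %time lines above only work inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard timeit and sys modules can be used. The smaller array size and the way list memory is estimated here are assumptions of this sketch, not part of the original test.

```python
import sys
import timeit

import numpy as np

n = 100_000  # smaller than the 1,000,000 used above, to keep the run short
my_arr = np.arange(n)
my_list = list(range(n))

# Time 10 repetitions of each approach, mirroring the %time lines.
t_numpy = timeit.timeit(lambda: my_arr * 2, number=10)
t_list = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)
print(f"NumPy: {t_numpy:.4f} s, list comprehension: {t_list:.4f} s")

# Rough memory estimate: the array buffer versus the list plus its int objects.
arr_bytes = my_arr.nbytes
list_bytes = sys.getsizeof(my_list) + sum(sys.getsizeof(x) for x in my_list)
print(f"array: {arr_bytes} bytes, list: {list_bytes} bytes")
```

The exact timings depend on the machine, so no particular ratio should be expected.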

1. NumPy ndarray - multidimensional array object

An ndarray is a general multidimensional container for homogeneous data, that is, every element it contains has the same type.

1.1 generate ndarray
  1. The array function accepts any sequence-like object and generates an ndarray from it.
data1 = [1,2,3,4]
arr1 = np.array(data1)
arr1
array([1, 2, 3, 4])
arr1.shape
(4,)
arr1.ndim
1
arr1.dtype
dtype('int32')
  1. Nested sequences, such as lists of the same length, are automatically converted to multidimensional arrays.
data2 = [[1,2,3,4], [5,6,7,8]]
arr2 = np.array(data2)
arr2
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
  1. Other creation functions: zeros / ones / empty / ones_like / full_like / eye (identity matrix)
np.zeros(10)
np.zeros((2,3))
np.ones_like(data2)
np.full((2,3), 2)
np.full_like(data2, 3)
  1. arange is an array version of Python's built-in range
np.arange(10)
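The creation functions listed above can be tried together in one short, self-contained sketch. The input list data2 is the one defined earlier; the other variable names are only illustrative.

```python
import numpy as np

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]

zeros = np.zeros((2, 3))          # 2x3 array filled with 0.0
ones = np.ones_like(data2)        # same shape as data2, filled with 1
threes = np.full_like(data2, 3)   # same shape as data2, filled with 3
identity = np.eye(3)              # 3x3 identity matrix
uninit = np.empty((2, 2))         # allocated but not initialized; contents are arbitrary
evens = np.arange(0, 10, 2)       # start, stop (exclusive), step

print(zeros.shape, ones.shape, identity.shape, evens)
```

Note that np.empty does not guarantee zeros, so its contents should always be overwritten before use.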
1.2 data type of ndarray

arr1.dtype
arr2.astype(np.float64) # astype returns a new array converted to the given dtype
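As a small sketch of how dtype conversion behaves, the following converts integers to floats, floats to integers, and numeric strings to floats. The sample values are made up for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
float_arr = arr.astype(np.float64)   # new array; arr itself is unchanged

# Casting floats to integers truncates the fractional part.
truncated = np.array([3.7, -1.2, 0.5]).astype(np.int64)

# Strings that represent numbers can be converted as well.
parsed = np.array(["1.25", "-9.6", "42"]).astype(np.float64)

print(float_arr.dtype, truncated, parsed)
```

astype always creates a copy, even when the requested dtype is the same as the current one.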

1.3 array arithmetic

Arrays let you operate on whole batches of data without writing any for loop, which is called vectorization.
Operations between arrays of different shapes are handled by the broadcasting mechanism.
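A minimal sketch of vectorized arithmetic and broadcasting, using a small made-up array:

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

squared = arr * arr      # element-wise, no explicit loop
inverse = 1 / arr        # the scalar 1 is broadcast to every element
bigger = arr > 2.5       # comparisons are vectorized too

# Broadcasting a 1-D array of length 3 across both rows of the (2, 3) array.
col_means = arr.mean(axis=0)    # shape (3,)
demeaned = arr - col_means      # shape (2, 3)

print(demeaned)
```

Here col_means is stretched along axis 0 so that it is subtracted from every row.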

1.4 basic index and slice

Unlike Python's built-in list, the slice of an array is the view of the original array. This means that the data is not copied, and any changes to the view will be reflected in the original array.
If you want a copy of the array slice instead of a view, you can do the following:

arr[5:8].copy()
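The difference between a view and a copy can be checked directly. This is a small sketch; np.shares_memory is used only to make the relationship visible.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]            # a view: shares memory with arr
view[:] = 12               # writes through to arr
# arr is now [0, 1, 2, 3, 4, 12, 12, 12, 8, 9]

copied = arr[5:8].copy()   # an independent copy
copied[:] = 0              # arr is not affected

print(arr)
print(np.shares_memory(arr, view), np.shares_memory(arr, copied))
```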

Multidimensional arrays introduce the concept of an axis. For a two-dimensional array, axis 0 indexes the rows and axis 1 indexes the columns.
The difference between an index and a slice: indexing with an integer picks out a single element, row, or column and drops that dimension, whereas a slice keeps the dimension, so the result may still be multidimensional even when the slice covers only one row or one column.

arr2d = np.arange(9).reshape(3,3)
arr2d
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
arr_slice = arr2d[:2, 2]
arr_slice
array([2, 5])
arr_slice.shape
(2,)
arr_slice = arr2d[2:, 2:]
arr_slice
array([[8]])
arr_slice.shape
(1, 1)
arr_slice = arr2d[2, 2]
arr_slice
8
arr_slice.shape
()
arr_slice = arr2d[:, 2]
arr_slice
array([2, 5, 8])
arr_slice.shape
(3,)
arr_slice = arr2d[:, 2:]
arr_slice
array([[2],
       [5],
       [8]])
arr_slice.shape
(3, 1)
1.5 Boolean index

The length of the Boolean array must match the length of the array axis it indexes.
Selecting data with a Boolean index always creates a copy of the data.

data[names == 'Bob']
data[names != 'Bob']
data[~(names == 'Bob')] #Negate a condition
mask = (names == 'Bob') | (names == 'Will') #The Python keywords and and or do not work with Boolean arrays; use & and | instead.
data[mask]
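The lines above assume that names and data already exist. A self-contained sketch with made-up values (one row of data per name) might look like this:

```python
import numpy as np

# Hypothetical sample data: 7 names and a 7x4 array, one row per name.
names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.arange(28).reshape(7, 4)

bob_rows = data[names == "Bob"]       # rows 0 and 3
not_bob = data[~(names == "Bob")]     # the other 5 rows
mask = (names == "Bob") | (names == "Will")
bob_or_will = data[mask]              # rows 0, 2, 3, 4

# Boolean indexing returns a copy, so this does not modify data.
bob_rows[:] = -1

print(bob_or_will)
print(data[0])
```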
1.6 fancy index

Fancy indexing (sometimes translated as "magic index") means indexing with integer arrays, either to rearrange the array according to the index positions or to select a subset in a specific order.

arr = np.empty((8,4))
for i in range(8):
    arr[i] = i
arr
array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])
  1. Pass a list or array that contains the order you want
arr[[4, 3, 0, 6]]
array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [0., 0., 0., 0.],
       [6., 6., 6., 6.]])
  1. However, when there are multiple index arrays, a one-dimensional array will be selected according to the corresponding elements of each index tuple
arr = np.arange(32).reshape(8,4)
arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])
arr[[1, 5, 7, 2], [0, 3, 1, 2]]
array([ 4, 23, 29, 10]) # Elements taken out at (1,0) (5,3) etc
  1. To achieve a rectangular region formed by selecting a subset of rows and columns in a matrix, you can do the following
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
array([[ 4,  7,  5,  6],
       [20, 23, 21, 22],
       [28, 31, 29, 30],
       [ 8, 11,  9, 10]])
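As an alternative to chaining two indexing steps, NumPy's np.ix_ helper builds an open mesh from the row and column lists, so the same rectangular region can be selected in one step. This sketch is an addition to the approach above, not a replacement for it.

```python
import numpy as np

arr = np.arange(32).reshape(8, 4)
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

chained = arr[rows][:, cols]        # two-step selection, as in the text
with_ix = arr[np.ix_(rows, cols)]   # same region in a single indexing step

print(np.array_equal(chained, with_ix))
```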
1.7 array transpose and axis change

Transpose: arr.T
The transpose method accepts a tuple of axis numbers and permutes the axes accordingly.
ndarray also has a swapaxes method, which takes a pair of axis numbers and swaps those two axes to rearrange the data.

arr.swapaxes(1,2)
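Since swapaxes(1, 2) needs an array with at least three dimensions, here is a small sketch on a made-up 2x3x4 array that shows how each operation changes the shape:

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)

t = arr.T                         # reverses the axes: shape (4, 3, 2)
perm = arr.transpose((1, 0, 2))   # reorders the axes explicitly: shape (3, 2, 4)
swapped = arr.swapaxes(1, 2)      # exchanges axes 1 and 2: shape (2, 4, 3)

print(t.shape, perm.shape, swapped.shape)
```

swapaxes returns a view of the original data, so no copy is made.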

2. Universal functions -- fast element-wise array functions

A universal function (ufunc) is a vectorized wrapper around a simple function that operates element by element.
Examples: sqrt / exp / maximum / add, etc.
multiply multiplies the corresponding elements of two arrays; it is not matrix multiplication!
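A brief sketch of unary and binary ufuncs, including the contrast between np.multiply and the matrix product. The input values are arbitrary.

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[0.0, 5.0], [6.0, 1.0]])

roots = np.sqrt(x)                # unary ufunc
larger = np.maximum(x, y)         # binary ufunc: element-wise maximum
elementwise = np.multiply(x, y)   # same as x * y
matrix_product = x @ y            # matrix multiplication, for comparison

print(elementwise)
print(matrix_product)
```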

3. Array oriented programming

Using array expressions instead of explicit loops is called vectorization. In general, vectorized array operations are one or two orders of magnitude (or even more) faster than the equivalent implementation of pure python.

points = np.arange(-5,5,0.01)
xs, ys = np.meshgrid(points, points) #The meshgrid function receives two one-dimensional arrays and generates two-dimensional arrays based on all (x,y) pairs of the two arrays
xs
array([[-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       ...,
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99]])
z = np.sqrt(xs ** 2 + ys ** 2)
z
array([[7.07106781, 7.06400028, 7.05693985, ..., 7.04988652, 7.05693985,
        7.06400028],
       [7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
        7.05692568],
       [7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
        7.04985815],
       ...,
       [7.04988652, 7.04279774, 7.03571603, ..., 7.0286414 , 7.03571603,
        7.04279774],
       [7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
        7.04985815],
       [7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
        7.05692568]])
3.1 expressing conditional logic as array operations

The numpy.where function is the vectorized version of the ternary expression x if condition else y.

result = np.where(cond, xarr, yarr)

Equivalent to the following loop code, but much faster

result = [(x if c else y) for x,y, c in zip(xarr, yarr, cond)]

The second and third parameters of the where function can also be scalars, which are often used in data analysis to generate a new array based on an array.

np.where(arr > 0, 2, -2) #Replace positive values with 2 and all other values with -2 in array arr

The where function also supports scalar and array combination.

np.where(arr>0, 2, arr) #Replace only positive values with 2
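The np.where calls above rely on cond, xarr, yarr and arr being defined already. A self-contained sketch with made-up inputs:

```python
import numpy as np

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

result = np.where(cond, xarr, yarr)
loop_result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]

arr = np.array([[-1.5, 0.3], [2.0, -0.7]])
signs = np.where(arr > 0, 2, -2)     # both branches are scalars
capped = np.where(arr > 0, 2, arr)   # scalar for True, array for False

print(result)
print(signs)
print(capped)
```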
3.2 mathematical and statistical methods

Commonly used aggregate functions: sum, mean, std, min, max, cumsum (cumulative sum), cumprod (cumulative product).
You can choose the direction of an aggregate function by specifying the axis: axis = 0 aggregates down the rows (one result per column), and axis = 1 aggregates across the columns (one result per row).

arr = np.random.rand(5,4)
arr
array([[0.36089492, 0.96022134, 0.22585348, 0.51348969],
       [0.78062164, 0.37050394, 0.46274186, 0.33013851],
       [0.21334633, 0.93623858, 0.22104246, 0.44392382],
       [0.85695648, 0.14728384, 0.29550666, 0.49136345],
       [0.0505092 , 0.53460095, 0.6288425 , 0.65028889]])
arr.mean(axis=1)
array([0.51511486, 0.48600149, 0.4536378 , 0.44777761, 0.46606039]) #Get 5 values
arr.sum(axis=0)
array([2.26232858, 2.94884866, 1.83398696, 2.42920436]) #Get 4 values
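Because random values make the axis behaviour hard to verify by eye, here is a sketch on a small fixed array where the results can be checked by hand:

```python
import numpy as np

arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

col_sums = arr.sum(axis=0)       # collapse the rows: one value per column
row_means = arr.mean(axis=1)     # collapse the columns: one value per row
running = arr.cumsum(axis=0)     # cumulative sum down each column
products = arr.cumprod(axis=1)   # cumulative product along each row

print(col_sums, row_means)
print(running)
print(products)
```

Unlike sum and mean, cumsum and cumprod do not aggregate: they return an array of the same shape holding the intermediate results.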
3.3 method of Boolean array

The Boolean values True and False are coerced to 1 and 0, so the sum function can be used to count the number of True values in a Boolean array.
any checks whether there is at least one True in the array.
all checks whether every value in the array is True.
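A short sketch of these three methods. The random generator and its seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)     # seed 0 is an arbitrary choice
values = rng.standard_normal(100)
n_positive = (values > 0).sum()    # True counts as 1, False as 0

bools = np.array([False, False, True, False])
has_true = bools.any()             # True: at least one element is True
all_true = bools.all()             # False: not every element is True

print(n_positive, has_true, all_true)
```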

3.4 sorting

The sort method supports sorting by the specified axis.
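A minimal sketch of sorting along an axis, showing both the in-place sort method and np.sort, which returns a sorted copy:

```python
import numpy as np

arr = np.array([[5, 1, 4], [3, 9, 2]])

by_row = np.sort(arr, axis=1)   # sorted copy, each row sorted independently
arr.sort(axis=0)                # in place, each column sorted independently

print(by_row)
print(arr)
```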

3.5 unique values and other set logic

np.unique returns the array formed by sorting the unique values in the array.

np.unique(names)
sorted(set(names)) #Pure python implementation

np.in1d checks whether each value in one array is present in another array and returns a Boolean array. For new code, the NumPy documentation recommends the equivalent np.isin.
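A small sketch of the set operations, using made-up data. It uses np.isin, which is the function the NumPy documentation recommends for membership tests in new code.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
unique_names = np.unique(names)            # sorted unique values

values = np.array([6, 0, 0, 3, 2, 5, 6])
membership = np.isin(values, [2, 3, 6])    # element-wise membership test

print(unique_names)
print(membership)
```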

4. Use array for file input and output

np.save and np.load functions

arr = np.arange(10)
np.save('some_array', arr) # Saved uncompressed by default; the .npy suffix is added automatically
np.load('some_array.npy')
np.savez('array_archive.npz', a=arr, b=arr) #Save multiple arrays in one uncompressed archive
np.load('array_archive.npz') #Loading an .npz file returns a dictionary-like object; each array is accessed by its key
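The round trip can be sketched end to end as follows. Writing into a temporary directory is an assumption of this sketch, made so that no files are left behind.

```python
import os
import tempfile

import numpy as np

arr = np.arange(10)

with tempfile.TemporaryDirectory() as tmp:   # temporary location, chosen for this sketch
    npy_path = os.path.join(tmp, "some_array.npy")
    np.save(npy_path, arr)
    loaded = np.load(npy_path)

    npz_path = os.path.join(tmp, "array_archive.npz")
    np.savez(npz_path, a=arr, b=arr * 2)
    with np.load(npz_path) as archive:       # arrays are read lazily, by key
        keys = sorted(archive.files)
        b = archive["b"]

print(loaded, keys, b)
```

If file size matters, np.savez_compressed takes the same arguments and writes a compressed archive.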

5. Linear algebra

Note that in NumPy, * computes the element-wise product of two arrays, not the matrix product. Matrix multiplication requires the dot function; the matrix multiplication operator @ can also be used.

x.dot(y)
np.dot(x,y)
x @ np.ones(3)

numpy.linalg provides a standard set of functions for matrix decomposition, matrix inversion, and determinants. Commonly used linear algebra functions include diag, dot and trace (in the top-level numpy namespace) and det and inv (in numpy.linalg), etc.
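A brief sketch of a few of these functions on a made-up 2x2 matrix whose determinant is easy to check by hand:

```python
import numpy as np

x = np.array([[4.0, 7.0], [2.0, 6.0]])

det = np.linalg.det(x)       # 4*6 - 7*2 = 10, up to floating-point rounding
inv = np.linalg.inv(x)
identity = x @ inv           # close to the 2x2 identity matrix
q, r = np.linalg.qr(x)       # QR decomposition: x is approximately q @ r
tr = np.trace(x)             # sum of the diagonal: 4 + 6

print(det, tr)
print(identity)
```

Because of floating-point rounding, results like these should be compared with np.allclose rather than with ==.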

6. Pseudorandom number generation

The pseudo-random number is generated by the algorithm with deterministic behavior according to the random number seed in the random number generator.

np.random.seed(1234)

NumPy's random module can draw samples from various distributions, such as the uniform, normal (Gaussian), binomial, and chi-square distributions, each of which has a corresponding function.
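The following sketch shows that re-seeding reproduces the same sequence, and draws from a few of the distributions mentioned. The distribution parameters are arbitrary.

```python
import numpy as np

np.random.seed(1234)
first = np.random.normal(size=(2, 3))

np.random.seed(1234)   # same seed, same sequence
second = np.random.normal(size=(2, 3))

uniform = np.random.uniform(0, 1, size=5)         # values in [0, 1)
flips = np.random.binomial(n=10, p=0.5, size=5)   # successes out of 10 trials
chisq = np.random.chisquare(df=2, size=5)         # non-negative values

print(np.array_equal(first, second))
print(uniform, flips, chisq)
```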

7. Random walk

Simulating a random walk is a practical example of using NumPy.
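One common way to simulate a simple random walk with array operations is sketched below. It assumes a walk that starts at 0 and moves +1 or -1 with equal probability at each step; the seed, the number of steps, and the distance threshold of 10 are arbitrary choices.

```python
import numpy as np

np.random.seed(12345)   # arbitrary seed, for reproducibility
nsteps = 1000
draws = np.random.randint(0, 2, size=nsteps)   # 0 or 1 with equal probability
steps = np.where(draws > 0, 1, -1)             # map to +1 / -1
walk = steps.cumsum()                          # position after each step

farthest = np.abs(walk).max()
# First step at which the walk is at least 10 away from the origin, if it ever is.
reached = np.abs(walk) >= 10
first_crossing = reached.argmax() if reached.any() else None

print(walk.min(), walk.max(), farthest, first_crossing)
```

The whole walk is computed without a Python loop: where turns the draws into steps, and cumsum accumulates them into positions.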

