Machine learning process record 1
# np.meshgrid builds the coordinate matrices for a grid of points: it turns the plane into a mesh so that it can be colored as a whole later
# np.arange(start, stop, step) returns evenly spaced values from start (inclusive) up to stop (exclusive)
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))
# Color the mesh; cmap selects the colormap used for the different classes
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Pastel1)
# plt.xlim sets the displayed range of the x axis; plt.ylim does the same for the y axis
dataSet.shape[0]
# shape[0] is the number of rows of dataSet (its length along axis 0)
np.tile(inX, (dataSetSize, 1)) - dataSet
# np.tile repeats inX dataSetSize times along the rows so that it has the same shape as dataSet and the two can be subtracted. In the tuple, the first value is the number of repetitions along the rows (axis 0) and the second is the number along the columns (axis 1). If only one integer is given, the repetition is along the last axis
Each element of the resulting difference matrix is then squared (element-wise)
argsort returns the indices that would sort the array in ascending order. These indices can be used to look up the corresponding class labels
dict.get(key, default=None)
classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
 key: the key to look up in the dictionary.
 default: the value returned if the key is not in the dictionary.
If the label has already been counted, its count is increased by 1; if not, the count starts from 0 + 1. This keeps the code short
# key=operator.itemgetter(1) sorts by the values of the dictionary
# key=operator.itemgetter(0) sorts by the keys of the dictionary
# sorted is ascending by default; reverse=True gives descending order
# Note that sorted returns a list of (key, value) tuples, {k: v} -> [(k, v)], so element [0][0] of the result is the most frequent label
sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
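The pieces above (shape[0], np.tile, squaring, argsort, dict.get and sorted) fit together as a k-nearest-neighbour vote. A minimal sketch, assuming Euclidean distance; the function name classify0 and the toy data group/labels are made up for illustration:

```python
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    """Classify inX by majority vote among its k nearest rows of dataSet."""
    dataSetSize = dataSet.shape[0]                        # number of samples (rows)
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet    # repeat inX row-wise, then subtract
    sqDistances = (diffMat ** 2).sum(axis=1)              # element-wise square, summed per row
    distances = sqDistances ** 0.5                        # Euclidean distance to every sample
    sortedIdx = distances.argsort()                       # indices of samples, nearest first
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedIdx[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # list of (label, count) tuples, most frequent label first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# toy data (hypothetical): two points per class
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
result = classify0(np.array([0.0, 0.2]), group, labels, 3)
print(result)
```

With k=3 the two nearest neighbours belong to class 'B' and the third to class 'A', so the vote goes to 'B'.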
 fit_transform() does two things: fit learns the transformation parameters from the data (for standardization, the mean and standard deviation of each feature), and transform then applies them
 transform(): applies an already fitted transformation, such as normalization or standardization. The test data must be transformed with the same parameters that were learned from the training data, so call transform() on it directly instead of fit_transform(). Otherwise the two data sets are standardized with different parameters and are no longer comparable
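To make the fit / transform split concrete, here is a minimal NumPy sketch of a standardizer. It is not scikit-learn's implementation; the class name SimpleStandardizer and the toy arrays are assumptions for illustration only:

```python
import numpy as np

class SimpleStandardizer:
    """Toy standardizer illustrating fit / transform (not scikit-learn's code)."""

    def fit(self, X):
        # learn the parameters from the training data only
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        # apply the already learned parameters
        return (X - self.mean_) / self.std_

    def fit_transform(self, X):
        return self.fit(X).transform(X)

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = SimpleStandardizer()
train_scaled = scaler.fit_transform(X_train)   # fit on train, then transform train
test_scaled = scaler.transform(X_test)         # reuse the train mean/std on test
print(train_scaled.mean(axis=0), test_scaled)
```

Calling fit_transform on X_test instead would compute a new mean and standard deviation from the test data, which is exactly the mismatch the note above warns about.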
Whether a reinforcement learning algorithm uses the state transition probabilities determines whether it is model-free or model-based.
Algorithms that cannot simulate or predict how the environment changes are model-free, so Q-learning is model-free. To become model-based, Q-learning would additionally have to learn the mapping from a (state, action) pair to the next state.
df.mean(axis=1) takes the mean across the columns of each row (one value per row); it does not return the mean of each column. A simple way to remember this: axis=0 means "down" and axis=1 means "across", read as an adverb describing how the method moves
To put it another way:
 axis=0 means the method runs down the rows, along the index, and gives one result per column
 axis=1 means the method runs across the columns, along the column labels, and gives one result per row
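A small check of the axis rule, using a made-up two-column DataFrame (the column names a and b are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_means = df.mean(axis=0)   # runs down the rows: one mean per column
row_means = df.mean(axis=1)   # runs across the columns: one mean per row

print(col_means)
print(row_means)
```

col_means has one entry per column (a and b), while row_means has one entry per row.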
for element in a.flat:  # iterate element by element, from left to right, from top to bottom
    print(element)
Q-learning updates the Q values and then uses them to make decisions: the chosen action is the one with the largest Q value in the current state.
A Q value is the expected return of taking a certain action in the current state. In effect it looks into the future, which the immediate reward function r alone cannot do. Good decisions can therefore be made from an accurate Q table
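A minimal sketch of the tabular update and the greedy choice described above, assuming the standard Q-learning rule Q(s,a) += alpha * (r + gamma * max Q(s',·) - Q(s,a)). The function names, the state and action names and the values of alpha and gamma are illustrative assumptions:

```python
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step; no transition probabilities are needed."""
    best_next = max(Q[next_state].values())          # value of the best next action
    td_target = reward + gamma * best_next           # reward now + discounted future
    Q[state][action] += alpha * (td_target - Q[state][action])
    return Q[state][action]

def greedy_action(Q, state):
    """Pick the action with the largest Q value in the given state."""
    return max(Q[state], key=Q[state].get)

# hypothetical two-state, two-action Q table
Q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 1.0},
}
new_value = q_update(Q, "s0", "right", reward=0.5, next_state="s1")
print(new_value, greedy_action(Q, "s0"))
```

The update only uses the observed transition (state, action, reward, next_state), which is why the method counts as model-free.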
Jupyter
Shortcut key operation

Shortcut keys common to both modes
 Shift+Enter, run the code of this cell and jump to the next cell
 Ctrl+Enter, run the code of this cell and stay in this cell

Command mode: press ESC to enter
 Y. Switch cell to Code mode
 M. The cell switches to Markdown mode
 A. Add a cell above the current cell
 B. Add a cell under the current cell

Others (understand)
 Press D twice to delete the current cell
 Z. Undo the cell deletion
 L. Toggle line numbers for the current cell
 Ctrl+Shift+P, enter a command in the dialog box and run it directly
 Quickly jump to the first cell: Ctrl+Home
 Quickly jump to the last cell: Ctrl+End

Edit mode: press Enter to enter
 Code completion: press Tab after a variable or method name
 Comment / uncomment one or more lines of code: Ctrl + / (Mac: CMD + /)

Others (understand):
 Multi cursor operation: Ctrl click (Mac:CMD + click)
 Fallback: Ctrl+Z (Mac:CMD+Z)
 Redo: Ctrl+Y (Mac:CMD+Y)
Matplotlib

Drawing workflow
 1. Create canvas – plt.figure(figsize=(20,8))
 2. Draw image – plt.plot(x, y)
 3. Display image – plt.show()
 Image saving – plt.savefig()
Note: plt.savefig() must be called before plt.show()

Add x,y axis scale [know]
 plt.xticks()
 plt.yticks()
 Note: the first parameter must be the numeric tick positions, not strings. To display strings, pass them as the labels in the second parameter

Add grid display [know]
 plt.grid(linestyle="--", alpha=0.5)

Add description [know]
 plt.xlabel()
 plt.ylabel()
 plt.title()

Image saving [know]
 plt.savefig("path")

Plot multiple times [understand]
 Just call plt.plot() again for each additional line

Display legend [know]
 plt.legend(loc="best")
 Note: a label must be set in plt.plot(). If it is not set, the legend cannot be displayed

Multiple coordinate system display [understand]
 plt.subplots(nrows=, ncols=)

Application of line chart [know]
 1. Observing how data changes
 2. Plotting mathematical functions
Numpy
An ndarray stores its data in one contiguous block of memory, which makes batch operations on array elements faster.
This is possible because all elements of an ndarray have the same type, whereas the elements of a Python list can be of any type. The ndarray can therefore store its elements contiguously, while a native Python list stores references and has to follow each one to reach the next element. This also means that the ndarray is less general than the native Python list, but in scientific computing the ndarray removes the need for many loops, and the code is much simpler than with native lists.
Attribute name  Attribute interpretation 

ndarray.shape  Tuple of array dimension 
ndarray.ndim  Array dimension 
ndarray.size  Number of elements in the array 
ndarray.itemsize  Length of an array element (bytes) 
ndarray.dtype  Type of array element 
1 method of generating array
1.1 generate an array of 0 and 1
 np.ones(shape, dtype)
 np.ones_like(a, dtype)
 np.zeros(shape, dtype)
 np.zeros_like(a, dtype)
1.2 generate from existing arrays
1.2.1 generation method
 np.array(object, dtype)
 np.asarray(a, dtype)
a = np.array([[1,2,3],[4,5,6]])
# Create from an existing array: np.array makes a copy
a1 = np.array(a)
# np.asarray behaves like a reference: if the input is already an ndarray, no new array is created
a2 = np.asarray(a)
[Figure: differences between np.array and np.asarray]
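The difference can be checked by modifying the source array afterwards. A short sketch, assuming the input is already an ndarray (for other inputs such as lists, np.asarray also has to create a new array):

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
a1 = np.array(a)      # independent copy of the data
a2 = np.asarray(a)    # same object as a, nothing is copied

a[0, 0] = 100         # modify the original afterwards

print(a1[0, 0])       # the copy is unaffected
print(a2[0, 0])       # asarray result follows the original
print(a2 is a)
```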
1.3 generate fixed range array
1.3.1 np.linspace(start, stop, num, endpoint)
 Create an evenly spaced array by specifying the number of samples
 Parameters:
 start: the starting value of the sequence
 stop: the end value of the sequence
 num: the number of equally spaced samples to generate. The default value is 50
 endpoint: whether the sequence includes the stop value. The default value is True
# Generate an evenly spaced array
np.linspace(0, 100, 11)
Return result:
array([ 0., 10., 20., 30., 40., 50., 60., 70., 80., 90., 100.])
1.3.2 np.arange(start, stop, step, dtype)
 Create an evenly spaced array by specifying the step size (stop is not included)
 Parameters:
 step: step size. The default value is 1
np.arange(10, 50, 2)
Return result:
array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48])
1.3.3 np.logspace(start, stop, num)
 Create a geometric sequence (evenly spaced on a log scale, base 10 by default)
 Parameters:
 num: the number of samples to generate. The default value is 50
# Generate 10^x
np.logspace(0, 2, 3)
Return result:
array([ 1., 10., 100.])
1.4 generating random arrays
1.4.1 introduction to using module
 np.random module
1.4.2 normal distribution

np.random.randn(d0, d1, ..., dn)
Function: returns one or more samples from the standard normal distribution

np.random.normal(loc=0.0, scale=1.0, size=None)
loc: float
The mean of the distribution (the center of the whole distribution)
scale: float
The standard deviation of the distribution (its width: the larger the scale, the flatter and wider the curve; the smaller the scale, the taller and narrower)
size: int or tuple of ints
The output shape. The default is None, in which case a single value is returned

np.random.standard_normal(size=None)
Returns an array of the specified shape drawn from the standard normal distribution.
1.4.3 uniform distribution
 np.random.rand(d0,d1,...,dn)
 Returns uniformly distributed numbers in [0.0, 1.0).
 np.random.uniform(low=0.0, high=1.0, size=None)
 Function: samples from the uniform distribution over [low, high). The interval is closed on the left and open on the right, that is, it contains low and does not contain high
 Parameters:
 low: lower bound of the sampling interval, float, default 0;
 high: upper bound of the sampling interval, float, default 1;
 size: the output shape, int or tuple. For example, size=(m,n,k) returns m*n*k samples; by default a single value is returned.
 Return value: ndarray whose shape matches the size parameter.
 np.random.randint(low, high=None, size=None, dtype='l')
 Samples integers from a discrete uniform distribution and returns a single integer or an N-dimensional integer array
 Range: if high is not None, the integers are drawn from [low, high); otherwise they are drawn from [0, low).
3 shape modification
3.1 ndarray.reshape(shape, order)
 Returns an array with the same data but a different shape (a view where possible); the original array is not modified
 The elements keep their order; rows and columns are not swapped
# When reshaping, the total number of elements must match
stock_change.reshape([5, 4])
stock_change.reshape([-1, 10])  # the resulting shape is (2, 10); -1 means "compute this dimension from the others"
3.2 ndarray.resize(new_shape)
 Modifies the shape of the array itself, in place (keep the number of elements the same before and after)
 The elements keep their order; rows and columns are not swapped
stock_change.resize([5, 4])
# View the modified result
stock_change.shape
(5, 4)
3.3 ndarray.T
 Transpose of the array
 Swaps the rows and columns of the array
stock_change.T.shape
(4, 5)
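The following sketch contrasts reshape, resize and T on a made-up 4 x 5 array (the real stock_change data is not shown in these notes, so np.arange(20) stands in for it). resize is demonstrated on a separate array that owns its data, with refcheck=False so that the sketch also runs in interactive sessions:

```python
import numpy as np

stock_change = np.arange(20).reshape(4, 5)   # stand-in data, shape (4, 5)

reshaped = stock_change.reshape([5, 4])      # same element order, new shape
inferred = stock_change.reshape([-1, 10])    # -1 is computed from the rest: (2, 10)
transposed = stock_change.T                  # rows and columns are swapped

print(reshaped[0])     # first four elements in the original order
print(transposed[0])   # first column of the original array

owned = np.arange(20)
owned.resize((5, 4), refcheck=False)         # changes the array itself, returns None
print(owned.shape, stock_change.shape)       # reshape left stock_change untouched
```

reshaped and transposed have the same shape (5, 4) but different contents, which is why reshape cannot be used as a substitute for a transpose.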
4 type modification
4.1 ndarray.astype(type)
 Returns the array after the type is modified
stock_change.astype(np.int32)
4.2 ndarray.tostring([order]) or ndarray.tobytes([order])
 Construct Python bytes that contain the raw data bytes of the array (tostring is a deprecated alias in newer NumPy versions; prefer tobytes)
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[12, 3, 34], [5, 6, 7]]])
arr.tobytes()
5 array deduplication
5.1 np.unique()
temp = np.array([[1, 2, 3, 4],[3, 4, 5, 6]])
>>> np.unique(temp)
array([1, 2, 3, 4, 5, 6])
1 logic operation
# Generate data for 10 students and 5 courses
>>> score = np.random.randint(40, 100, (10, 5))
# Take the scores of the last four students for logical judgment
>>> test_score = score[6:, 0:5]
# Logical judgment: True if the score is greater than 60, otherwise False
>>> test_score > 60
array([[ True,  True,  True, False,  True],
       [ True,  True,  True, False,  True],
       [ True,  True, False, False,  True],
       [False,  True,  True,  True,  True]])
# Boolean indexing: set the elements that satisfy the condition to the specified value
>>> test_score[test_score > 60] = 1
>>> test_score
array([[ 1,  1,  1, 52,  1],
       [ 1,  1,  1, 59,  1],
       [ 1,  1, 44, 44,  1],
       [59,  1,  1,  1,  1]])
2 general judgment function
 np.all()
# Check whether the first two students [0:2, :] passed every course
>>> np.all(score[0:2, :] > 60)
False
 np.any()
# Check whether any score of the first two students [0:2, :] is greater than 80
>>> np.any(score[0:2, :] > 80)
True
3 np.where (ternary operator)
More complex operations can be performed with np.where
 np.where()
# For the first four students and the first four courses, set scores greater than 60 to 1, otherwise 0
temp = score[:4, :4]
np.where(temp > 60, 1, 0)
 Compound conditions need np.logical_and and np.logical_or
# For the first four students and the first four courses, set scores greater than 60 and less than 90 to 1, otherwise 0
np.where(np.logical_and(temp > 60, temp < 90), 1, 0)
# For the first four students and the first four courses, set scores greater than 90 or less than 60 to 1, otherwise 0
np.where(np.logical_or(temp > 90, temp < 60), 1, 0)
4 statistical operation
What should I do if I want to know the students' maximum or minimum score?
4.1 statistical indicators
In the field of data mining / machine learning, the value of statistical indicators is also a way for us to analyze problems. The commonly used indicators are as follows:
 min(a, axis)
 Return the minimum of an array or minimum along an axis.
 max(a, axis)
 Return the maximum of an array or maximum along an axis.
 median(a, axis)
 Compute the median along the specified axis.
 mean(a, axis, dtype)
 Compute the arithmetic mean along the specified axis.
 std(a, axis, dtype)
 Compute the standard deviation along the specified axis.
 var(a, axis, dtype)
 Compute the variance along the specified axis.
4.2 case: statistical calculation of student achievement
The meaning of axis depends on the API, so it is not always the same across Numpy. Here, axis=0 computes the statistic for each column and axis=1 for each row
# Next, do some statistics for the first four students
# Compute the statistics per column (axis=0)
temp = score[:4, 0:5]
print("Top four students,Maximum score of each subject:{}".format(np.max(temp, axis=0)))
print("Top four students,Minimum score of each subject:{}".format(np.min(temp, axis=0)))
print("Top four students,Performance fluctuation of each subject:{}".format(np.std(temp, axis=0)))
print("Top four students,Average score of each subject:{}".format(np.mean(temp, axis=0)))
result:
Top four students,Maximum score of each subject:[96 97 72 98 89]
Top four students,Minimum score of each subject:[55 57 45 76 77]
Top four students,Performance fluctuation of each subject:[16.25576821 14.92271758 10.40432602 8.0311892 4.32290412]
Top four students,Average score of each subject:[78.5 75.75 62.5 85. 82.25]
What if you need to find which student has the highest score in a subject?
 np.argmax(temp, axis=)
 np.argmin(temp, axis=)
print("For the top four students, the subscript of the student with the highest score in each subject:{}".format(np.argmax(temp, axis=0)))
result:
For the top four students, the subscript of the student with the highest score in each subject:[0 2 0 0 1]
Differences between np.matmul and np.dot:
Both perform matrix multiplication. np.matmul does not allow multiplying a matrix by a scalar, whereas np.dot does. For the inner product of two vectors, np.matmul and np.dot give the same result.
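A quick check of these three statements on a made-up 2 x 2 matrix and a 2-element vector:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
v = np.array([1, 1])

same_for_matrices = np.array_equal(np.dot(A, A), np.matmul(A, A))
same_for_vectors = np.dot(v, v) == np.matmul(v, v)

dot_scaled = np.dot(A, 2)            # np.dot accepts a scalar operand
try:
    np.matmul(A, 2)                  # np.matmul rejects a scalar operand
    matmul_scalar_ok = True
except ValueError:
    matmul_scalar_ok = False

print(same_for_matrices, same_for_vectors, matmul_scalar_ok)
```

For scaling a matrix by a scalar, the plain `*` operator is the usual choice.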
Pandas
These notes cover three Pandas structures: Series, DataFrame and MultiIndex (the old versions had a separate 3D structure called Panel).
Series is a one-dimensional data structure, DataFrame is a two-dimensional tabular data structure, and MultiIndex is a hierarchical index that lets a Series or DataFrame represent three-dimensional (or higher) data.
1.Series
Series is a data structure similar to a one-dimensional array. It can store any type of data, such as integers, strings and floating-point numbers. It consists of a set of data and the associated index.
1.1 creation of series
# Import pandas
import pandas as pd
pd.Series(data=None, index=None, dtype=None)
 Parameters:
 data: the incoming data, which can be an ndarray, list, etc.
 index: the index labels; must have the same length as the data. Labels are normally unique, although pandas does not enforce this. If no index is passed in, an integer index from 0 to n-1 is created automatically.
 dtype: type of the data
Create from existing data
 Specify content, default index
pd.Series(np.arange(10))
# Operation results
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
 Specify index
pd.Series([6.7,5.6,3,10,2], index=[1,2,3,4,5])
# Operation results
1     6.7
2     5.6
3     3.0
4    10.0
5     2.0
dtype: float64
 Create from dictionary data
color_count = pd.Series({'red':100, 'blue':200, 'green': 500, 'yellow':1000})
color_count
# Operation results (from an older pandas version that sorted the dictionary keys; newer versions keep the insertion order)
blue       200
green      500
red        100
yellow    1000
dtype: int64
1.2 Series properties
To make it easier to work with the index and the data of a Series object, Series provides two attributes: index and values
 index
color_count.index
# result
Index(['blue', 'green', 'red', 'yellow'], dtype='object')
 values
color_count.values
# result
array([ 200, 500, 100, 1000])
You can also get data by position (in newer pandas versions, prefer color_count.iloc[2] for positional access on a labelled Series):
color_count[2]
# result
100
2.DataFrame
DataFrame is an object similar to a two-dimensional array or table (such as Excel), with both a row index and a column index
 Row index: labels the different rows. It is called index and corresponds to axis 0 (axis=0)
 Column index: labels the different columns. It is called columns and corresponds to axis 1 (axis=1)
2.1 creation of dataframe
# Import pandas
import pandas as pd
pd.DataFrame(data=None, index=None, columns=None)
 Parameters:
 index: row labels. If no index parameter is passed in, an integer index from 0 to N-1 is created automatically.
 columns: column labels. If no columns parameter is passed in, an integer index from 0 to N-1 is created automatically.
 Create from existing data
Example 1:
pd.DataFrame(np.random.randn(2,3))
Example 2: create a student transcript
# Generate data for 10 students and 5 courses
score = np.random.randint(40, 100, (10, 5))
# result
array([[92, 55, 78, 50, 50],
       [71, 76, 50, 48, 96],
       [45, 84, 78, 51, 68],
       [81, 91, 56, 54, 76],
       [86, 66, 77, 67, 95],
       [46, 86, 56, 61, 99],
       [46, 95, 44, 46, 56],
       [80, 50, 45, 65, 57],
       [41, 93, 90, 41, 97],
       [65, 83, 57, 57, 40]])
However, in this form it is hard to see what the data means, and the readability is poor.
Question: how to make the data more meaningful?
# Using data structures in Pandas
score_df = pd.DataFrame(score)
 Add row and column index
# Construct the column labels
subjects = ["chinese", "mathematics", "English", "Politics", "Sports"]
# Construct the row labels
stu = ['Classmate ' + str(i) for i in range(score_df.shape[0])]
# Add the row and column index
data = pd.DataFrame(score, columns=subjects, index=stu)
2.2 DataFrame properties
 shape
data.shape
# result
(10, 5)
 index
Row index list of DataFrame
data.index
# result
Index(['Classmate 0', 'Classmate 1', 'Classmate 2', 'Classmate 3', 'Classmate 4', 'Classmate 5', 'Classmate 6', 'Classmate 7', 'Classmate 8', 'Classmate 9'], dtype='object')
 columns
Column index list of DataFrame
data.columns
# result
Index(['chinese', 'mathematics', 'English', 'Politics', 'Sports'], dtype='object')
 values
Get the underlying array directly
data.values
array([[92, 55, 78, 50, 50],
       [71, 76, 50, 48, 96],
       [45, 84, 78, 51, 68],
       [81, 91, 56, 54, 76],
       [86, 66, 77, 67, 95],
       [46, 86, 56, 61, 99],
       [46, 95, 44, 46, 56],
       [80, 50, 45, 65, 57],
       [41, 93, 90, 41, 97],
       [65, 83, 57, 57, 40]])
 T
Transpose
data.T
 head(5): displays the first 5 rows
If no parameter is given, the default is 5 rows. If parameter N is given, the first N rows are displayed
data.head(5)
 tail(5): displays the last 5 rows
If no parameter is given, the default is 5 rows. If parameter N is given, the last N rows are displayed
data.tail(5)
2.3 setting of DataFrame index
2.3.1 modifying row and column index values
stu = ["student_" + str(i) for i in range(score_df.shape[0])]
# The index must be replaced as a whole
data.index = stu
Note: the following way of modifying it is wrong
# Wrong: a single index element cannot be assigned
data.index[3] = 'student_3'
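The reason is that a pandas Index is immutable. A small sketch with a made-up three-row DataFrame shows the failing element assignment and the working whole-index replacement:

```python
import pandas as pd

data = pd.DataFrame({"score": [90, 80, 70]}, index=["a", "b", "c"])

try:
    data.index[1] = "student_1"      # element assignment on an Index
    in_place_ok = True
except TypeError:
    in_place_ok = False              # Index objects do not support mutation

# replacing the whole index works
data.index = ["student_" + str(i) for i in range(data.shape[0])]
print(in_place_ok, list(data.index))
```

To rename only some labels, data.rename(index={...}) returns a new DataFrame with the given labels replaced.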
2.3.2 reset index
 reset_index(drop=False)
 Resets the index to the default integer index
 drop: the default value is False, in which case the old index is kept as a column. If True, the old index is discarded
# Reset index, drop=False
data.reset_index()
2.3.3 set a column value as a new index
 set_index(keys, drop=True)
 keys: a column name or a list of column names
 drop: Boolean, default True. The column used as the new index is removed from the columns
Set new index case
1. Create
df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale':[55, 40, 84, 31]})
   month  sale  year
0      1    55  2012
1      4    40  2014
2      7    84  2013
3     10    31  2014
2. Set new index by month
df.set_index('month')
       sale  year
month
1        55  2012
4        40  2014
7        84  2013
10       31  2014
3. Set multiple indexes to year and month
df = df.set_index(['year', 'month'])
df
            sale
year month
2012 1        55
2014 4        40
2013 7        84
2014 10       31
Note: after this step, the DataFrame has a MultiIndex.
3.MultiIndex and Panel
3.1 MultiIndex
MultiIndex lets a DataFrame or Series hold three-dimensional (or higher-dimensional) data;
Multi level index (also known as hierarchical index) is an important feature of pandas. It allows two or more index levels on Series and DataFrame objects.
3.1.1 characteristics of multiindex
Print the row index of the df from above
df.index
MultiIndex(levels=[[2012, 2013, 2014], [1, 4, 7, 10]],
           labels=[[0, 2, 1, 2], [0, 1, 2, 3]],
           names=['year', 'month'])
Multi level or hierarchical index objects.
 index attributes
 names: the names of the levels
 levels: the unique values of each level
df.index.names
# FrozenList(['year', 'month'])
df.index.levels
# FrozenList([[2012, 2013, 2014], [1, 4, 7, 10]])
3.1.2 creation of multiindex
arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
# result
MultiIndex(levels=[[1, 2], ['blue', 'red']],
           codes=[[0, 0, 1, 1], [1, 0, 1, 0]],
           names=['number', 'color'])
3.2 Panel
3.2.1 panel creation
 class pandas.Panel(data=None, items=None, major_axis=None, minor_axis=None)
 Function: store Panel structure of 3D array
 Parameters:
 Data: ndarray or dataframe
 items: index or array like object, axis=0
 major_axis: index or array like object, axis=1
 minor_axis: index or array like object, axis=2
p = pd.Panel(data=np.arange(24).reshape(4,3,2),
             items=list('ABCD'),
             major_axis=pd.date_range('20130101', periods=3),
             minor_axis=['first', 'second'])
# result
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: A to D
Major_axis axis: 20130101 00:00:00 to 20130103 00:00:00
Minor_axis axis: first to second
3.2.2 viewing panel data
p[:,:,"first"]
p["B",:,:]
Note: Panel has been deprecated since pandas 0.20.0 and was removed in 0.25.0: the recommended way to represent 3D data is a DataFrame with a MultiIndex
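Since pd.Panel no longer exists in current pandas, the same 4 x 3 x 2 data can be held in a DataFrame with a two-level MultiIndex. A sketch under that assumption; the level names item and date are chosen here for illustration and are not part of the original notes:

```python
import numpy as np
import pandas as pd

values = np.arange(24).reshape(4, 3, 2)          # items x dates x columns

index = pd.MultiIndex.from_product(
    [list("ABCD"), pd.date_range("20130101", periods=3)],
    names=["item", "date"],                       # hypothetical level names
)
frame = pd.DataFrame(values.reshape(12, 2), index=index, columns=["first", "second"])

item_b = frame.loc["B"]                           # counterpart of p["B", :, :]
first_col = frame["first"].unstack("item")        # counterpart of p[:, :, "first"]

print(item_b)
print(first_col)
```

Selecting on the outer level gives a 3 x 2 DataFrame for one item, and unstacking one column gives a dates x items table.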
4 Summary
 Advantages of pandas [understand]
  Enhances chart readability
  Convenient data processing
  Easy file reading
  Wraps the plotting of Matplotlib and the computation of Numpy
 Series [know]
  Create
   pd.Series([], index=[])
   pd.Series({})
  Attributes
   Object.index
   Object.values
 DataFrame [Master]
  Create
   pd.DataFrame(data=None, index=None, columns=None)
  Attributes
   shape – shape
   index – row index
   columns – column index
   values – view values
   T – transpose
   head() – view the first rows
   tail() – view the last rows
  DataFrame index
   The index must be replaced as a whole when modifying it
   Object.reset_index()
   Object.set_index(keys)
 MultiIndex and Panel [understand]
  MultiIndex:
   Hierarchical index; can represent data similar to a 3D ndarray
   Create:
    pd.MultiIndex.from_arrays()
   Attributes:
    Object.index
  Panel:
   pd.Panel(data, items, major_axis, minor_axis)
   To view Panel data, index it down to a DataFrame or Series
1 index operation
In Numpy, we have already used indexing and slicing to select data. pandas supports similar operations, and you can also directly use column names and row names, or even a combination of them.
1.1 direct use of row column index (column before row)
Get the 'open' value on February 27, 2018
# Direct use of row and column index names (column first, then row)
data['open']['20180227']
23.53
# Unsupported operations
# error
data['20180227']['open']
# error
data[:1, :2]
1.2 using indexes in combination with loc or iloc
Get the 'open' values from February 27, 2018 to February 22, 2018
# Using loc: only the names of the row and column index can be used
data.loc['20180227':'20180222', 'open']
20180227    23.53
20180226    22.80
20180223    22.88
Name: open, dtype: float64
# Using iloc: select by integer position
# Get the data of the first 3 days and the first 5 columns
data.iloc[:3, :5]
           open   high  close    low
20180227  23.53  25.88  24.16  23.53
20180226  22.80  23.78  23.53  22.80
20180223  22.88  23.37  22.82  22.71
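The three access styles can be tried on a tiny stand-in for the stock table. The four rows below reuse values quoted in these notes; the real data file is not included, so this is only a sketch:

```python
import pandas as pd

data = pd.DataFrame(
    {"open": [23.53, 22.80, 22.88, 22.25], "close": [24.16, 23.53, 22.82, 22.28]},
    index=["20180227", "20180226", "20180223", "20180222"],
)

direct = data["open"]["20180227"]                    # column first, then row
by_label = data.loc["20180227":"20180223", "open"]   # label slice, end label included
by_pos = data.iloc[:3, :2]                           # position slice, end excluded

print(direct)
print(by_label)
print(by_pos)
```

Note the asymmetry: a loc slice includes its end label, while an iloc slice excludes its end position, like ordinary Python slicing.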
2 assignment operation
Reassign the close column in the DataFrame to 1
# Modify the original values directly
data['close'] = 1
# or
data.close = 1
3 sorting
There are two forms of sorting, one is to sort the index and the other is to sort the content
3.1 DataFrame sorting
 Use df.sort_values(by=, ascending=)
 Sort by a single key or by multiple keys
 Parameters:
 by: the key (column name) or list of keys to sort by
 ascending: ascending by default
 ascending=False: descending
 ascending=True: ascending
# Sort by the opening price; ascending specifies the sort direction
data.sort_values(by="open", ascending=True).head()
# Sort by multiple keys
data.sort_values(by=['open', 'high'])
 Use df.sort_index() to sort by the index
The date index of this stock data originally runs from latest to earliest; after sorting it runs from earliest to latest
# Sort by index
data.sort_index()
3.2 Series sorting
 Use series.sort_values(ascending=True)
A Series has only one column, so no by parameter is needed when sorting
data['p_change'].sort_values(ascending=True).head()
20150901   -10.03
20150914   -10.02
20160111   -10.02
20150715   -10.02
20150826   -10.01
Name: p_change, dtype: float64
 Use series.sort_index()
Works the same way as for a DataFrame
# Sort by index
data['p_change'].sort_index().head()
20150302    2.62
20150303    1.44
20150304    1.57
20150305    2.02
20150306    8.51
Name: p_change, dtype: float64
4 Summary
 1. Index [Master]
  Direct indexing: column first, then row, using the index labels
  loc: row first, then column, using the index labels
  iloc: row first, then column, using integer positions
  ix: row first, then column, mixing labels and positions (removed in recent pandas versions)
 2. Assignment [know]
  data[""] = **
  data. =
 3. Sort [know]
  dataframe
   Object.sort_values()
   Object.sort_index()
  series
   Object.sort_values()
   Object.sort_index()
1 arithmetic operation
 add(other)
For example, add a specific number to every value
data['open'].add(1)
20180227    24.53
20180226    23.80
20180223    23.88
20180222    23.25
20180214    22.49
 sub(other)
2 logic operation
2.1 logical operation symbols
 For example, filter the dates on which data["open"] > 23
 data["open"] > 23 returns a Boolean result
data["open"] > 23
20180227     True
20180226    False
20180223    False
20180222    False
20180214    False
# The Boolean result can be used as a filter
data[data["open"] > 23].head()
 Combine multiple conditions
data[(data["open"] > 23) & (data["open"] < 24)].head()
2.2 logic operation function
 query(expr)
 expr: query string
query makes the same filtering shorter and easier to read
data.query("open<24 & open>23").head()
 isin(values)
For example, check whether 'open' is 23.53 or 23.85
# Test for membership in the given values, then use the result as a filter
data[data["open"].isin([23.53, 23.85])]
3 statistical operation
3.1 describe
Comprehensive analysis: many statistics can be obtained in one call, such as count, mean, std, min and max
# Calculate the mean, standard deviation, maximum and minimum values
data.describe()
When a single statistical function is used, the statistic is computed for each column by default (axis=0). To compute it for each row, pass axis=1

max(), min()
# Use the statistical function: 0 gives one result per column, 1 gives one result per row
data.max(0)
1 pandas.DataFrame.plot
 DataFrame.plot(kind='line')
 kind: str, the type of plot to draw
 'line' : line plot (default)
 'bar' : vertical bar plot
 'barh' : horizontal bar plot
 Interpretation of "barh":
 http://pandas.pydata.org/pandasdocs/stable/reference/api/pandas.DataFrame.plot.barh.html
 'hist' : histogram
 'pie' : pie plot
 'scatter' : scatter plot
1 CSV
1.1 read_csv
 pandas.read_csv(filepath_or_buffer, sep=',', usecols)
 filepath_or_buffer: file path
 sep: separator, "," by default
 usecols: the column names to read, as a list
 Example: read the earlier stock data
# Read the file and keep only the 'open' and 'close' columns
data = pd.read_csv("./data/stock_day.csv", usecols=['open', 'close'])
           open  close
20180227  23.53  24.16
20180226  22.80  23.53
20180223  22.88  22.82
20180222  22.25  22.28
20180214  21.49  21.92
1.2 to_csv
 DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, mode='w', encoding=None)
 path_or_buf: file path
 sep: separator, "," by default
 columns: the columns to write
 header: Boolean or list of strings, default True; whether to write the column names
 index: whether to write the row index
 mode: 'w' overwrites, 'a' appends
 Example: save the stock data that was read
 Save the 'open' column, then read the file back and view the result
# Save only 10 rows so that the result is easy to inspect
data[:10].to_csv("./data/test.csv", columns=['open'])
# Read and view the result
pd.read_csv("./data/test.csv")
   Unnamed: 0   open
0    20180227  23.53
1    20180226  22.80
2    20180223  22.88
3    20180222  22.25
4    20180214  21.49
5    20180213  21.40
6    20180212  20.70
7    20180209  21.20
8    20180208  21.79
9    20180207  22.69
The index was written to the file and came back as a separate data column. To avoid this, pass index=False and save the file again.
# index=False: the index is not written to the file as a column
data[:10].to_csv("./data/test.csv", columns=['open'], index=False)
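The effect of the index parameter can be reproduced without touching the disk, because to_csv returns the CSV text when no path is given. A sketch with two made-up rows; io.StringIO stands in for the file:

```python
import io
import pandas as pd

data = pd.DataFrame({"open": [23.53, 22.80]}, index=["20180227", "20180226"])

with_index = data.to_csv(columns=["open"])                  # index written as first column
without_index = data.to_csv(columns=["open"], index=False)  # index left out

reread = pd.read_csv(io.StringIO(with_index))
reread_plain = pd.read_csv(io.StringIO(without_index))
reread_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(reread.columns))
print(list(reread_plain.columns))
print(list(reread_indexed.columns))
```

If the index is meaningful (as the dates are here), an alternative to dropping it is to keep it in the file and read it back with index_col=0.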
1 how to deal with nan
 First find out how missing values are marked (NaN or some other marker)
 If missing values are marked with NaN
  Check whether the data contains NaN:
   pd.isnull(df)
   pd.notnull(df)
  Handle the NaN values:
   1. Drop rows with missing values: dropna(axis='rows')
    Note: the original data is not modified; the return value must be assigned
   2. Replace missing values: fillna(value, inplace=True)
    value: the value to fill in
    inplace: True modifies the original data; False leaves the original data unchanged and returns a new object
 If missing values are not marked with NaN, for example with "?"
  First replace '?' with np.nan, then continue as above
2.2 missing values are present and marked with np.nan
 1. Delete
dropna only works if the missing values are np.nan
# The original data is not modified
movie.dropna()
# Assign the result to a new variable or reuse the original variable name
data = movie.dropna()
 2. Replace missing values
# Fill the missing values of a column with its mean (or median)
# movie['Revenue (Millions)'].fillna(movie['Revenue (Millions)'].mean(), inplace=True)
Replace all missing values:
for i in movie.columns:
    if np.all(pd.notnull(movie[i])) == False:
        print(i)
        # assign the result back; calling fillna with inplace=True on a selected column may not update movie in newer pandas versions
        movie[i] = movie[i].fillna(movie[i].mean())
Analysis of treatment ideas:
 1. First replace '?' with np.nan
  df.replace(to_replace=, value=)
   to_replace: the value before replacement
   value: the value after replacement
# Replace missing values that are marked by some other value with np.nan
wis = wis.replace(to_replace='?', value=np.nan)
 2. Then handle the missing values
# Delete
wis = wis.dropna()
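The two steps can be tried end to end on a toy frame. The column names id and bare_nuclei are hypothetical, and the final conversion to numbers is an extra step not in the notes above (a column that contained '?' is read as text):

```python
import numpy as np
import pandas as pd

# Hypothetical data where a missing value is marked with '?'.
wis = pd.DataFrame({"id": [1, 2, 3], "bare_nuclei": ["1", "?", "10"]})

# Step 1: turn the '?' marker into a real NaN.
wis = wis.replace(to_replace="?", value=np.nan)

# Step 2: drop the rows that now contain NaN.
wis = wis.dropna()

# Extra step (assumption): convert the remaining strings to numbers.
wis["bare_nuclei"] = pd.to_numeric(wis["bare_nuclei"])
```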
 pd.qcut(data, q):
  Splits the data into q groups of roughly equal size (quantile-based bins). It is usually used together with value_counts to count the number of values in each group.
 series.value_counts(): counts how many values fall in each group
User-defined interval grouping:
 pd.cut(data, bins)
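A short sketch of both grouping calls, using made-up values 1 to 8 and arbitrary bin edges chosen for illustration:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# qcut: 4 quantile-based groups, so each group gets the same number of values here.
quartiles = pd.qcut(values, 4)
quartile_counts = quartiles.value_counts()

# cut: user-defined edges; intervals are right-closed by default: (0, 2], (2, 5], (5, 8].
custom = pd.cut(values, bins=[0, 2, 5, 8])
custom_counts = custom.value_counts().sort_index()
```

With these inputs every quartile holds 2 values, while the custom bins hold 2, 3 and 3 values.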
pd.concat to realize data merging
 pd.concat([data1, data2], axis=1)
 Concatenates along rows or columns: axis=0 stacks the frames vertically (more rows), and axis=1 places them side by side (more columns), aligned on the row index.
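The difference between the two axes shows up directly in the result shapes. A minimal sketch with two made-up single-column frames:

```python
import pandas as pd

data1 = pd.DataFrame({"a": [1, 2]})
data2 = pd.DataFrame({"b": [3, 4]})

# axis=1: side by side, rows matched on the index -> 2 rows, columns a and b.
side_by_side = pd.concat([data1, data2], axis=1)

# axis=0: stacked vertically -> 4 rows; cells a frame does not have become NaN.
stacked = pd.concat([data1, data2], axis=0)
```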
pd.merge
 pd.merge(left, right, how='inner', on=None)
 Joins left and right on the key columns that the two sets of data share.
 left: a DataFrame
 right: another DataFrame
 on: the common key column(s) to join on
 how: the join type, such as 'left', 'right', 'outer' or 'inner' (the default is 'inner')
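How the how parameter changes the result can be seen on two small made-up frames that share a key column (all names and values here are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": [4, 5, 6]})

# inner: only keys present in both frames (K1, K2).
inner = pd.merge(left, right, how="inner", on="key")

# outer: every key from either frame (K0 to K3), NaN where one side has no match.
outer = pd.merge(left, right, how="outer", on="key")

# left: every key from the left frame, NaN in B where right has no match.
left_join = pd.merge(left, right, how="left", on="key")
```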
kernel function
Mapping data to a high-dimensional representation in which the classification problem becomes simpler is made practical by the kernel trick, the core idea that kernel methods are named after.
The basic idea is: to find a good decision hyperplane in the new representation space, you do not need to compute the coordinates of the points in the new space explicitly. You only need the distance between pairs of points in the new space, and that can be computed efficiently with a kernel function. A kernel function is a computationally tractable operation that maps any two points in the original space to the distance between them in the target representation space, completely bypassing the explicit computation of the new representation. The kernel function is usually chosen by hand rather than learned from the data; for an SVM, only the separating hyperplane is learned.
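As a small numeric check of this idea, the sketch below uses the degree-2 polynomial kernel k(a, b) = (a . b)^2 as one example of a hand-chosen kernel. The feature map phi and the two points are illustrative choices, not taken from the notes; the squared distance in the new space is obtained from kernel values alone via k(x, x) - 2 k(x, y) + k(y, y):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D point (the 'new representation')."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: k(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Squared distance in the new space, computed by mapping both points explicitly...
explicit = np.sum((phi(x) - phi(y)) ** 2)

# ...and computed from kernel values only, without ever calling phi.
via_kernel = poly_kernel(x, x) - 2 * poly_kernel(x, y) + poly_kernel(y, y)

print(explicit, via_kernel)
```

Both routes give the same squared distance (123 for these two points, up to floating-point rounding), but the second one never builds the higher-dimensional vectors.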
Before deep learning, people had to spend a lot of time on feature engineering, that is, on processing the data so that the input was better suited to the chosen method.
Dropout
Dropout: during forward propagation, the activation of each neuron is switched off (set to zero) with a certain probability p. This helps the model generalize, because it cannot rely too heavily on a few local features.
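A minimal NumPy sketch of the forward pass described above. The function name dropout_forward is hypothetical, and the rescaling by 1 / (1 - p) ("inverted dropout") is a common convention assumed here, not something stated in the notes; it keeps the expected activation unchanged so nothing has to be adjusted at test time:

```python
import numpy as np

def dropout_forward(activations, p, rng, training=True):
    """Zero each activation with probability p during training (inverted dropout)."""
    if not training or p == 0.0:
        return activations
    # keep_mask is True with probability 1 - p.
    keep_mask = rng.random(activations.shape) >= p
    # Rescale the surviving activations so the expected value stays the same.
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
out = dropout_forward(a, p=0.5, rng=rng)

# Roughly half of the activations are zeroed; the rest are scaled to 2.0.
zero_fraction = float((out == 0).mean())
```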
convolution
The patterns learned by a convolutional neural network are translation invariant. After learning a pattern in the lower right corner of an image, a convolutional neural network can recognize that pattern anywhere, for example in the upper left corner. A densely connected network would have to learn the pattern all over again if it appeared in a new location. This lets convolutional neural networks use data efficiently when processing images (because the visual world is fundamentally translation invariant): they need fewer training samples to learn representations that generalize.
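The point that one learned filter responds to its pattern wherever it appears can be illustrated with a hand-written sliding-window filter. This is only a sketch: the 2x2 pattern, the 6x6 images and the helper names are made up, and a fixed filter stands in for a learned one:

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A small diagonal pattern; the same array is used as the filter.
pattern = np.array([[1.0, 0.0], [0.0, 1.0]])

def image_with_pattern(row, col, size=6):
    """Blank image with the pattern placed with its top-left corner at (row, col)."""
    img = np.zeros((size, size))
    img[row:row + 2, col:col + 2] = pattern
    return img

lower_right = correlate2d_valid(image_with_pattern(4, 4), pattern)
upper_left = correlate2d_valid(image_with_pattern(0, 0), pattern)

# The peak response is the same in both images; only its position moves.
peak_lower_right = np.unravel_index(np.argmax(lower_right), lower_right.shape)
peak_upper_left = np.unravel_index(np.argmax(upper_left), upper_left.shape)
```

The same filter weights produce the same peak value in both images, which is why nothing has to be relearned when the pattern moves.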
word2vec
word2vec: it can be understood as a dimensionality reduction of one-hot word vectors. An n-dimensional one-hot vector is transformed into an m-dimensional real-valued vector through a mapping (the points on the original coordinate axes are compressed and embedded into a more compact space). Because multiplying a one-hot vector by a matrix simply selects one row, the k-th row of the n*m matrix that represents the mapping is the vector of the k-th word in the corpus vocabulary.
There are two main ways to train word vectors on a corpus with this compress-and-reduce approach:
Skip-gram: a fully connected neural network with one hidden layer and no activation function in the hidden layer. The output layer uses a softmax classifier to output probabilities. The input is a word, the output is the probability of each word being in the context of the input word, and the ground truth is a word from the context of the input word.
CBOW: the principle is similar to skip-gram, but the input is the context and the output is the central word of that context.
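The row-selection property and a skip-gram style forward pass can be sketched in a few lines of NumPy. The sizes n and m, the random weights and the names W_in and W_out are illustrative assumptions; no training is performed here:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3                        # vocabulary size and embedding size (illustrative)
W_in = rng.normal(size=(n, m))     # input-to-hidden weights: one row per word
W_out = rng.normal(size=(m, n))    # hidden-to-output weights

# One-hot vector for the word with index k.
k = 2
one_hot = np.zeros(n)
one_hot[k] = 1.0

# Hidden layer without activation: the product just picks row k of W_in.
hidden = one_hot @ W_in

# Output layer: softmax over the vocabulary gives context-word probabilities.
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()
```

In practice the one-hot multiplication is replaced by a direct row lookup, which is why the rows of W_in are read off as the word vectors.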
AdaBoost
The "Ada" in AdaBoost stands for adaptive.
The operation process is as follows: each sample in the training data is given a weight, and these weights form the vector D. At first, the weights are initialized to equal values. A weak classifier is first trained on the training data and its error rate is calculated; then a weak classifier is trained again on the same data set. For this second round of training, the weight of each sample is readjusted: the weights of samples classified correctly in the first round are reduced, and the weights of samples classified incorrectly are increased. To get the final classification result from all weak classifiers, AdaBoost assigns each classifier a weight alpha, which is calculated from the error rate of that weak classifier.
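One round of this reweighting can be written out numerically. The sketch below assumes the standard discrete AdaBoost formulas, which the notes do not spell out: alpha = 0.5 * ln((1 - error) / error), and each weight is multiplied by exp(-alpha) if the sample was classified correctly and by exp(alpha) otherwise, then renormalized. The labels and predictions are made up:

```python
import numpy as np

# Labels in {-1, +1} and the output of a hypothetical weak classifier (one mistake).
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])

# Weight vector D, initialized to equal values.
D = np.full(len(y_true), 1.0 / len(y_true))

# Weighted error rate of the weak classifier: 0.2 here.
error = np.sum(D[y_pred != y_true])

# Classifier weight alpha: 0.5 * ln(0.8 / 0.2) = ln 2.
alpha = 0.5 * np.log((1.0 - error) / error)

# y_true * y_pred is +1 for correct samples and -1 for wrong ones,
# so correct samples shrink and the misclassified sample grows.
D_new = D * np.exp(-alpha * y_true * y_pred)
D_new /= D_new.sum()
```

After normalization the misclassified sample carries half of the total weight (0.5), and each correctly classified sample carries 0.125.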
Like other Boosting algorithms, Gradient Boosting combines several models of mediocre performance (usually decision trees of fixed depth) into a better model. Abstractly speaking, training the model amounts to optimizing an arbitrary differentiable objective function. By repeatedly selecting a function that points in the negative gradient direction, the algorithm can be seen as optimizing the objective function in function space. Therefore, it can be said that Gradient Boosting = Gradient Descent + Boosting.
Like AdaBoost, Gradient Boosting repeatedly selects a model of mediocre performance and adjusts it each time based on the performance of the previous model. The difference is that AdaBoost locates the shortcomings of the model by increasing the weights of misclassified data points, while Gradient Boosting locates them by computing the gradient. Therefore, Gradient Boosting can use more kinds of objective functions than AdaBoost.
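A toy version of "Gradient Descent + Boosting", as a sketch under stated assumptions: the objective is squared error, for which the negative gradient is simply the residual y - F(x); the weak models are one-split regression stumps written by hand; and the data, learning rate and number of rounds are arbitrary illustrative choices:

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold regressor on 1-D x under squared error."""
    best = None
    for t in np.unique(x)[:-1]:              # skip the max so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)

# Made-up 1-D regression problem.
x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())       # start from a constant model
losses = []
for _ in range(20):
    residual = y - prediction                # negative gradient of 0.5 * (y - F)^2 w.r.t. F
    stump = fit_stump(x, residual)           # weak model that points along that direction
    prediction = prediction + learning_rate * stump(x)
    losses.append(np.mean((y - prediction) ** 2))
```

Each round adds one weak model fitted to the current gradient, and the training loss in losses never increases from one round to the next. Swapping in a different differentiable loss only changes how residual is computed.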
The difference between loc and iloc
pandas obtains a column in a dictionary-like manner: df['A'] returns column A of df. What if we want to select rows instead? There are two methods for this, iloc and loc. loc means location, and the i in iloc means integer. The differences between the two are as follows:
loc: works on labels in the index.
iloc: works on the positions in the index (so it only takes integers).
In other words, loc selects by index label. For example, since df below defines an index, loc picks the rows whose index labels match. iloc ignores the labels and selects by row number, which starts at 0 and increases by 1.
In [1]: df = DataFrame(randn(5,2),index=range(0,10,2),columns=list('AB'))
In [2]: df
Out[2]:
          A         B
0  1.068932  0.794307
2  0.470056  1.192211
4  0.284561  0.756029
6  1.037563  0.267820
8  0.538478  0.800654
In [5]: df.iloc[[2]]
Out[5]:
          A         B
4  0.284561  0.756029
In [6]: df.loc[[2]]
Out[6]:
          A         B
2  0.470056  1.192211
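The same comparison as a self-contained script, as a sketch: it uses fixed values from np.arange in place of randn so the result is reproducible, with the same index 0, 2, 4, 6, 8:

```python
import numpy as np
import pandas as pd

# Rows are [0, 1], [2, 3], [4, 5], [6, 7], [8, 9] with index labels 0, 2, 4, 6, 8.
df = pd.DataFrame(np.arange(10).reshape(5, 2),
                  index=range(0, 10, 2), columns=list("AB"))

by_position = df.iloc[[2]]   # third row by position; its label is 4
by_label = df.loc[[2]]       # row whose label is 2; it is the second row

print(by_position)
print(by_label)
```

Note that df.loc[[1]] would raise a KeyError here because no row carries the label 1, while df.iloc[[1]] is valid and returns the row labelled 2.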