Machine learning process record

Posted by visionmaster on Tue, 28 Sep 2021 06:22:46 +0200

Machine learning process record 1

# np.meshgrid generates a grid-point coordinate matrix: it turns the plane into a "mesh" so the whole region can be colored later
# np.arange() returns evenly spaced values with a fixed step over the half-open interval [start, stop)

xx, yy = np.meshgrid(np.arange(x_min, x_max, .02),
                     np.arange(y_min, y_max, .02))

This generates a flat grid over the plane so that the decision regions can be colored afterwards.

# plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Pastel1)
# cmap assigns a different color to each of the categories in Z
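A minimal runnable sketch tying meshgrid and pcolormesh together (the bounds and the toy labeling rule are assumptions made for illustration; in practice Z would come from a classifier's predict):

import numpy as np
import matplotlib.pyplot as plt

# Assumed bounds of the feature plane
x_min, x_max, y_min, y_max = 0.0, 4.0, 0.0, 3.0

# Grid of points covering the plane with step 0.02
xx, yy = np.meshgrid(np.arange(x_min, x_max, .02),
                     np.arange(y_min, y_max, .02))

# Stand-in for clf.predict: label each grid point with a toy rule
Z = (xx + yy > 3.5).astype(int)

# Color the whole plane by predicted class
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Pastel1)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()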

plt.xlim  # Set the displayed value range of the x-axis; plt.ylim does the same for y
dataSet.shape[0]
#  shape[0] returns the number of rows of dataSet: counting along axis 0 goes down the columns, giving the row count

np.tile(inX, (dataSetSize, 1)) - dataSet
 # Repeat inX dataSetSize times so that the tiled matrix has the same shape as dataSet. The first tuple element is the repeat count along the y-axis (number of rows), the second along the x-axis (number of columns). With a single integer argument, only the x-axis is repeated

Squaring a NumPy matrix squares every element: the operation is applied element-wise.

argsort returns the indices that would sort the array in ascending order. Note that these returned indices can be mapped back to the classification labels.

dict.get(key, default=None)

classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
  • key – the key to look up in the dictionary.
  • default – the value returned when the key does not exist. If the label has been counted before, its count is incremented by 1; otherwise the count starts from 0 + 1. This simplifies the code
 # key=operator.itemgetter(1) sorts by the dictionary's values
 # key=operator.itemgetter(0) sorts by the dictionary's keys
# sorted() is ascending by default; reverse=True gives descending order
# Note that sorted() returns a list of tuples: {k: v} becomes [(k, v)], so list[0][0] is the most frequent label
sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
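The fragments above follow the classic kNN classifier pattern (e.g. classify0 in Machine Learning in Action); a self-contained sketch, with toy data added for illustration:

import numpy as np
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]                      # number of training rows
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # repeat inX to match dataSet's shape
    sqDistances = (diffMat ** 2).sum(axis=1)            # squared Euclidean distances
    sortedDistIndices = sqDistances.argsort()           # indices sorted by ascending distance
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]                       # most common label among the k nearest

# Toy usage
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0.0, 0.0], group, labels, 3))  # -> 'B'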
  • fit_transform() does two things: fit learns the transformation rules from the data, then the data is standardized with those rules
  • transform() only applies an already-fitted transformation (e.g. normalization or standardization). The test data must be transformed with the same rules learned from the training data so that the resulting feature vectors are comparable; use transform() on it rather than fit_transform(), otherwise the two standardizations use different parameters and the resulting data disagree
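A short illustration of this split, using sklearn's StandardScaler (the arrays are toy examples):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit on the training data, then transform it
X_test_std = scaler.transform(X_test)        # reuse the training statistics; no second fit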

Whether a reinforcement learning algorithm computes state-transition probabilities determines whether it is model-free or model-based.

Algorithms that cannot simulate and predict how the environment changes are model-free, so Q-learning is model-free. To become model-based, Q-learning would additionally have to learn the mapping from a state-action pair to the next state.

df.mean(axis=1) actually takes the mean across all columns within each row, rather than the mean of each column. A simple mnemonic: axis=0 means "down" and axis=1 means "across", read as an adverb describing the direction of the operation.

To put it another way (a small demo follows this list):

  • A value of 0 means the method is executed down each column, along the row-index direction
  • A value of 1 means the method is executed across each row, along the column-label direction
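A two-line demo of the axis rule (the frame is a toy example):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.mean(axis=0)  # down each column -> a: 1.5, b: 3.5
df.mean(axis=1)  # across each row  -> 0: 2.0, 1: 3.0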
for element in a.flat:  # Iterate element by element, from left to right, from top to bottom
    print(element)

Q-learning updates all Q values, and then uses them to make decisions: the chosen action is the one with the largest Q value in the current state.

A Q value captures the expected return of a given action in the current state, i.e. it looks ahead into the future, which an ordinary reward function cannot do. With a well-learned Q table, good decisions follow directly; the standard update rule is sketched below.
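A minimal tabular sketch of the update and the greedy decision (the state/action counts, alpha and gamma are illustrative assumptions):

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9  # learning rate, discount factor

def q_update(s, a, r, s_next):
    # Move Q(s, a) toward the reward plus the discounted best future value
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def choose_action(s):
    # Greedy decision: the action with the largest Q value in state s
    return int(Q[s].argmax())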

Jupyter

Shortcut key operation

  • Shortcuts common to both modes

    • Shift+Enter, execute the code of this unit and jump to the next unit
    • Ctrl+Enter, execute the code of this unit and stay in this unit
  • Command mode: press ESC to enter

    • Y: switch the cell to Code mode
    • M: switch the cell to Markdown mode
    • A: add a cell above the current cell
    • B: add a cell below the current cell
  • Others (understand)

    • Press D twice to delete the current cell
    • Z: undo
    • L: toggle line numbers for the current cell
    • Ctrl+Shift+P: type a command into the dialog box and run it directly
    • Ctrl+Home: jump to the first cell
    • Ctrl+End: jump to the last cell
  • Edit mode: press Enter to enter

    • Code completion: press Tab after a variable or method
    • Comment/uncomment one or more lines of code: Ctrl+/ (Mac: Cmd+/)
  • Others (understand):

    • Multi-cursor editing: Ctrl+click (Mac: Cmd+click)
    • Undo: Ctrl+Z (Mac: Cmd+Z)
    • Redo: Ctrl+Y (Mac: Cmd+Y)

Matplotlib

  • Drawing image flow

    • 1. Create the canvas – plt.figure(figsize=(20,8))
    • 2. Draw the image – plt.plot(x, y)
    • 3. Display the image – plt.show()
    • Image saving – plt.savefig()
      Note: images must be saved before calling show (a complete walk-through follows at the end of this section)
  • Add x and y axis ticks [know]

    • plt.xticks()
    • plt.yticks()
    • Note: the first argument passed in must be numeric, not strings; if you want string labels, pass numeric positions plus the strings as tick labels
  • Add grid display [know]

    • plt.grid(linestyle="--", alpha=0.5)
  • Add description [know]

    • plt.xlabel()
    • plt.ylabel()
    • plt.title()
  • Image saving [know]

    • plt.savefig("path")
  • plot multiple times [understand]

    • Just call plt.plot() again before plt.show(); the new curve is drawn on the same axes
  • Display legend [know]

    • plt.legend(loc="best")
    • Note: a label must be set in plt.plot(); without it, the legend cannot be displayed
  • Multiple coordinate system display [understand]

    • plt.subplots(nrows=, ncols=)
  • Applications of line charts [know]

    • 1. Observing how data changes
    • 2. Drawing the images of mathematical functions
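A runnable sketch walking through the whole flow above (the data is illustrative):

import matplotlib.pyplot as plt

x = range(60)
y = [t % 10 + 15 for t in x]  # toy temperature-like data

plt.figure(figsize=(20, 8))          # 1. create the canvas
plt.plot(x, y, label="city A")       # 2. draw; label is needed for the legend
plt.xticks(range(0, 61, 5))          # tick positions must be numbers
plt.grid(linestyle="--", alpha=0.5)  # grid display
plt.xlabel("minute")
plt.ylabel("temperature")
plt.title("temperature over one hour")
plt.legend(loc="best")
plt.savefig("plot.png")              # save before show
plt.show()                           # 3. display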

Numpy

When an ndarray stores data, the data are laid out at contiguous memory addresses, which makes batch operations on the array elements faster.

This is because all elements of an ndarray have the same type, while the elements of a Python list can be of arbitrary types: the ndarray can therefore store its elements contiguously, whereas a native Python list has to follow a reference to locate each element. This also means Numpy's ndarray is less flexible than the native Python list for general-purpose use, but in scientific computing the ndarray eliminates many explicit loops, and the code is much simpler than with native lists.

Attribute name       Attribute interpretation
ndarray.shape        tuple of array dimensions
ndarray.ndim         number of array dimensions
ndarray.size         number of elements in the array
ndarray.itemsize     length of one array element in bytes
ndarray.dtype        type of the array elements
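A quick demo of these attributes (the itemsize and dtype results assume a platform where the default integer is int64):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
a.shape     # (2, 3)
a.ndim      # 2
a.size      # 6
a.itemsize  # 8 (bytes per int64 element)
a.dtype     # dtype('int64')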

1 methods of generating arrays

1.1 generate an array of 0 and 1

  • np.ones(shape, dtype)
  • np.ones_like(a, dtype)
  • np.zeros(shape, dtype)
  • np.zeros_like(a, dtype)

1.2 generate from existing arrays

1.2.1 generation method

  • np.array(object, dtype)
  • np.asarray(a, dtype)
a = np.array([[1,2,3],[4,5,6]])
# Create from an existing array
a1 = np.array(a)
# asarray behaves like a reference: no new array is actually created when the input is already an ndarray
a2 = np.asarray(a)

[Figure: the difference between np.array and np.asarray]
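The difference the figure illustrates can be shown directly: np.array copies by default, while np.asarray returns the input object itself when it is already an ndarray of the right dtype:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
a1 = np.array(a)    # a real copy
a2 = np.asarray(a)  # no copy: a2 is a itself

a[0, 0] = 100
print(a1[0, 0])  # 1   -- the copy is unaffected
print(a2[0, 0])  # 100 -- the "reference" reflects the change
print(a2 is a)   # True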

1.3 generate fixed range array

1.3.1 np.linspace (start, stop, num, endpoint)

  • Create an evenly spaced (arithmetic) array by specifying the number of samples
  • Parameters:
    • start: the starting value of the sequence
    • stop: the end value of the sequence
    • num: the number of equally spaced samples to be generated. The default value is 50
    • endpoint: whether the sequence contains the stop value. The default is True
# Generate equally spaced arrays
np.linspace(0, 100, 11)

Return result:

array([  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90., 100.])

1.3.2 np.arange(start,stop, step, dtype)

  • Create an evenly spaced array by specifying the step size
  • parameter
    • Step: step size. The default value is 1
np.arange(10, 50, 2)

Return result:

array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42,
       44, 46, 48])

1.3.3 np.logspace(start,stop, num)

  • Create a geometric (logarithmically spaced) sequence
  • Parameters:
    • num: the number of samples in the geometric sequence to generate. The default value is 50
# Generate 10^x
np.logspace(0, 2, 3)

Return result:

array([  1.,  10., 100.])

1.4 generating random arrays

1.4.1 the module to use

  • np.random module

1.4.2 normal distribution

  • np.random.randn(d0, d1, ..., dn)

    Function: return one or more sample values from standard normal distribution

  • np.random.normal(loc=0.0, scale=1.0, size=None)

    loc: float

    The mean value of this probability distribution (corresponding to the center of the whole distribution)

    scale: float

    The standard deviation of the distribution (it controls the width: the larger the scale, the flatter the curve; the smaller the scale, the taller and thinner)

    size: int or tuple of ints

    The output shape; the default None outputs a single value

  • np.random.standard_normal(size=None)

    Returns an array of samples from the standard normal distribution with the specified shape.

1.4.3 uniform distribution

  • np.random.rand(d0,d1,...,dn)
    • Returns a group of uniformly distributed numbers over [0.0, 1.0).
  • np.random.uniform(low=0.0, high=1.0, size=None)
    • Function: randomly sample from the uniform distribution over [low, high). Note that the interval is closed on the left and open on the right: it contains low but not high
    • Parameter introduction:
      • low: sampling lower bound, float type, the default value is 0;
      • high: sampling upper bound, float type, the default value is 1;
      • size: the number of output samples, of type int or tuple of ints. For example, size=(m,n,k) outputs m*n*k samples; by default a single value is output.
    • Return value: ndarray type, whose shape is consistent with that described in the parameter size.
  • np.random.randint(low,high=None,size=None,dtype='l')
    • Randomly sample from a uniform distribution to generate an integer or N-dimensional integer array,
    • Range: if high is not None, random integers are drawn from [low, high); otherwise from [0, low).
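A quick tour of the samplers above (shapes and parameters are illustrative):

import numpy as np

np.random.normal(loc=1.75, scale=1, size=(3, 2))  # normal: mean 1.75, std 1
np.random.rand(2, 3)                              # uniform over [0.0, 1.0)
np.random.uniform(-1, 1, size=5)                  # uniform over [-1, 1)
np.random.randint(0, 10, size=(2, 2))             # integers in [0, 10)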

3 shape modification

3.1 ndarray.reshape(shape, order)

  • Returns a view with the same underlying data but a different shape
  • Rows and columns are not swapped (reshape is not a transpose)
# When converting shapes, be sure to pay attention to the element matching of the array
stock_change.reshape([5, 4])
stock_change.reshape([-1,10])  # The shape becomes (2, 10); -1 means this dimension is inferred automatically from the other

3.2 ndarray.resize(new_shape)

  • Modifies the shape of the array itself, in place (the number of elements must stay the same)
  • Rows and columns are not swapped (resize is not a transpose)
stock_change.resize([5, 4])

# View modified results
stock_change.shape
(5, 4)

3.3 ndarray.T

  • Transpose of array
  • Exchange the rows and columns of the array
stock_change.T.shape
(4, 5)

4 type modification

4.1 ndarray.astype(type)

  • Returns the array after the type is modified
stock_change.astype(np.int32)

4.2 ndarray.tostring([order]) or ndarray.tobytes([order])

  • Construct Python bytes containing the raw bytes of the array's data
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[12, 3, 34], [5, 6, 7]]])
arr.tostring()
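The bytes can be turned back into an array with np.frombuffer, assuming the dtype and shape are remembered (tostring() is a deprecated alias of tobytes()):

import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[12, 3, 34], [5, 6, 7]]])
raw = arr.tobytes()
restored = np.frombuffer(raw, dtype=arr.dtype).reshape(arr.shape)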

5 array deduplication

5.1 np.unique()

temp = np.array([[1, 2, 3, 4],[3, 4, 5, 6]])
>>> np.unique(temp)
array([1, 2, 3, 4, 5, 6])

1 logic operation

# Generate data for 10 students and 5 courses
>>> score = np.random.randint(40, 100, (10, 5))

# Take out the scores of the last four students for logical judgment
>>> test_score = score[6:, 0:5]

# Logical judgment. If the score is greater than 60, it is marked as True; otherwise, it is False
>>> test_score > 60
array([[ True,  True,  True, False,  True],
       [ True,  True,  True, False,  True],
       [ True,  True, False, False,  True],
       [False,  True,  True,  True,  True]])

# Boolean assignment: set the elements that satisfy the condition to the given value - boolean indexing
>>> test_score[test_score > 60] = 1
>>> test_score
array([[ 1,  1,  1, 52,  1],
       [ 1,  1,  1, 59,  1],
       [ 1,  1, 44, 44,  1],
       [59,  1,  1,  1,  1]])

2 general judgment function

  • np.all()
# Judge whether the first two students [0:2, :] scored above 60 in every course
>>> np.all(score[0:2, :] > 60)
False
  • np.any()
# Judge whether any score of the first two students [0:2, :] is greater than 80
>>> np.any(score[0:2, :] > 80)
True

3 np.where (ternary operator)

More complex operations can be performed by using np.where

  • np.where()
# For the top four students and the first four courses, set scores greater than 60 to 1, otherwise to 0
temp = score[:4, :4]
np.where(temp > 60, 1, 0)
  • Composite logic needs to be used in combination with np.logical_and and np.logical_or
# For the top four students and the first four courses, set scores greater than 60 and less than 90 to 1, otherwise to 0
np.where(np.logical_and(temp > 60, temp < 90), 1, 0)

# For the top four students and the first four courses, set scores greater than 90 or less than 60 to 1, otherwise to 0
np.where(np.logical_or(temp > 90, temp < 60), 1, 0)

4 statistical operation

What if we want to know a student's maximum or minimum score?

4.1 statistical indicators

In the field of data mining / machine learning, the value of statistical indicators is also a way for us to analyze problems. The commonly used indicators are as follows:

  • min(a, axis)
    • Return the minimum of an array or minimum along an axis.
  • max(a, axis)
    • Return the maximum of an array or maximum along an axis.
  • median(a, axis)
    • Compute the median along the specified axis.
  • mean(a, axis, dtype)
    • Compute the arithmetic mean along the specified axis.
  • std(a, axis, dtype)
    • Compute the standard deviation along the specified axis.
  • var(a, axis, dtype)
    • Compute the variance along the specified axis.

4.2 case: statistical calculation of student achievement

When computing statistics, the meaning of axis must be checked: different Numpy APIs interpret it differently. Here, axis=0 computes statistics down each column and axis=1 across each row

# Next, for the top four students, do some statistical operations
# Compute statistics per column (per subject)
temp = score[:4, 0:5]
print("Top four students,Maximum score of each subject:{}".format(np.max(temp, axis=0)))
print("Top four students,Minimum score of each subject:{}".format(np.min(temp, axis=0)))
print("Top four students,Performance fluctuation of each subject:{}".format(np.std(temp, axis=0)))
print("Top four students,Average score of each subject:{}".format(np.mean(temp, axis=0)))

result:

Top four students,Maximum score of each subject:[96 97 72 98 89]
Top four students,Minimum score of each subject:[55 57 45 76 77]
Top four students,Performance fluctuation of each subject:[16.25576821 14.92271758 10.40432602  8.0311892   4.32290412]
Top four students,Average score of each subject:[78.5  75.75 62.5  85.   82.25]

If you need to find which student has the highest score in each subject:

  • np.argmax(temp, axis=)
  • np.argmin(temp, axis=)
print("For the top four students, the subscript of the student with the highest score in each subject:{}".format(np.argmax(temp, axis=0)))

result:

For the top four students, the subscript of the student with the highest score in each subject:[0 2 0 0 1]

Differences between np.matmul and np.dot:

Both perform matrix multiplication, but np.matmul forbids multiplying a matrix by a scalar, while np.dot allows it. For the inner product of two vectors, np.matmul is no different from np.dot.
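A small demonstration of the two differences (toy arrays):

import numpy as np

A = np.arange(4).reshape(2, 2)
v = np.array([1.0, 2.0])

np.dot(A, 2)      # fine: dot allows scalar multiplication
# np.matmul(A, 2) # would raise ValueError: matmul forbids scalars

np.dot(v, v)      # 5.0
np.matmul(v, v)   # 5.0 -- identical for the vector inner product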

Pandas

There are three data structures in Pandas: Series, DataFrame and MultiIndex (older versions used Panel for the same role).

Series is a one-dimensional data structure, DataFrame is a two-dimensional tabular data structure, and MultiIndex represents three-dimensional (hierarchical) data.

1.Series

Series is a one-dimensional, array-like data structure. It can store data of any type, such as integers, strings and floating-point numbers, and consists mainly of a set of values plus the associated index.

1.1 creation of series

# Import pandas
import pandas as pd

pd.Series(data=None, index=None, dtype=None)
  • Parameters:
    • Data: incoming data, which can be ndarray, list, etc
    • index: the index values must be unique and match the length of the data. If no index is passed in, an integer index from 0 to n-1 is created by default.
    • dtype: type of data

Create from existing data

  • Specify content, default index
pd.Series(np.arange(10))
# Operation results
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
  • Specify index
pd.Series([6.7,5.6,3,10,2], index=[1,2,3,4,5])
# Operation results
1     6.7
2     5.6
3     3.0
4    10.0
5     2.0
dtype: float64
  • Create from dictionary data
color_count = pd.Series({'red':100, 'blue':200, 'green': 500, 'yellow':1000})
color_count
# Operation results
blue       200
green      500
red        100
yellow    1000
dtype: int64

1.2 Series properties

To make it easier to manipulate the index and the data of a Series object, Series provides two attributes: index and values

  • index
color_count.index

# result
Index(['blue', 'green', 'red', 'yellow'], dtype='object')
  • values
color_count.values

# result
array([ 200,  500,  100, 1000])

You can also retrieve data by integer position:

color_count[2]

# result
100

2.DataFrame

DataFrame is an object similar to a two-dimensional array or table (such as Excel), with both a row index and a column index

  • The row index distinguishes the rows; it is the horizontal index, called index, also axis 0 (axis=0)
  • The column index labels the columns; it is the vertical index, called columns, also axis 1 (axis=1)

2.1 creation of dataframe

# Import pandas
import pandas as pd

pd.DataFrame(data=None, index=None, columns=None)
  • Parameters:
    • index: row labels. If no index parameter is passed in, an integer index from 0 to N-1 is created by default.
    • columns: column labels. If no columns parameter is passed in, an integer index from 0 to N-1 is created by default.
  • Create from existing data

Example 1:

pd.DataFrame(np.random.randn(2,3))

Example 2: create a student transcript

# Generate data for 10 students and 5 courses
score = np.random.randint(40, 100, (10, 5))

# result
array([[92, 55, 78, 50, 50],
       [71, 76, 50, 48, 96],
       [45, 84, 78, 51, 68],
       [81, 91, 56, 54, 76],
       [86, 66, 77, 67, 95],
       [46, 86, 56, 61, 99],
       [46, 95, 44, 46, 56],
       [80, 50, 45, 65, 57],
       [41, 93, 90, 41, 97],
       [65, 83, 57, 57, 40]])

However, in this raw form it is hard to tell what kind of data is stored, and the readability is poor!

Question: how to make the data more meaningful?

# Using data structures in Pandas
score_df = pd.DataFrame(score)
  • Add row and column index
# Construct the column index sequence (subjects)
subjects = ["chinese", "mathematics", "English", "Politics", "Sports"]

# Construct the row index sequence (students)
stu = ['classmate' + str(i) for i in range(score_df.shape[0])]

# Add row and column indexes
data = pd.DataFrame(score, columns=subjects, index=stu)

2.2 DataFrame properties

  • shape
data.shape

# result
(10, 5)
  • index

Row index list of DataFrame

data.index

# result
Index(['classmate0', 'classmate1', 'classmate2', 'classmate3', 'classmate4', 'classmate5', 'classmate6', 'classmate7', 'classmate8', 'classmate9'], dtype='object')
  • columns

Column index list for DataFrame

data.columns

# result
Index(['chinese', 'mathematics', 'English', 'Politics', 'Sports'], dtype='object')
  • values

Get the values directly as an array

data.values

array([[92, 55, 78, 50, 50],
       [71, 76, 50, 48, 96],
       [45, 84, 78, 51, 68],
       [81, 91, 56, 54, 76],
       [86, 66, 77, 67, 95],
       [46, 86, 56, 61, 99],
       [46, 95, 44, 46, 56],
       [80, 50, 45, 65, 57],
       [41, 93, 90, 41, 97],
       [65, 83, 57, 57, 40]])
  • T

Transpose

data.T
  • head(5): displays the first 5 rows

If no argument is given, the default is 5 rows; with an argument n, the first n rows are shown

data.head(5)
  • tail(5): displays the last 5 rows

If no argument is given, the default is 5 rows; with an argument n, the last n rows are shown

data.tail(5)

2.3 setting the DataFrame index

2.3.1 modifying row and column index values

stu = ["student_" + str(i) for i in range(score_df.shape[0])]

# It must be modified as a whole
data.index = stu

Note: the following modification methods are wrong

# Error modification method
data.index[3] = 'student_3'

2.3.2 reset index

  • reset_index(drop=False)
    • Set new subscript index
    • drop: the default is False, which keeps the old index as a regular column; if True, the old index values are deleted
# Reset index, drop=False
data.reset_index()

2.3.3 set a column value as a new index

  • set_index(keys, drop=True)
    • keys: a column index name, or a list of column index names
    • drop: boolean, default True: delete the column(s) used as the new index

Set new index case

1. Create

df = pd.DataFrame({'month': [1, 4, 7, 10],
                    'year': [2012, 2014, 2013, 2014],
                    'sale':[55, 40, 84, 31]})

   month  sale  year
0  1      55    2012
1  4      40    2014
2  7      84    2013
3  10     31    2014

2. Set new index by month

df.set_index('month')
       sale  year
month
1      55    2012
4      40    2014
7      84    2013
10     31    2014

3. Set multiple indexes to year and month

df = df.set_index(['year', 'month'])
df
            sale
year  month
2012  1     55
2014  4     40
2013  7     84
2014  10    31

Note: through the setting just now, the DataFrame becomes a DataFrame with MultiIndex.

3.MultiIndex and Panel

3.1 MultiIndex

MultiIndex represents three-dimensional data;

a multi-level index (also called a hierarchical index) is an important pandas feature: Series and DataFrame objects can carry two or more index levels.

3.1.1 characteristics of multiindex

Print the row index result of df just now

df.index

MultiIndex(levels=[[2012, 2013, 2014], [1, 4, 7, 10]],
           labels=[[0, 2, 1, 2], [0, 1, 2, 3]],
           names=['year', 'month'])

Multi level or hierarchical index objects.

  • index attributes
    • names: the name of each level
    • levels: the values of each level
df.index.names
# FrozenList(['year', 'month'])

df.index.levels
# FrozenList([[2012, 2013, 2014], [1, 4, 7, 10]])

3.1.2 creation of multiindex

arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))

# result
MultiIndex(levels=[[1, 2], ['blue', 'red']],
           codes=[[0, 0, 1, 1], [1, 0, 1, 0]],
           names=['number', 'color'])

3.2 Panel

3.2.1 panel creation

  • class pandas.Panel(data=None, items=None, major_axis=None, minor_axis=None)
    • Function: store Panel structure of 3D array
    • Parameters:
      • Data: ndarray or dataframe
      • items: index or array like object, axis=0
      • major_axis: index or array like object, axis=1
      • minor_axis: index or array like object, axis=2
p = pd.Panel(data=np.arange(24).reshape(4,3,2),
                 items=list('ABCD'),
                 major_axis=pd.date_range('20130101', periods=3),
                 minor_axis=['first', 'second'])

# result
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: A to D
Major_axis axis: 2013-01-01 00:00:00 to 2013-01-03 00:00:00
Minor_axis axis: first to second

3.2.2 viewing panel data

p[:,:,"first"]
p["B",:,:]

Note: Panel has been deprecated since pandas version 0.20.0: the recommended way to represent 3D data is with a MultiIndex on a DataFrame

4 Summary

  • Advantages of pandas [understand]
    • Enhanced readability of tabular data
    • Convenient data-handling capability
    • Easy file reading and writing
    • Encapsulates plotting from Matplotlib and computation from Numpy
  • Series [know]
    • Creation
      • pd.Series([], index=[])
      • pd.Series({})
    • Attributes
      • Object.index
      • Object.values
  • DataFrame [master]
    • Creation
      • pd.DataFrame(data=None, index=None, columns=None)
    • Attributes
      • shape – shape
      • index – row index
      • columns – column index
      • values – view the values
      • T – transpose
      • head() – view the first rows
      • tail() – view the last rows
    • DataFrame index
      • Must be modified globally, as a whole
      • Object.reset_index()
      • Object.set_index(keys)
  • MultiIndex and Panel [understand]
    • MultiIndex:
      • Similar to a 3D array in ndarray
      • Creation:
        • pd.MultiIndex.from_arrays()
      • Attributes:
        • Object.index
    • Panel:
      • pd.Panel(data, items, major_axis, minor_axis)
      • To view Panel data, index into it to get a DataFrame or Series

1 index operation

In Numpy we covered selecting data with indexes and slices. pandas supports similar operations; you can also use column names and row labels directly, slice them, or even combine the styles.

1.1 direct use of the row and column index (column first, row second)

Get the value of 'open' on 2018-02-27

# Use the column and row index names directly (column first, row second)
data['open']['2018-02-27']
23.53
23.53

# Unsupported operation
# error
data['2018-02-27']['open']
# error
data[:1, :2]

1.2 combining indexes with loc or iloc

Get 'open' for the rows from '2018-02-27' back to '2018-02-22'

# Using loc: only the name of the row column index can be specified
data.loc['2018-02-27':'2018-02-22', 'open']

2018-02-27    23.53
2018-02-26    22.80
2018-02-23    22.88
Name: open, dtype: float64

# Using iloc, you can get it through the index subscript
# Obtain the data of the first 3 days and the results of the first 5 columns
data.iloc[:3, :5]

            open    high    close    low
2018-02-27    23.53    25.88    24.16    23.53
2018-02-26    22.80    23.78    23.53    22.80
2018-02-23    22.88    23.37    22.82    22.71

2 assignment operation

Reassign the close column in the DataFrame to 1

# Modify the original value directly
data['close'] = 1
# or
data.close = 1

3 sorting

There are two forms of sorting: sorting by the index and sorting by the content

3.1 DataFrame sorting

  • Use df.sort_values(by=, ascending=)
    • Sort by a single key or by multiple keys
    • Parameters:
      • by: the key(s) to sort by
      • ascending: sort direction, ascending by default
        • ascending=False: descending
        • ascending=True: ascending
# Sort by the opening price; ascending specifies the direction
data.sort_values(by="open", ascending=True).head()
# Sort by multiple keys
data.sort_values(by=['open', 'high'])
  • Use df.sort_index() to sort by the index

The date index of this stock data originally ran from newest to oldest; now it is reordered from oldest to newest

# Sort indexes
data.sort_index()

3.2 Series sorting

  • Use series.sort_values(ascending=True)

When sorting a Series there is only one column, so no by parameter is required

data['p_change'].sort_values(ascending=True).head()

2015-09-01   -10.03
2015-09-14   -10.02
2016-01-11   -10.02
2015-07-15   -10.02
2015-08-26   -10.01
Name: p_change, dtype: float64
  • Use series.sort_index()

Same as for a DataFrame

# Sort indexes
data['p_change'].sort_index().head()

2015-03-02    2.62
2015-03-03    1.44
2015-03-04    1.57
2015-03-05    2.02
2015-03-06    8.51
Name: p_change, dtype: float64

4 Summary

  • 1. Indexing [master]
    • Direct indexing – column first, row second, using the index strings
    • loc – row first, column second; indexes by label (string)
    • iloc – row first, column second; indexes by integer position
    • ix – row first, column second; allowed mixing the two styles above (deprecated)
  • 2. Assignment [know]
    • data[""] = **
    • data.column_name = **
  • 3. Sort [know]
    • dataframe
      • Object.sort_values()
      • Object.sort_index()
    • series
      • Object.sort_values()
      • Object.sort_index()

1 arithmetic operation

  • add(other)

For example, add a specific number to a column as a mathematical operation

data['open'].add(1)

2018-02-27    24.53
2018-02-26    23.80
2018-02-23    23.88
2018-02-22    23.25
2018-02-14    22.49
  • sub(other)

2 logic operation

2.1 logical operation symbols

  • For example, filter the rows of data where data["open"] > 23
    • data["open"] > 23 returns a boolean result
data["open"] > 23

2018-02-27     True
2018-02-26    False
2018-02-23    False
2018-02-22    False
2018-02-14    False
# The results of logical judgment can be used as the basis for screening
data[data["open"] > 23].head()
  • Combine multiple logical conditions:
data[(data["open"] > 23) & (data["open"] < 24)].head()

2.2 logic operation function

  • query(expr)
    • expr: query string

query makes the process more convenient and simple

data.query("open<24 & open>23").head()
  • isin(values)

For example, judge whether 'open' is 23.53 or 23.85

# You can specify a value to make a judgment, so as to filter
data[data["open"].isin([23.53, 23.85])]

3 statistical operation

3.1 describe

Comprehensive analysis: many statistical results can be obtained directly, such as count, mean, std, min, max, etc

# Calculate the mean, standard deviation, maximum and minimum values
data.describe()

When a single statistical function is used, the default axis is still the columns ("columns", axis=0). To compute along the rows instead, specify axis=1 ("index")

  • max(), min()

# Use the statistical function: 0 represents the column calculation result, and 1 represents the row calculation result
data.max(0)

1 pandas.DataFrame.plot

  • DataFrame.plot(kind='line')
  • kind: str, the type of plot to draw
    • 'line' : line plot (default)
    • 'bar' : vertical bar plot
    • 'barh' : horizontal bar plot
      • Interpretation of "barh":
      • http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.barh.html
    • 'hist' : histogram
    • 'pie' : pie plot
    • 'scatter' : scatter plot

1 CSV

1.1 read_csv

  • pandas.read_csv(filepath_or_buffer, sep =',', usecols )
    • filepath_or_buffer: file path
    • sep: separator, separated by "," by default
    • usecols: Specifies the column name to read, in list form
  • Example: read the data of previous stocks
# Read the file and specify that only 'open' and 'close' indicators are obtained
data = pd.read_csv("./data/stock_day.csv", usecols=['open', 'close'])

            open    close
2018-02-27    23.53    24.16
2018-02-26    22.80    23.53
2018-02-23    22.88    22.82
2018-02-22    22.25    22.28
2018-02-14    21.49    21.92

1.2 to_csv

  • DataFrame.to_csv(path_or_buf=None, sep=', ', columns=None, header=True, index=True, mode='w', encoding=None)
    • path_or_buf: file path
    • sep: separator, separated by "," by default
    • columns: select the desired column index
    • header: boolean or list of strings, default True: whether to write out the column names
    • index: whether to write out the row index
    • mode: 'w' overwrite, 'a' append
  • Example: save the read stock data
    • Save the data in the 'open' column, and then read and view the results
# Select 10 rows of data to save for easy observation
data[:10].to_csv("./data/test.csv", columns=['open'])
# Read and view results
pd.read_csv("./data/test.csv")

     Unnamed: 0    open
0    2018-02-27    23.53
1    2018-02-26    22.80
2    2018-02-23    22.88
3    2018-02-22    22.25
4    2018-02-14    21.49
5    2018-02-13    21.40
6    2018-02-12    20.70
7    2018-02-09    21.20
8    2018-02-08    21.79
9    2018-02-07    22.69

You will find that the index was stored in the file and became a separate column of data. If you don't want that, pass the index parameter when saving and rewrite the file.

# index=False: do not turn the index values into a column when saving
data[:10].to_csv("./data/test.csv", columns=['open'], index=False)

1 how to deal with NaN

  • Determine how missing values are marked (NaN or some other marker)
  • If the missing value is marked with NaN
    • Judge whether NaN is included in the data:
      • pd.isnull(df),
      • pd.notnull(df)
    • Handling the missing value NaN:
      • 1. Drop rows with missing values: dropna(axis='rows')
        • Note: the original data is not modified; the return value must be captured
      • 2. Replace missing values: fillna(value, inplace=True)
        • value: the value to fill in with
        • inplace=True: the original data is modified; False: the original data is left untouched and a new object is returned
  • If missing values are marked with something other than NaN, such as "?"
    • First replace the "?" with np.nan, then continue processing as above

2.2 missing values exist and are marked as np.nan

  • 1. Delete

pandas deletes missing values with dropna; the precondition is that the missing values are of type np.nan

# Do not modify original data
movie.dropna()

# You can define a new variable to accept or use the original variable name
data = movie.dropna()
  • 2. Replace missing values
# Replace the missing values in a single column
# Fill with the mean or the median
# movie['Revenue (Millions)'].fillna(movie['Revenue (Millions)'].mean(), inplace=True)

Replace all missing values:

for i in movie.columns:
    if not np.all(pd.notnull(movie[i])):  # column i contains at least one NaN
        print(i)
        movie[i].fillna(movie[i].mean(), inplace=True)

Processing approach:

  • 1. First replace the '?' with np.nan
    • df.replace(to_replace=, value=)
      • to_replace: the value to be replaced
      • value: the replacement value
# Replace the missing values marked by some other values with np.nan
wis = wis.replace(to_replace='?', value=np.nan)
  • 2. Then process the missing values
# delete
wis = wis.dropna()
  • pd.qcut(data, q):
    • Group the data into q quantile-based bins; usually used together with value_counts to count the size of each group
  • series.value_counts(): counts the occurrences per group

User-defined interval grouping (a small demo of both follows):

  • pd.cut(data, bins)
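A quick sketch of both grouping APIs together with value_counts (the series is a toy example):

import pandas as pd

s = pd.Series([1, 7, 5, 4, 6, 3, 9, 8, 2, 10])

pd.qcut(s, 2).value_counts()               # two quantile bins with (roughly) equal counts
pd.cut(s, bins=[0, 5, 10]).value_counts()  # user-defined interval edges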

pd.concat implements data concatenation

  • pd.concat([data1, data2], axis=1)
    • Concatenate by rows or by columns: axis=0 stacks along the rows (vertically), axis=1 joins along the columns (horizontally)

pd.merge

  • pd.merge(left, right, how='inner', on=None)
    • Joins the left and right data sets on their common key columns; rows are kept or dropped according to the join type
    • left: a DataFrame
    • right: another DataFrame
    • on: the common key column(s) to join on
    • how: the type of join (a small sketch of both APIs follows)
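A minimal sketch of both merging APIs (the frames are toy examples):

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})

pd.concat([left, right], axis=0)              # stack the rows vertically
pd.concat([left, right], axis=1)              # put them side by side, column-wise
pd.merge(left, right, how='inner', on='key')  # join on the common key column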

kernel function

Mapping data to a high-dimensional representation can simplify a classification problem, but computing that representation directly is expensive; this is where the kernel trick, from which kernel methods take their name, comes in.
The basic idea: to find a good decision hyperplane in the new representation space, you do not need to compute the coordinates of the points in that space directly, only the distances between pairs of points there, and this can be done efficiently with a kernel function. A kernel function is a computationally tractable operation that maps any two points of the original space to the distance between them in the target representation space, completely avoiding the explicit computation of the new representation. Kernel functions are usually chosen by hand rather than learned from the data; for an SVM, only the separating hyperplane is learned.
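A minimal sketch of the idea using the RBF (Gaussian) kernel, which yields the similarity two points would have in the implicit high-dimensional space from nothing but their distance in the original space (gamma is a user-chosen parameter):

import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # Inner product of the two points' implicit high-dimensional images,
    # computed entirely from their squared distance in the original space
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.0])
rbf_kernel(x, y)  # similarity in the implicit feature space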

Before deep learning, people had to spend a lot of time on feature engineering, i.e. processing the data so that the input was better suited to the chosen method

Dropout

Dropout: during forward propagation, the activation of each neuron is set to zero with a certain probability p. This makes the model generalize better, because it cannot rely too heavily on particular local features.
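A common way to implement this is "inverted dropout": zero activations with probability p during training and rescale the survivors so the expected activation is unchanged; a numpy sketch:

import numpy as np

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations  # no-op at inference time
    mask = np.random.rand(*activations.shape) >= p  # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)           # rescale so the expected value matches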

convolution

The patterns learned by a convolutional neural network are translation-invariant. After learning a pattern in the lower-right corner of an image, the network can recognize it anywhere, for example in the upper-left corner. A densely connected network would have to relearn the pattern if it appeared in a new location. This makes convolutional networks data-efficient when processing images (the visual world is fundamentally translation-invariant): they need fewer training samples to learn representations that generalize.

word2vec

word2vec can be understood as dimensionality reduction applied to one-hot word vectors: an n-dimensional one-hot vector is transformed through a mapping into an m-dimensional real-valued vector (the points on the original axes are compressed and embedded into a more compact space). Because of how a one-hot vector behaves in matrix multiplication, row k of the n*m matrix that represents the mapping is exactly the vector of the k-th word in the corpus.

There are two main ways to train word vectors with this compress-and-embed approach (a shape-level sketch follows the two descriptions):

Skip-gram neural network training model: a fully connected network with one hidden layer and no activation function in that layer; the output layer uses a softmax classifier to output probabilities. The input is a word, the output is the probability of each vocabulary word being in the input word's context, and the training target is a word actually found in that context.

CBOW: the principle is similar to Skip-gram, but the input is the context and the output is the center word of that context.
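A shape-level sketch of the Skip-gram forward pass described above (vocabulary size n and embedding size m are illustrative; the weights are random stand-ins, not trained):

import numpy as np

n, m = 10, 4                   # vocabulary size, embedding dimension
W_in = np.random.randn(n, m)   # n x m mapping matrix: row k is word k's vector
W_out = np.random.randn(m, n)  # hidden layer -> output scores

word_idx = 3                   # the one-hot input collapses to a row lookup
hidden = W_in[word_idx]        # no activation function in the hidden layer
scores = hidden @ W_out
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over context-word probabilities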

AdaBoost

The "Ada" in AdaBoost stands for adaptive.

The procedure is as follows: each sample in the training data is given a weight, and these weights form a vector D. Initially all weights are set equal. A weak classifier is first trained on the data and its error rate is computed; then a weak classifier is trained again on the same data set. For this second round the sample weights are readjusted: the weights of samples the first classifier got right are decreased, while the weights of samples it got wrong are increased. To combine all the weak classifiers into a final classification, AdaBoost assigns each classifier a weight alpha, computed from that classifier's error rate (a sketch follows).
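A compact sketch of this loop, using decision stumps via sklearn (data, the number of rounds, and the epsilon guard are illustrative assumptions):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(200, 2)
y = np.where(X[:, 0] + X[:, 1] > 1, 1, -1)  # toy labels in {-1, +1}

D = np.full(len(y), 1 / len(y))  # sample weights, initially all equal
stumps, alphas = [], []
for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    pred = stump.predict(X)
    err = D[pred != y].sum()                           # weighted error rate
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # classifier weight from its error
    D *= np.exp(-alpha * y * pred)                     # raise weights of misclassified samples
    D /= D.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final decision: sign of the alpha-weighted vote of all weak classifiers
final = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))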

Like other Boosting algorithms, Gradient Boosting combines several models of modest performance (typically fixed-depth decision trees) into a better model. Abstractly, training the model is the optimization of an arbitrary differentiable objective function: by repeatedly selecting a function that points in the negative gradient direction, the algorithm can be viewed as optimizing the objective in function space. Hence Gradient Boosting = Gradient Descent + Boosting.

Like AdaBoost, Gradient Boosting repeatedly selects a model of modest performance and each time adjusts it based on the previous model's performance. The difference is that AdaBoost identifies the model's shortcomings by up-weighting misclassified data points, while Gradient Boosting identifies them by computing gradients. Consequently, Gradient Boosting can use more kinds of objective functions than AdaBoost (a sketch follows).
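The "fit the negative gradient" idea fits in a few lines: with squared error, the negative gradient is just the residual, so each fixed-depth tree is fit to what the previous ensemble got wrong. A sketch using sklearn's DecisionTreeRegressor (learning rate, depth and data are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(200, 1)
y = np.sin(4 * X[:, 0]) + 0.1 * np.random.randn(200)

lr, trees, pred = 0.1, [], np.zeros_like(y)
for _ in range(50):
    residual = y - pred                      # negative gradient of the squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += lr * tree.predict(X)             # step in the negative gradient direction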

The difference between loc and iloc

pandas retrieves a column in a dictionary-like manner: df['A'] returns column A of df. But what if we are interested in particular rows? There are two methods for that: iloc and loc. loc means location, and the i in iloc means integer. The differences between the two are as follows:

loc: works on labels in the index.
iloc: works on the positions in the index (so it only takes integers).

In other words, loc selects rows by index label. If df below defines an index, loc looks rows up by those index values. iloc does not use the index labels but the row number, which starts at 0 and increases one by one.

In [1]: from pandas import DataFrame
   ...: from numpy.random import randn
   ...: df = DataFrame(randn(5,2), index=range(0,10,2), columns=list('AB'))

In [2]: df
Out[2]: 
          A         B
0  1.068932 -0.794307
2 -0.470056  1.192211
4 -0.284561  0.756029
6  1.037563 -0.267820
8 -0.538478 -0.800654

In [5]: df.iloc[[2]]
Out[5]: 
          A         B
4 -0.284561  0.756029

In [6]: df.loc[[2]]
Out[6]: 
          A         B
2 -0.470056  1.192211

Topics: Python Machine Learning linear algebra