Reference link: http://blog.sina.com.cn/s/blog_77f476ef0102xid5.html
1 Import Statements
2 File Reading
3 Data Preprocessing
4 Data Selection
5 Data Operations and Sorting
6 Mathematical Statistics
1 Import Statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import re
2 File Reading
df = pd.read_csv('file.csv')
Parameters: header=None the file has no header row; columns get the default integer names 0, 1, 2, 3...
names=['A', 'B', 'C', ...] custom column names
index_col='A' | ['A', 'B', ...] use the named column(s) as the index; pass a list for a MultiIndex
skiprows=[0, 1, 2] line numbers to skip, counted from the top of the file starting at 0; skipfooter counts from the end of the file
nrows=N read only the first N rows
chunksize=M returns an iterator of type TextFileReader that yields M rows per step, for data too large to hold in memory
sep=':' field separator; the default is ','. Choose one to match the file; with sep=None the parser tries to detect it automatically
skip_blank_lines=False default is True, which skips blank lines; if False they are kept and filled with NaN
converters={'col1': func} convert the selected columns with function func; commonly used to keep ID-like number columns as strings (avoiding conversion to int)
dfjs = pd.read_json('file.json') a string in JSON format can also be passed in
dfex = pd.read_excel('file.xls', sheet_name=[0, 1, ...]) read multiple sheets; returns a dict of DataFrames
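A minimal sketch of how several of these parameters combine (the file name, column names, and chunk size here are hypothetical):

# hypothetical file with no header row and three columns
df = pd.read_csv('file.csv',
                 names=['A', 'B', 'C'],    # custom column names
                 index_col='A',            # use column A as the index
                 nrows=100,                # read only the first 100 rows
                 converters={'B': str})    # keep B as strings (e.g. ID numbers)

# chunked reading for files too large for memory: each chunk is a DataFrame
for chunk in pd.read_csv('file.csv', chunksize=10000):
    print(chunk.shape)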
3 Data Preprocessing
df.duplicated() Returns a boolean Series telling whether each row duplicates an earlier row
df.drop_duplicates() Drop duplicate rows; to consider only certain columns, pass ['col1', 'col2', ...]
df.fillna(0) Fill NA values with the number 0
df.dropna() axis=0|1 0-drop rows, 1-drop columns
how='all'|'any' all-drop only when every value is NA; any-drop when any value is NA
del df['col1'] Delete a column directly
df.drop(['col1', ...], axis=1) Drop the specified columns (or rows with axis=0)
df.columns = col_lst Replace the column names
df.rename(index={'row1': 'A'},
          columns={'col1': 'A1'}) Rename index labels and column names
df.replace(dict) Replace values in the df; the old and new values are given as a dict, e.g. {1: 'A', '2': 'B'}
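A short sketch of the cleaning calls above, on a small hypothetical frame:

df = pd.DataFrame({'col1': [1, 1, np.nan], 'col2': ['a', 'a', 'b']})
df = df.drop_duplicates()               # the second row duplicates the first
df['col1'] = df['col1'].fillna(0)       # replace NaN with 0
df = df.rename(columns={'col1': 'A1'})  # rename a column
df = df.replace({'a': 'A', 'b': 'B'})   # map old values to new ones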
def get_digits(s):
    # extract the leading numeric part (integer or decimal) of a string
    m = re.match(r'(\d+(\.\d+)?)', s)
    if m is not None:
        return float(m.group(1))
    else:
        return 0
df.applymap(get_digits) Applies the function to every element, keeping only the numeric part; DataFrame.apply(func) works on whole columns (or rows) instead
df['col1'].map(func) Series.map, transform a single column with the function
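For example, applied per column with map (the column of price strings is hypothetical):

prices = pd.DataFrame({'p': ['12.5 yuan', '30 yuan']})
prices['p'] = prices['p'].map(get_digits)   # -> 12.5, 30.0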
pd.merge(df1, df2, on='col1',
         how='inner', sort=True) Merge two DataFrames on a shared column: 'inner' keeps the intersection, 'outer' the union; sort=True sorts the result
pd.merge(df1, df2, left_on='col1',
         right_on='col2') When df1 and df2 have no common column name, specify the key column on each side
pd.concat([sr1, sr2, sr3, ...], axis=0) Stack multiple Series vertically; the result is still a Series
pd.concat([sr1, sr2, sr3, ...], axis=1) Combine multiple Series side by side into a DataFrame; the indexes are unioned, and positions outside the intersection are filled with NaN
df1.combine_first(df2) Patch the NaN values of df1 with data from df2; if df2 has extra rows, they are appended
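A merge/concat sketch with two hypothetical frames:

left = pd.DataFrame({'col1': [1, 2], 'x': ['a', 'b']})
right = pd.DataFrame({'col1': [2, 3], 'y': ['c', 'd']})
pd.merge(left, right, on='col1', how='inner')   # only the shared key col1 == 2 survives
pd.merge(left, right, on='col1', how='outer')   # union of keys, NaN where a side is missing
pd.concat([left, right], axis=0)                # stack rows; columns are unioned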
df.stack() Rotates columns into rows: the column names become the innermost index level and the original index becomes a MultiIndex, yielding a Series; effectively lengthens the data set
df.unstack() Converts a Series with a MultiIndex into a DataFrame; effectively widens the data set by pulling an index level with few categories out as columns
df.pivot() In effect an application of unstack: reshapes a long data set into wide form
pd.get_dummies(df['col1'], prefix='key') One-hot encoding: a column with k distinct values (usually strings, such as countries) is expanded, bitmap-style, into k columns of 0s and 1s
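A reshaping sketch (the data is hypothetical):

wide = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r1', 'r2'])
long = wide.stack()    # Series with MultiIndex (r1, a), (r1, b), (r2, a), (r2, b)
long.unstack()         # back to the original wide DataFrame

countries = pd.Series(['CN', 'US', 'CN'])
pd.get_dummies(countries, prefix='key')   # 0/1 indicator columns key_CN, key_US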
4 Data Selection
df.columns The column labels, returned as an Index
df.index The index labels, returned as an Index
df.shape Returns a tuple: (rows, columns)
df.head(n=N) Returns the first N rows
df.tail(n=M) Returns the last M rows
df.values The values as a 2-D numpy.ndarray
df.index The DataFrame index; its elements cannot be modified by direct assignment
df.reindex(index=['row1', 'row2', ...],
           columns=['col1', 'col2', ...]) Reorder to conform to the new index
df[m:n] Slice: selects rows m to n-1
df[df['col1'] > 1] Select the rows that satisfy the condition
df.query('col1 > 1') Select the rows that satisfy the condition
df.query('col1 == [v1, v2, ...]')
df.loc[:, 'col1'] Select a column (the old .ix accessor is deprecated)
df.loc['row1', 'col2'] Select a single element
df.loc[:, :'col2'] Slice all columns up to and including col2
df.loc[m:n] Rows m through n by label, inclusive (recommended)
df.iloc[m:n] Rows m to n-1 by position
df.loc[m:n-1, 'col1':'coln'] Rows m through n-1 and columns col1 through coln
sr = df['col'] Take one column; returns a Series
sr.values The Series values as a numpy.ndarray
sr.index The Series index as an Index object
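A selection sketch on a hypothetical frame:

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
df[df['col1'] > 1]       # boolean filtering: rows where col1 > 1
df.query('col1 > 1')     # the same rows via query
df.loc[:, 'col1']        # one column by label
df.iloc[0:2]             # the first two rows by position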
5 Data Operations and Sorting
df.T Transpose of the DataFrame
df1 + df2 Add with alignment on index and columns; the union of labels is kept, filled with NaN where one side is missing
df1.add(df2, fill_value=0) Fill the missing positions with 0 instead
df1.add/sub/mul/div Method forms of the four arithmetic operations
df - sr Subtract the Series from every row of the DataFrame at once
df * N Multiply every element by N
df.add(sr, axis=0) Apply the Series to every column at once, aligning on the row index
sr.sort_values() Sort a Series in ascending order (the old sr.order() has been removed)
df.sort_index(axis=0, ascending=True) Sort by the row index, ascending
df.sort_values(by=['col1', 'col2', ...]) Sort by the specified columns in priority order (replaces the old sort_index(by=...))
df.rank() Compute the rank of each value
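A sketch of alignment and sorting (the data is hypothetical):

df = pd.DataFrame({'col1': [3, 1, 2]}, index=[2, 0, 1])
sr = pd.Series([10, 20, 30], index=[0, 1, 2])
df.add(sr, axis=0)          # align sr with the row index before adding
df.sort_index()             # sort by the row index
df.sort_values(by='col1')   # sort by column values
df['col1'].rank()           # rank of each value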
6 Mathematical Statistics
sr.unique() The distinct values of a Series
sr.value_counts() Frequency counts for a Series, sorted in descending order (older pandas versions have no DataFrame equivalent)
sr.describe() Returns basic statistics and quantiles
df.describe() Returns basic statistics and quantiles, per column
df.count() Count of non-NA values
df.max() Maximum value
df.min() Minimum value
df.sum(axis=0) Sum by column
df.mean() Mean by column
df.median() Median
df.var() Variance
df.std() Standard deviation
df.mad() Mean absolute deviation from the mean
df.cumsum() Cumulative sum
sr1.corr(sr2) Correlation coefficient between two Series
df.cov() Covariance matrix
df1.corrwith(df2) Pairwise correlation coefficients
pd.cut(array1, bins) Bin one-dimensional data into intervals
pd.qcut(array1, 4) Bin by quantiles; 4 can be replaced with a custom list of quantiles
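For example, on a hypothetical Series:

sr = pd.Series([1, 2, 2, 3, 4, 5])
sr.value_counts()               # frequency of each value, largest first
sr.describe()                   # count, mean, std, min, quartiles, max
pd.cut(sr, bins=[0, 2, 4, 6])   # which fixed interval each value falls in
pd.qcut(sr, 4)                  # quartile bins with roughly equal counts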
df['col1'].groupby(df['col2']) Group column 1 by column 2, i.e. with column 2 as the key
df.groupby('col1') Group the DataFrame by column 1
grouped.aggregate(func) Aggregate each group with the given function
grouped.aggregate([f1, f2, ...]) Aggregate with several functions; each becomes a result column named after the function
grouped.aggregate([('f1_name', f1), ('f2_name', f2)]) Rename the aggregated columns
grouped.aggregate({'col1': f1, 'col2': f2, ...}) Apply different functions to different columns; each value may also be a list of functions
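A grouping sketch (the frame is hypothetical):

df = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': [1, 2, 3]})
grouped = df.groupby('col1')
grouped.aggregate(np.sum)                     # one function per group
grouped['col2'].aggregate([np.mean, np.sum])  # several functions -> several columns
grouped.aggregate({'col2': np.max})           # a different function per column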
df.pivot_table(['col1', 'col2'],
               index=['row1', 'row2'],
               aggfunc=[np.mean, np.sum],
               fill_value=0,
               margins=True) Group by row1 and row2 and aggregate col1 and col2; several aggregation functions may be given, fill_value substitutes a value for missing results, and margins=True adds totals (the old rows= parameter is now index=)
pd.crosstab(df['col1'], df['col2']) Cross-tabulation: counts the frequency of each group combination
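A pivot_table/crosstab sketch (the frame is hypothetical):

df = pd.DataFrame({'row1': ['x', 'x', 'y'],
                   'col1': [1, 2, 3],
                   'col2': [4, 5, 6]})
df.pivot_table(['col1', 'col2'], index='row1',
               aggfunc=[np.mean, np.sum], fill_value=0, margins=True)
pd.crosstab(df['row1'], df['col1'])   # frequency of each (row1, col1) pair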