(I) pandas Basics
1. Introduction to pandas
Python Data Analysis Library
pandas is a data analysis library built on NumPy. It bundles a number of supporting libraries and standard data models, and provides the tools needed to work with large structured data sets efficiently.
2. pandas core data structures
A data structure is the way a computer stores and organizes data. A well-chosen data structure generally gives better run-time or storage efficiency, and it usually goes hand in hand with efficient retrieval algorithms and indexing techniques.
(1) Series
A Series can be understood as a one-dimensional array whose index labels you can set yourself. It behaves like a fixed-length, ordered dictionary: every entry has an index label and a value.
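    import pandas as pd
    import numpy as np

    # Create an empty Series (an explicit dtype avoids a warning on some pandas versions)
    s = pd.Series(dtype=float)

    # Create a Series from an ndarray, first with the default index, then with custom labels
    data = np.array(['a', 'b', 'c', 'd'])
    s = pd.Series(data)
    s = pd.Series(data, index=[100, 101, 102, 103])

    # Create a Series from a dictionary (the keys become the index)
    data = {'a': 0., 'b': 1., 'c': 2.}
    s = pd.Series(data)

    # Create a Series from a scalar (the value is repeated for every label)
    s = pd.Series(5, index=[0, 1, 2, 3])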
To access data in a Series:
    # Retrieve elements by position (.iloc makes the positional lookup explicit)
    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
    print(s.iloc[0], s.iloc[:3], s.iloc[-3:])
    # Retrieve elements by label
    print(s['a'], s[['a', 'c', 'd']])
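The dictionary-like behaviour described earlier can be sketched as follows. This is a minimal illustration; the variable names are made up for the example, and it only shows that a Series keeps its labels and values separately, so the labels can be replaced without touching the values.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# A Series pairs an index (the labels) with an array of values
labels = list(s.index)
values = list(s.values)

# The labels can be replaced without changing the values
renamed = s.copy()
renamed.index = ['x', 'y', 'z']

# Like a dictionary, membership tests look at the labels, not the values
has_a = 'a' in s
has_x = 'x' in s

# .loc looks up by label, .iloc looks up by position
first_by_label = s.loc['a']
first_by_position = s.iloc[0]

print(labels, values)
print(renamed)
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity between labels and positions, which matters most when the labels are themselves integers.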
pandas date processing
    # Date string formats recognized by pandas
    dates = pd.Series(['2011', '2011-02', '2011-03-01', '2011/04/01',
                       '2011/05/01 01:01:01', '01 Jun 2011'])
    # to_datetime() converts the strings to datetime values.
    # format='mixed' (pandas 2.0+) parses each string on its own; omit it on older versions.
    dates = pd.to_datetime(dates, format='mixed')
    print(dates, dates.dtype, type(dates))
    # datetime data supports date arithmetic
    delta = dates - pd.to_datetime('1970-01-01')
    # Get the number of days in each difference
    print(delta.dt.days)
Series.dt provides many date-related attributes, as follows:
Attribute | Description |
---|---|
Series.dt.year | The year of the datetime. |
Series.dt.month | The month, with January=1 and December=12. |
Series.dt.day | The day of the datetime. |
Series.dt.hour | The hours of the datetime. |
Series.dt.minute | The minutes of the datetime. |
Series.dt.second | The seconds of the datetime. |
Series.dt.microsecond | The microseconds of the datetime. |
Series.dt.week | The week ordinal of the year. Removed in pandas 2.0; use Series.dt.isocalendar().week instead. |
Series.dt.weekofyear | The week ordinal of the year. Removed in pandas 2.0; use Series.dt.isocalendar().week instead. |
Series.dt.dayofweek | The day of the week, with Monday=0 and Sunday=6. |
Series.dt.weekday | The day of the week, with Monday=0 and Sunday=6. |
Series.dt.dayofyear | The ordinal day of the year. |
Series.dt.quarter | The quarter of the date. |
Series.dt.is_month_start | Whether the date is the first day of a month. |
Series.dt.is_month_end | Whether the date is the last day of a month. |
Series.dt.is_quarter_start | Whether the date is the first day of a quarter. |
Series.dt.is_quarter_end | Whether the date is the last day of a quarter. |
Series.dt.is_year_start | Whether the date is the first day of a year. |
Series.dt.is_year_end | Whether the date is the last day of a year. |
Series.dt.is_leap_year | Whether the date belongs to a leap year. |
Series.dt.days_in_month | The number of days in the month. |
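As a small sketch of how these attributes are used, the example below pulls several of them from a Series of three dates. The dates and the variable names are chosen only for illustration; the ISO week is read through `isocalendar()`, which assumes pandas 1.1 or newer.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2011-03-01', '2012-02-29', '2011-12-31']))

# Each .dt attribute returns a Series aligned with the original dates
parts = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'day': dates.dt.day,
    'dayofweek': dates.dt.dayofweek,        # Monday=0, Sunday=6
    'is_month_end': dates.dt.is_month_end,
    'is_leap_year': dates.dt.is_leap_year,
})
print(parts)

# ISO week number, available through isocalendar() in pandas 1.1+
iso_week = dates.dt.isocalendar().week
print(iso_week)
```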
(2) DatetimeIndex
Given a number of periods and a frequency, the date_range() function creates a sequence of dates (a DatetimeIndex). By default, the frequency is one day.
    import datetime
    import pandas as pd

    # Daily frequency (the default)
    datelist = pd.date_range('2019/08/21', periods=5)
    print(datelist)

    # Month-end frequency ('M'; pandas 2.2+ prefers the alias 'ME')
    datelist = pd.date_range('2019/08/21', periods=5, freq='M')
    print(datelist)

    # Build a date range from a start and an end date
    # (pd.datetime was removed in pandas 2.0, so use the standard datetime module)
    start = datetime.datetime(2017, 11, 1)
    end = datetime.datetime(2017, 11, 5)
    dates = pd.date_range(start, end)
    print(dates)
bdate_range() creates a range of business dates. Unlike date_range(), it excludes Saturdays and Sundays.
    import pandas as pd

    datelist = pd.bdate_range('2011/11/03', periods=5)
    print(datelist)
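To make the difference between the two functions concrete, here is a minimal side-by-side sketch. The start date is a Thursday, so a five-day window crosses a weekend; the variable names are illustrative only.

```python
import pandas as pd

# 2011-11-03 is a Thursday, so five consecutive days include a weekend
calendar_days = pd.date_range('2011-11-03', periods=5)
business_days = pd.bdate_range('2011-11-03', periods=5)

print(calendar_days)   # Thursday through Monday, weekend included
print(business_days)   # five weekdays, Saturday and Sunday skipped

# dayofweek uses Monday=0 ... Sunday=6, so business days never exceed 4
latest_weekday = business_days.dayofweek.max()
```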
(3) DataFrame
A DataFrame is a table-like data type. It can be understood as a two-dimensional array with two indexes, one for rows and one for columns, both of which can be changed. A DataFrame has the following characteristics:
- Columns can be of different types
- The size is mutable
- Both axes (rows and columns) are labeled
- Arithmetic operations can be performed on rows and columns
    import pandas as pd

    # Create an empty DataFrame
    df = pd.DataFrame()
    print(df)

    # Create a DataFrame from a list
    data = [1, 2, 3, 4, 5]
    df = pd.DataFrame(data)
    print(df)

    # From a list of lists, naming the columns
    data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    print(df)

    # Cast the numeric column to float
    # (passing dtype=float to the constructor fails on the Name strings in recent pandas)
    df = pd.DataFrame(data, columns=['Name', 'Age']).astype({'Age': float})
    print(df)

    # From a list of dictionaries; keys missing from a row become NaN
    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(data)
    print(df)

    # Create a DataFrame from a dictionary of lists, with custom row labels
    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])
    print(df)

    # From a dictionary of Series; rows are aligned on the index labels
    data = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
            'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(data)
    print(df)
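Two of the characteristics listed above, mixed column types and arithmetic along rows and columns, can be sketched as follows. The data and column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Tom', 'Jack', 'Steve'],
     'Age': [28, 34, 29],
     'Score': [90.0, 80.5, 70.0]},
    index=['s1', 's2', 's3'],
)

# Each column keeps its own data type
column_types = df.dtypes
print(column_types)

# Arithmetic on a column is applied element by element
df['AgeNextYear'] = df['Age'] + 1

# Aggregations run down the columns by default, or across rows with axis=1
numeric = df[['Age', 'Score']]
column_means = numeric.mean()
row_sums = numeric.sum(axis=1)
print(column_means)
print(row_sums)
```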
(4) Core data structure operations
Column access
A single column of a DataFrame is a Series. By definition, a DataFrame is a labeled two-dimensional array, and each column label is the name of that column.
    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    # A single label returns a Series
    print(df['one'])
    # A list of labels returns a DataFrame
    print(df[['one', 'two']])
Column addition
Adding a column to a DataFrame is simple: assign data to a new column label.
    import pandas as pd

    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])
    # The new Series is aligned to df on the index labels
    df['score'] = pd.Series([90, 80, 70, 60], index=['s1', 's2', 's3', 's4'])
    print(df)
Column deletion
To delete a column, use either the del statement or the pop method of the DataFrame. Both are used as follows:
    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
         'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
    df = pd.DataFrame(d)
    print("dataframe is:")
    print(df)

    # Delete the column 'one' with del
    del df['one']
    print(df)

    # Call the pop method to delete the column 'two' (pop also returns it)
    df.pop('two')
    print(df)
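Both del and pop change the DataFrame in place. When the original should stay intact, `DataFrame.drop` is an alternative, since it returns a new object. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6], 'three': [7, 8, 9]})

# drop() returns a new DataFrame and leaves df unchanged
without_one = df.drop(columns=['one'])

# pop() removes the column from df itself and returns it as a Series
popped = df.pop('two')

print(without_one)
print(df)
print(popped)
```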
Row access
To access only some of the rows in a DataFrame, slice it with the ":" operator:
    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    # Rows at positions 2 and 3 (the end of the slice is excluded)
    print(df[2:4])
The loc method selects rows by index label. It is used as follows:
    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    # One label returns the row as a Series
    print(df.loc['b'])
    # A list of labels returns a DataFrame
    print(df.loc[['a', 'b']])
Unlike loc, iloc takes the integer positions of rows and columns rather than their labels. The iloc method is used as follows:
    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    # The row at position 2
    print(df.iloc[2])
    # The rows at positions 2 and 3
    print(df.iloc[[2, 3]])
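Both accessors also accept a column selector, and their slices behave differently: a label slice with loc includes its end point, while a position slice with iloc excludes it. The sketch below uses invented data to show this.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40]},
                  index=['a', 'b', 'c', 'd'])

# Label slice: rows 'a' through 'c', end point included
by_label = df.loc['a':'c', 'one']

# Position slice: rows 0 and 1, end point excluded
by_position = df.iloc[0:2, 0]

# A single cell, addressed by label and by position
cell_by_label = df.loc['d', 'two']
cell_by_position = df.iloc[3, 1]

print(by_label)
print(by_position)
```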
Row addition
    import pandas as pd

    df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
    df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])
    # DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
    df = pd.concat([df, df2])
    print(df)
Row deletion
Use index labels to delete rows from a DataFrame. If a label is duplicated, all rows with that label are deleted.
    import pandas as pd

    df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
    df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df = pd.concat([df, df2])
    # Delete the rows labeled 0 (there are two, one from each original frame)
    df = df.drop(0)
    print(df)
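The effect of duplicate labels can be seen by comparing two ways of stacking the same frames with `pd.concat`. This is only an illustrative sketch: with the original labels kept, dropping label 0 removes two rows, while renumbering with `ignore_index=True` makes every label unique so that only one row goes.

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])

# Keeping the original labels gives the index 0, 1, 0, 1
stacked = pd.concat([df, df2])
dropped_duplicates = stacked.drop(0)    # removes both rows labeled 0

# ignore_index=True renumbers the rows 0, 1, 2, 3
renumbered = pd.concat([df, df2], ignore_index=True)
dropped_single = renumbered.drop(0)     # removes only the first row

print(dropped_duplicates)
print(dropped_single)
```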
Modify data in DataFrame
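To change data in a DataFrame, select the part you want to modify and assign new values to it.
    import pandas as pd

    df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
    df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df = pd.concat([df, df2])
    # .loc selects and assigns in one step; chained indexing such as
    # df['Name'][0] = ... may write to a temporary copy instead.
    # The label 0 occurs twice here, so both of those rows are updated.
    df.loc[0, 'Name'] = 'Tom'
    print(df)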
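The same select-then-assign idea extends to conditions. The sketch below, using invented data, updates one cell by label and then every row that matches a boolean mask.

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4], ['ww', 16]], columns=['Name', 'Age'])

# Update a single cell: row label 1, column 'Name'
df.loc[1, 'Name'] = 'Tom'

# Update every row that satisfies a condition: cap Age at 10
df.loc[df['Age'] > 10, 'Age'] = 10

print(df)
```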
DataFrame common properties
Number | Property or method | Description |
---|---|---|
1 | axes | Returns a list of the row and column labels (indexes). |
2 | dtypes | Returns the data type of each column. A single column (a Series) has dtype instead. |
3 | empty | Returns True if the DataFrame has no elements. |
4 | ndim | Returns the number of dimensions of the underlying data: 2 for a DataFrame, 1 for a Series. |
5 | size | Returns the number of elements in the underlying data. |
6 | values | Returns the data as an ndarray. |
7 | head() | Returns the first n rows (5 by default). |
8 | tail() | Returns the last n rows (5 by default). |
Example code:
    import pandas as pd

    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])
    df['score'] = pd.Series([90, 80, 70, 60], index=['s1', 's2', 's3', 's4'])
    print(df)
    print(df.axes)
    print(df['Age'].dtype)
    print(df.empty)
    print(df.ndim)
    print(df.size)
    print(df.values)
    print(df.head(3))  # First three rows of df
    print(df.tail(3))  # Last three rows of df
(II) Jupyter Notebook
Jupyter Notebook (formerly known as IPython Notebook) is an interactive notebook that supports more than 40 programming languages. It uses the browser as its interface, sending requests to an IPython server in the background and displaying the results. In essence, Jupyter Notebook is a web application that makes it easy to create and share literate programming documents, with support for live code, mathematical equations, visualizations and Markdown.
IPython is an interactive Python shell that is much easier to use than the default Python shell. It supports variable auto-completion and automatic indentation, can run bash shell commands, and ships with many useful built-in functions.
Install IPython
Windows: numpy, Matplotlib and pandas need to be installed first.
Then install IPython with pip.
OS X: Download and install Xcode, Apple's development tools, from the App Store.
Then install IPython with easy_install or pip, or install it from source.
Install Jupyter Notebook
pip3 install jupyter