pandas is a third-party library that every data analyst should be familiar with. It has great advantages in scientific computing, especially for data analysis. Python already has NumPy, but NumPy is more mathematical; data analysis also needs something closer to a database that represents the data model more concretely. Excel plays a very important role in data processing because the table is the best presentation of a data model.
pandas brings the table data model to Python. It makes data processing as simple as SQL and is easy to use from Python.
Installation of pandas
Like other Python packages, pandas is installed with pip:
pip install pandas
Creating pandas Objects
pandas has two data structures: Series and DataFrame.
Series
A Series is like a list in Python, except that each element has its own index. Create a Series from a list:
>>> import pandas as pd
>>> s1 = pd.Series([100, 23, 'bugingcode'])
>>> s1
0           100
1            23
2    bugingcode
dtype: object
Pass the index argument to give the Series its own index:
>>> import numpy as np
>>> ts = pd.Series(np.random.randn(365), index=np.arange(1, 366))
>>> ts
Here the index argument sets the index to the integers from 1 to 365, because np.arange(1, 366) stops before 366.
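The index bounds can be checked directly. This is a minimal sketch that reuses the name ts from the text; the three print calls are only there for illustration.

```python
import numpy as np
import pandas as pd

# 365 random values with integer labels supplied by np.arange
ts = pd.Series(np.random.randn(365), index=np.arange(1, 366))

print(ts.index[0])   # first label
print(ts.index[-1])  # last label
print(len(ts))       # number of entries
```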
The Series data structure is most similar to a Python dictionary, and a Series can be created from one:
sd = {'xiaoming': 14, 'tom': 15, 'john': 13}
s4 = pd.Series(sd)
The dictionary keys become the index of the Series, so it already has its own index.
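A short sketch of what that index allows, using the same dictionary as above. Looking up a value by its label works much like a dictionary lookup.

```python
import pandas as pd

# Dictionary keys become the index; dictionary values become the data
sd = {'xiaoming': 14, 'tom': 15, 'john': 13}
s4 = pd.Series(sd)

print(list(s4.index))  # the labels taken from the dictionary keys
print(s4['tom'])       # label-based lookup
print('john' in s4)    # membership is tested against the index
```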
pandas works closely with Matplotlib, another third-party Python library that is most often used to display data. Matplotlib is introduced in later chapters; for now we simply use it. If it is not installed yet, install it with the same pip command: pip install matplotlib. The following plots the Series:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ts = pd.Series(np.random.randn(365), index=np.arange(1, 366))
ts.plot()
plt.show()
In data analysis, time is an important feature, because a lot of data is related to time: sales are related to time, weather is related to time, and so on. pandas provides functions for working with time; date_range generates a sequence of dates.
>>> pd.date_range('01/01/2017', periods=365)
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
               '2017-01-09', '2017-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')
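date_range also accepts an end date instead of periods, and a freq argument for the spacing between dates. The sketch below shows both; the names week and weekly are only illustrative.

```python
import pandas as pd

# Daily dates between an explicit start and end (both included)
week = pd.date_range('2017-01-01', '2017-01-07')

# Three dates spaced seven days apart
weekly = pd.date_range('2017-01-01', periods=3, freq='7D')

print(len(week))
print(weekly[-1])
```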
The earlier plot looked jagged because each value was an independent random number, unrelated to its neighbors. cumsum replaces each value with the running total of all values up to that point, which gives a continuous-looking curve:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ts = pd.Series(np.random.randn(365), index=pd.date_range('01/01/2017', periods=365))
ts = ts.cumsum()
ts.plot()
plt.show()
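To see what cumsum does without random numbers, here is a small sketch with made-up values. Each output element is the sum of all input elements up to that position.

```python
import pandas as pd

steps = pd.Series([1, -2, 3, 4])  # hypothetical day-to-day changes
total = steps.cumsum()            # running total: 1, -1, 2, 6

print(total.tolist())
```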
DataFrame
DataFrame is a two-dimensional data model, comparable to a table in Excel. It has two axes: the rows are labeled by index, just as in a Series, and the columns are labeled by columns. Building a DataFrame object therefore takes three elements: the data, the row labels (index), and the column labels (columns).
import numpy as np
import pandas as pd

# 8 rows x 6 columns of random values, indexed by date
df = pd.DataFrame(np.random.randn(8, 6), index=pd.date_range('01/01/2018', periods=8), columns=list('ABCDEF'))
print(df)
The data are as follows:
                   A         B         C         D         E         F
2018-01-01  0.712636  0.546680 -0.847866 -0.629005  2.152686  0.563907
2018-01-02 -1.292799  1.122098  0.743293  0.656412  0.989738  2.468200
2018-01-03  1.762894  0.783614 -0.301468  0.289608 -0.780844  0.873074
2018-01-04 -0.818066  1.629542 -0.595451  0.910141  0.160980  0.306660
2018-01-05  2.008658  0.456592 -0.839597  1.615013  0.718422 -0.564584
2018-01-06  0.480893  0.724015 -1.076434 -0.253731  0.337147 -0.028212
2018-01-07 -0.672501  0.739550 -1.316094  1.118234 -1.456680 -0.601890
2018-01-08 -1.028436 -1.036542 -0.459044  1.321962 -0.198338 -1.034822
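The three elements are easier to see with fixed values. The sketch below uses made-up numbers and the hypothetical name df_small; it only shows how data, index, and columns fit together.

```python
import pandas as pd

# Hypothetical values; the three pieces are data, index, and columns
df_small = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=pd.date_range('01/01/2018', periods=3),
    columns=['A', 'B'],
)

print(df_small.shape)          # (number of rows, number of columns)
print(list(df_small.columns))  # column labels
print(df_small.index[0])       # first row label
```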
In data analysis, data very often comes directly from Excel or CSV files. pandas can read an Excel file into a DataFrame, where the data is then processed:
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df)
Likewise, to_excel saves a DataFrame to an Excel file.
CSV data is handled by read_csv and to_csv, and HDF5 data by read_hdf and to_hdf.
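A minimal round trip with to_csv and read_csv might look like the sketch below. It writes to an in-memory buffer (io.StringIO) rather than a file on disk so that it runs anywhere; the data and the names df_out, df_in, and buffer are made up for illustration.

```python
import io
import pandas as pd

df_out = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})  # hypothetical data

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a DataFrame
buffer.seek(0)
df_in = pd.read_csv(buffer)

print(df_in.equals(df_out))
```

With a real file, a path such as 'data.csv' would take the place of the buffer in both calls.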
A DataFrame can be accessed in the same way as a two-dimensional array:
print(df['A'])
This returns column A as a Series:
2018-01-01    0.712636
2018-01-02   -1.292799
2018-01-03    1.762894
2018-01-04   -0.818066
2018-01-05    2.008658
2018-01-06    0.480893
2018-01-07   -0.672501
2018-01-08   -1.028436
Similarly, you can specify an element:
print(df['A']['2018-01-01'])
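The same element can also be fetched in one step with .loc, which takes the row label first and the column label second. This is a sketch with made-up values and the hypothetical name df_small; the row label is written as an explicit pd.Timestamp.

```python
import pandas as pd

df_small = pd.DataFrame(
    {'A': [0.5, -1.25, 2.0]},  # hypothetical values
    index=pd.date_range('01/01/2018', periods=3),
)

# Row label first, then column label, in a single lookup
value = df_small.loc[pd.Timestamp('2018-01-02'), 'A']
print(value)
```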
A DataFrame can also be sliced. Here df[:] returns the whole table, and [0:3] then takes its first three rows:
>>> import pandas as pd
>>> df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
>>> df[:][0:3]
                   A         B         C         D         E         F
2018-01-01  0.712636  0.546680 -0.847866 -0.629005  2.152686  0.563907
2018-01-02 -1.292799  1.122098  0.743293  0.656412  0.989738  2.468200
2018-01-03  1.762894  0.783614 -0.301468  0.289608 -0.780844  0.873074
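To slice rows and columns at the same time, pandas provides .iloc (by position) and .loc (by label). The sketch below uses made-up values and hypothetical names, and is one possible way to do it rather than the only one.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(
    np.arange(12).reshape(4, 3),  # hypothetical values 0..11
    index=pd.date_range('01/01/2018', periods=4),
    columns=list('ABC'),
)

first_rows = df_small.iloc[0:2]         # first two rows, all columns
corner = df_small.iloc[0:2, 0:2]        # first two rows, first two columns
by_label = df_small.loc[:, ['A', 'C']]  # all rows, columns A and C

print(corner)
```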
DataFrame offers many more functions, which are covered later.
If you reproduce this article, please credit the source: http://www.bugingcode.com/