Introduction to pandas (1): Installing pandas and Creating Objects

Posted by jthomp7 on Wed, 15 May 2019 20:33:13 +0200

pandas is a third-party library that every data analyst should know. It is especially strong in scientific computing and data analysis. Python already has NumPy, but NumPy is oriented toward numerical computation, and data analysis also needs a structure that represents the data model more concretely. Excel plays such an important role in data processing precisely because the table is one of the best ways to present a data model.

pandas brings the table data model to Python. It makes data processing about as simple as it is in SQL, and it is easy to use from Python code.

Installation of pandas

Like other Python packages, pandas is installed with pip:

pip install pandas
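
To confirm that the installation worked, one option is to import the library and print its version. This is only a minimal check; the variable name `installed_version` is illustrative, and the version string you see depends on your environment.

```python
import pandas as pd

# Read the version string of the installed pandas package.
installed_version = pd.__version__
print(installed_version)
```

If the import fails with ModuleNotFoundError, pandas was most likely installed for a different Python interpreter than the one running the script.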

Creating pandas Objects

pandas has two data structures: Series and DataFrame.

Series

A Series is like a Python list in which each item has its own index. Create a Series from a list:

>>> import pandas as pd
>>> s1 = pd.Series([100,23,'bugingcode'])
>>> s1
0           100
1            23
2    bugingcode
dtype: object
>>>
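
As a small sketch of how the Series above can be inspected, the following reads back its index, one element and its dtype. The names `labels` and `last_item` are illustrative and not part of the original example.

```python
import pandas as pd

s1 = pd.Series([100, 23, 'bugingcode'])

# The default index is 0, 1, 2, ... just like list positions.
labels = list(s1.index)      # [0, 1, 2]
last_item = s1[2]            # 'bugingcode'

# Mixing integers and strings gives the generic 'object' dtype.
print(labels, last_item, s1.dtype)
```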

Specify a custom index when creating a Series:

>>> import numpy as np
>>> ts = pd.Series(np.random.randn(365), index=np.arange(1,366))
>>> ts

Here the index runs from 1 to 365, because np.arange(1, 366) excludes its upper bound.

A Series most closely resembles a Python dictionary, and one can be created directly from a dictionary:
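
With a custom index, lookup by label and lookup by position are no longer the same thing. The sketch below illustrates the difference using `.loc` (label) and `.iloc` (position); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(365), index=np.arange(1, 366))

first_label = ts.index[0]    # 1
last_label = ts.index[-1]    # 365

# .loc looks up by label and .iloc by position,
# so label 1 and position 0 refer to the same element here.
same_value = ts.loc[1] == ts.iloc[0]
print(first_label, last_label, same_value)
```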

sd = {'xiaoming':14,'tom':15,'john':13}
s4 = pd.Series(sd)

The Series takes its index from the dictionary keys.

pandas works closely with Matplotlib, another third-party Python library that is most often used to display data. If you are not familiar with Matplotlib, later chapters will introduce it; for now, simply use it directly. If it is not installed yet, install it with the same pip command: pip install matplotlib. The following code plots the data:
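
A brief sketch of what that index looks like and how it is used for lookup; `names` and `tom_age` are illustrative names.

```python
import pandas as pd

sd = {'xiaoming': 14, 'tom': 15, 'john': 13}
s4 = pd.Series(sd)

# The dictionary keys become the index, in insertion order.
names = list(s4.index)       # ['xiaoming', 'tom', 'john']

# Values are looked up by key, as in a dictionary.
tom_age = s4['tom']          # 15
print(names, tom_age)
```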

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ts = pd.Series(np.random.randn(365), index=np.arange(1,366))
ts.plot()
plt.show()

In data analysis, time is an important dimension, because much data is tied to it: sales are related to time, weather is related to time... pandas also provides functions for working with time, such as date_range, which generates a sequence of dates.

>>> pd.date_range('01/01/2017',periods=365)
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
               '2017-01-09', '2017-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')
>>>
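
date_range can also be given both endpoints instead of a number of periods. The following is a minimal sketch of that variant; the name `january` is illustrative.

```python
import pandas as pd

# Give a start and an end date; the default frequency is one day.
january = pd.date_range(start='2017-01-01', end='2017-01-31')

print(len(january))              # 31
print(january[0], january[-1])   # first and last timestamp of the range
```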

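The earlier plot looked irregular because every value was an independent random number with no relation to its neighbours. cumsum replaces each value with the running total of all values up to that point, so consecutive points are connected and the curve looks continuous.

For example: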

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ts = pd.Series(np.random.randn(365), index=pd.date_range('01/01/2017',periods=365))
ts = ts.cumsum()
ts.plot()
plt.show()
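
To see what cumsum does without random data, here is a tiny sketch with fixed values; `steps` and `running_total` are illustrative names.

```python
import pandas as pd

steps = pd.Series([1, 2, 3, 4])

# Each element becomes the sum of itself and everything before it.
running_total = steps.cumsum()
print(running_total.tolist())    # [1, 3, 6, 10]
```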

DataFrame

A DataFrame is a two-dimensional data model, equivalent to the data in an Excel sheet. It has two axes: the rows are labeled by index, just as in a Series, and the columns are labeled by columns. Building a DataFrame therefore requires three things: the data, the row labels and the column labels.

df = pd.DataFrame(np.random.randn(8, 6), index=pd.date_range('01/01/2018', periods=8), columns=list('ABCDEF'))
print(df)

The data are as follows:

                   A         B         C         D         E         F
2018-01-01  0.712636  0.546680 -0.847866 -0.629005  2.152686  0.563907
2018-01-02 -1.292799  1.122098  0.743293  0.656412  0.989738  2.468200
2018-01-03  1.762894  0.783614 -0.301468  0.289608 -0.780844  0.873074
2018-01-04 -0.818066  1.629542 -0.595451  0.910141  0.160980  0.306660
2018-01-05  2.008658  0.456592 -0.839597  1.615013  0.718422 -0.564584
2018-01-06  0.480893  0.724015 -1.076434 -0.253731  0.337147 -0.028212
2018-01-07 -0.672501  0.739550 -1.316094  1.118234 -1.456680 -0.601890
2018-01-08 -1.028436 -1.036542 -0.459044  1.321962 -0.198338 -1.034822
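
The three elements can be read back from the finished object. This sketch rebuilds the same kind of DataFrame and inspects its shape and labels; the random values will differ from the table above on every run.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 6),
                  index=pd.date_range('01/01/2018', periods=8),
                  columns=list('ABCDEF'))

print(df.shape)              # (8, 6): 8 rows, 6 columns
print(list(df.columns))      # ['A', 'B', 'C', 'D', 'E', 'F']
print(df.index[0])           # first row label, 2018-01-01
```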

In data analysis, data very often comes directly from Excel or CSV files. pandas can read an Excel file into a DataFrame so that the data can be processed there.

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df)

Similarly, to_excel saves a DataFrame to an Excel file.

CSV data is handled with read_csv and to_csv, and HDF5 data with read_hdf and to_hdf.
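
A minimal sketch of a CSV round trip follows. To avoid depending on a file on disk, it writes to an in-memory buffer from the standard library; the sample data and the names `buffer` and `restored` are hypothetical and used only for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data, used only for this illustration.
df = pd.DataFrame({'name': ['tom', 'john'], 'age': [15, 13]})

# Write CSV text to an in-memory buffer; index=False omits the row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the same text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing a file name such as 'data.csv' instead of the buffer writes to and reads from disk in the same way.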

A DataFrame can be accessed in much the same way as a two-dimensional array:

print(df['A'])

This selects the column labeled A:

2018-01-01    0.712636
2018-01-02   -1.292799
2018-01-03    1.762894
2018-01-04   -0.818066
2018-01-05    2.008658
2018-01-06    0.480893
2018-01-07   -0.672501
2018-01-08   -1.028436

Similarly, you can specify an element:

print(df['A']['2018-01-01'])
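
The same element can also be reached in a single step with `.loc`, which takes the row label and the column label together. The sketch below uses a hypothetical small frame with fixed values so that the result is predictable.

```python
import pandas as pd

# Hypothetical frame with known values, used only for this illustration.
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]},
                  index=pd.date_range('01/01/2018', periods=2))

day = pd.Timestamp('2018-01-01')
chained = df['A'][day]       # select the column first, then the row
direct = df.loc[day, 'A']    # row label and column label in one step
print(chained, direct)       # 1.0 1.0
```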

A DataFrame can also be sliced along its rows and columns:

>>> import pandas as pd
>>> df = pd.read_excel('data.xlsx',sheet_name= 'Sheet1')
>>> df[:][0:3]
                   A         B         C         D         E         F
2018-01-01  0.712636  0.546680 -0.847866 -0.629005  2.152686  0.563907
2018-01-02 -1.292799  1.122098  0.743293  0.656412  0.989738  2.468200
2018-01-03  1.762894  0.783614 -0.301468  0.289608 -0.780844  0.873074
>>>
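
Slices can also be written more explicitly with `.iloc` (by position) and `.loc` (by label). This is a sketch on randomly generated data rather than on the Excel file above; `first_three` and `two_columns` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 6),
                  index=pd.date_range('01/01/2018', periods=8),
                  columns=list('ABCDEF'))

first_three = df.iloc[0:3]             # first three rows, same rows as df[:][0:3]
two_columns = df.loc[:, ['A', 'B']]    # all rows, only columns A and B
print(first_three.shape, two_columns.shape)   # (3, 6) (8, 2)
```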

A DataFrame offers many more functions, which later articles will cover.

When reproducing this article, please credit the source: http://www.bugingcode.com/
