Data analysis: pandas foundations and Jupyter notebook

Posted by sks1024 on Thu, 03 Feb 2022 03:23:02 +0100

(I) pandas Foundation

1, Introduction to pandas

Python Data Analysis Library

pandas is a library built on NumPy and created for data analysis tasks. It brings together a number of libraries and standard data models, and provides the tools needed to work with large structured data sets efficiently.

2, pandas core data structure

A data structure is the way a computer stores and organizes data. A carefully chosen data structure usually gives better run-time or storage efficiency, and data structures are closely tied to efficient retrieval algorithms and indexing techniques.

(1) Series

A Series can be understood as a one-dimensional array whose index labels you can set yourself. It resembles a fixed-length, ordered dictionary: every entry has an index label and a value.

import pandas as pd
import numpy as np

# Create an empty series (an explicit dtype avoids a dtype warning on some pandas versions)
s = pd.Series(dtype=float)
# Create a series from ndarray
data = np.array(['a','b','c','d'])
s = pd.Series(data)
s = pd.Series(data,index=[100,101,102,103])
# Create a series from a dictionary	
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
# Create a series from scalar
s = pd.Series(5, index=[0, 1, 2, 3])

To access data in Series:

# Retrieve elements by position (iloc makes the positional lookup explicit)
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s.iloc[0], s.iloc[:3], s.iloc[-3:])
# Retrieve elements by label
print(s['a'], s[['a','c','d']])
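
The comparison with an ordered dictionary can be made concrete with a small sketch. The variable names below are illustrative only; the point is that membership tests look at the labels, while arithmetic applies to the values and keeps the labels.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Like a dictionary, "in" checks the index labels, not the values
has_a = 'a' in s      # 'a' is a label
has_one = 1 in s      # 1 is a value, not a label

# Labels and values are stored side by side
labels = list(s.index)
values = s.to_numpy()

# Unlike a plain dictionary, arithmetic is vectorized and keeps the labels
doubled = s * 2

print(has_a, has_one)
print(labels, values)
print(doubled['c'])
```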

pandas date processing

# Date string format recognized by pandas
dates = pd.Series(['2011', '2011-02', '2011-03-01', '2011/04/01', 
                   '2011/05/01 01:01:01', '01 Jun 2011'])
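# to_datetime() converts the strings to datetime values.
# The strings use different formats, so pandas 2.0+ expects format='mixed';
# on older versions, drop the format argument.
dates = pd.to_datetime(dates, format='mixed')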
print(dates, dates.dtype, type(dates))
# datetime type data supports date operation
delta = dates - pd.to_datetime('1970-01-01')
# Get the number of days
print(delta.dt.days)

Series.dt provides many date-related attributes, including:

Series.dt.year	The year of the datetime.
Series.dt.month	The month as January=1, December=12.
Series.dt.day	The days of the datetime.
Series.dt.hour	The hours of the datetime.
Series.dt.minute	The minutes of the datetime.
Series.dt.second	The seconds of the datetime.
Series.dt.microsecond	The microseconds of the datetime.
Series.dt.isocalendar().week	The ISO week number of the year (replaces Series.dt.week and Series.dt.weekofyear, which were removed in pandas 2.0).
Series.dt.dayofweek	The day of the week with Monday=0, Sunday=6.
Series.dt.weekday	The day of the week with Monday=0, Sunday=6.
Series.dt.dayofyear	The ordinal day of the year.
Series.dt.quarter	The quarter of the date.
Series.dt.is_month_start	Indicates whether the date is the first day of the month.
Series.dt.is_month_end	Indicates whether the date is the last day of the month.
Series.dt.is_quarter_start	Indicator for whether the date is the first day of a quarter.
Series.dt.is_quarter_end	Indicator for whether the date is the last day of a quarter.
Series.dt.is_year_start	Indicate whether the date is the first day of a year.
Series.dt.is_year_end	Indicate whether the date is the last day of the year.
Series.dt.is_leap_year	Boolean indicator if the date belongs to a leap year.
Series.dt.days_in_month	The number of days in the month.
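
A few of these attributes in use, as a minimal sketch. The three dates are made up for illustration and were chosen so that the leap-year and year-end indicators differ between rows.

```python
import pandas as pd

# Hypothetical sample dates, all in the same ISO format
dates = pd.Series(pd.to_datetime(['2011-03-01', '2012-02-29', '2011-12-31']))

years = dates.dt.year.tolist()
leap = dates.dt.is_leap_year.tolist()
year_end = dates.dt.is_year_end.tolist()
month_lengths = dates.dt.days_in_month.tolist()

print(years)
print(leap)
print(year_end)
print(month_lengths)
print(dates.dt.dayofweek.tolist())  # Monday=0 ... Sunday=6
```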

(2) DatetimeIndex

The date_range() function creates a sequence of dates (a DatetimeIndex) from a start date, a number of periods, and a frequency. By default, the frequency is daily.

import pandas as pd
# Daily frequency
datelist = pd.date_range('2019/08/21', periods=5)
print(datelist)
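# Monthly frequency (month end); pandas 2.2+ prefers the alias 'ME' over 'M'
datelist = pd.date_range('2019/08/21', periods=5,freq='M')
print(datelist)
# Constructing a time series over an interval
# (pd.datetime was removed in pandas 2.0; pd.Timestamp works across versions)
start = pd.Timestamp(2017, 11, 1)
end = pd.Timestamp(2017, 11, 5)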
dates = pd.date_range(start, end)
print(dates)

bdate_range() generates a range of business days. Unlike date_range(), it excludes Saturdays and Sundays.

import pandas as pd
datelist = pd.bdate_range('2011/11/03', periods=5)
print(datelist)
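
To see the difference, the two functions can be called with the same arguments. This is a small sketch; the start date is the one used above, and it falls on a Thursday, so a five-day window crosses a weekend.

```python
import pandas as pd

all_days = pd.date_range('2011/11/03', periods=5)
business_days = pd.bdate_range('2011/11/03', periods=5)

all_names = all_days.day_name().tolist()
business_names = business_days.day_name().tolist()

print(all_names)       # includes the weekend
print(business_names)  # skips Saturday and Sunday
```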

(3) DataFrame

A DataFrame is a table-like data type that can be understood as a two-dimensional array. It has an index along each of its two dimensions (rows and columns), and both can be changed. A DataFrame has the following characteristics:

  • Columns can hold different types
  • Size is mutable
  • Both axes (rows and columns) are labeled
  • Arithmetic operations can be performed on rows and columns

import pandas as pd

# Create an empty DataFrame
df = pd.DataFrame()
print(df)

# Create DataFrame from list
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
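# Convert the numeric column to float (passing dtype=float to the constructor
# would also try to convert the Name strings, which fails on recent pandas)
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})
print(df)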
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

# Create DataFrame from dictionary
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
print(df)
data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
        'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df)
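
In the last example the two Series have different lengths. A short sketch of what happens in that case: rows are aligned on the union of the index labels, and entries with no data become NaN, which also changes the column's dtype.

```python
import pandas as pd

data = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
        'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)

# 'one' has no entry for label 'd', so that cell is NaN
missing = df['one'].isna()

print(df)
print(df.dtypes)   # 'one' is float64 because NaN is a float; 'two' stays int64
print(missing)
```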

(4) Core data structure operation

Column access

A single column of a DataFrame is a Series. By definition, a DataFrame is a labeled two-dimensional array, and each column label is that column's name.

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])
print(df[['one', 'two']])

Column addition

Adding a column to a DataFrame is simple: create a new column label and assign data to it.

import pandas as pd

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
df['score']=pd.Series([90, 80, 70, 60], index=['s1','s2','s3','s4'])
print(df)
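
Two details of column assignment are worth a sketch: a Series is matched to rows by index label rather than by position, and a new column can be computed from existing ones. The column name age_next_year is made up for this illustration.

```python
import pandas as pd

data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])

# The Series is given out of order and covers only two rows;
# values land on the matching labels and the rest become NaN
df['score'] = pd.Series([60, 90], index=['s4', 's1'])

# A new column derived from an existing column
df['age_next_year'] = df['Age'] + 1

print(df)
```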

Column deletion

To delete a column, use the del statement or the pop method provided by pandas. Both are used as follows:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
     'three' : pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print("dataframe is:")
print(df)

# Delete a column with del: one
del df['one']
print(df)

# Call the pop method to delete a column: two
df.pop('two')
print(df)
print(df)
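
Both approaches above change the DataFrame in place. As a sketch of an alternative that is not part of the original example, drop(columns=...) returns a new DataFrame and leaves the original untouched; pop also hands back the removed column as a Series.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6], 'three': [7, 8, 9]})

# pop removes the column in place and returns it
popped = df.pop('two')

# drop returns a new DataFrame; df itself still has 'one' and 'three'
trimmed = df.drop(columns=['one'])

print(popped)
print(df.columns.tolist())
print(trimmed.columns.tolist())
```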

Row access

To access only some rows of a DataFrame, select them with slice notation (":"). Slices are positional and exclude the end position:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
    'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df[2:4])

The loc indexer selects rows of a DataFrame by index label. It is used as follows:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.loc['b'])
print(df.loc[['a', 'b']])

Unlike loc, iloc takes integer positions rather than labels for the row and column. It is used as follows:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.iloc[2])
print(df.iloc[[2, 3]])
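
One practical difference between the two shows up when slicing. The sketch below uses a small made-up DataFrame: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. Both also accept a row and a column together.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40]},
                  index=['a', 'b', 'c', 'd'])

by_label = df.loc['b':'c']     # label slice: 'b' and 'c' are both included
by_position = df.iloc[1:2]     # position slice: end excluded, only row 'b'

cell_by_label = df.loc['c', 'two']    # row label, column label
cell_by_position = df.iloc[2, 1]      # row position, column position

print(by_label)
print(by_position)
print(cell_by_label, cell_by_position)
```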

Row addition

import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns = ['Name','Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns = ['Name','Age'])

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df, df2])
print(df)
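
When rows from two DataFrames are combined, each keeps its original index labels, so labels can repeat. A minimal sketch with pd.concat (the replacement for DataFrame.append, which pandas 2.0 no longer provides) shows both the default behavior and renumbering with ignore_index=True.

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])

kept = pd.concat([df, df2])                            # original labels kept
renumbered = pd.concat([df, df2], ignore_index=True)   # fresh labels 0..n-1

print(kept.index.tolist())
print(renumbered.index.tolist())
```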

Row deletion

Use index labels to delete rows from a DataFrame. If a label is duplicated, every row with that label is deleted.

import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns = ['Name','Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns = ['Name','Age'])
df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
# Delete the rows labeled 0 (the label occurs twice after concatenation)
df = df.drop(0)
print(df)
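
The effect of duplicate labels can be checked directly. In this sketch the combined DataFrame has the labels 0, 1, 0, 1, so drop(0) removes two rows. Renumbering the index first with reset_index(drop=True) is one way, shown here as an assumption about what is wanted, to remove only a single row.

```python
import pandas as pd

df = pd.concat([
    pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age']),
    pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age']),
])

# Label 0 belongs to both 'zs' and 'ww', so both rows go
remaining = df.drop(0)

# With unique labels 0..3, drop(0) removes only the first row
only_first_removed = df.reset_index(drop=True).drop(0)

print(remaining)
print(only_first_removed)
```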

Modify data in DataFrame

To change data in a DataFrame, select the part you want to change and assign new values to it.

import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns = ['Name','Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns = ['Name','Age'])
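df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
# Use loc for assignment; chained indexing (df['Name'][0] = ...) may act on a copy.
# Label 0 occurs twice after concatenation, so both rows are updated.
df.loc[0, 'Name'] = 'Tom'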
print(df)
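
The same select-then-assign idea extends to conditions. As a minimal sketch with made-up data, loc can take a row label and a column label to change one cell, or a boolean mask to change every row that matches a condition.

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4], ['ww', 16]], columns=['Name', 'Age'])

# Change a single cell: row label 1, column 'Age'
df.loc[1, 'Age'] = 5

# Change every row matching a condition: cap ages above 10 at 10
df.loc[df['Age'] > 10, 'Age'] = 10

print(df)
```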

DataFrame common properties

Number	Property or method	Description
1	axes	Returns a list of the row and column labels (indexes).
2	dtypes	Returns the data type (dtype) of each column; a single column has dtype.
3	empty	Returns True if the DataFrame is empty.
4	ndim	Returns the number of dimensions of the underlying data: 2 for a DataFrame, 1 for a Series.
5	size	Returns the number of elements in the underlying data.
6	values	Returns the data as an ndarray.
7	head()	Returns the first n rows.
8	tail()	Returns the last n rows.

Example code:

import pandas as pd

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
df['score']=pd.Series([90, 80, 70, 60], index=['s1','s2','s3','s4'])
print(df)
print(df.axes)
print(df['Age'].dtype)
print(df.empty)
print(df.ndim)
print(df.size)
print(df.values)
print(df.head(3)) # First three rows of df
print(df.tail(3)) # Last three rows of df

(II) Jupyter notebook

Jupyter notebook (formerly known as IPython notebook) is an interactive notebook that supports more than 40 programming languages. It uses the browser as its interface, sending requests to an IPython server in the background and displaying the results. At its core, Jupyter notebook is a web application that makes it easy to create and share literate programming documents, with support for live code, mathematical equations, visualizations, and Markdown.

IPython is an interactive Python shell that is much easier to use than the default Python shell. It supports variable auto-completion and automatic indentation, supports bash shell commands, and has many useful built-in functions and features.

Install IPython

Windows: numpy, Matplotlib, and pandas are prerequisites.

Install IPython with pip.

OS X: download and install Xcode, Apple's development tools, from the App Store.

Install IPython with easy_install or pip, or install it from source.

Install Jupyter notebook

pip3 install jupyter
