Data analysis Day08

Posted by Bookmark on Wed, 29 Dec 2021 22:52:49 +0100


pandas Foundation

Introduction to pandas

Python Data Analysis Library

pandas is a NumPy-based library created for data analysis tasks. It builds on a number of other libraries and provides standard data models, along with the tools needed to work efficiently with large structured data sets.

pandas core data structure

A data structure is a way for computers to store and organize data. In general, a carefully chosen data structure makes operations faster or storage more compact, and data structures are closely tied to efficient retrieval algorithms and indexing techniques.

Series

A Series can be understood as a one-dimensional array whose index labels can be customized. Like a fixed-length ordered dictionary, it maps each index label to a value.

import pandas as pd
import numpy as np

# Create an empty Series (an explicit dtype avoids a warning on some pandas versions)
s = pd.Series(dtype=float)
# Create a series from ndarray
data = np.array(['a','b','c','d'])
s = pd.Series(data)
s = pd.Series(data,index=[100,101,102,103])
# Create a series from a dictionary	
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
# Create a series from scalar
s = pd.Series(5, index=[0, 1, 2, 3])

To access data in a Series:

# Retrieve elements by position (iloc makes the positional access explicit)
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s.iloc[0], s.iloc[:3], s.iloc[-3:])
# Retrieve elements by label
print(s['a'], s[['a','c','d']])

pandas date processing

# Date string format recognized by pandas
dates = pd.Series(['2011', '2011-02', '2011-03-01', '2011/04/01', 
                   '2011/05/01 01:01:01', '01 Jun 2011'])
# to_datetime() converts the strings to datetime values
# (on pandas 2.x, mixed formats such as these may require format='mixed')
dates = pd.to_datetime(dates)
print(dates, dates.dtype, type(dates))
print(dates.dt.day)

# datetime type data supports date operation
delta = dates - pd.to_datetime('1970-01-01')
# Get the number of days
print(delta.dt.days)

Series.dt provides many date related operations, as follows:

Series.dt.year	The year of the datetime.
Series.dt.month	The month as January=1, December=12.
Series.dt.day	The days of the datetime.
Series.dt.hour	The hours of the datetime.
Series.dt.minute	The minutes of the datetime.
Series.dt.second	The seconds of the datetime.
Series.dt.microsecond	The microseconds of the datetime.
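Series.dt.week	The week ordinal of the year (deprecated in newer pandas; use Series.dt.isocalendar().week).
Series.dt.weekofyear	The week ordinal of the year (deprecated in newer pandas; use Series.dt.isocalendar().week).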
Series.dt.dayofweek	The day of the week with Monday=0, Sunday=6.
Series.dt.weekday	The day of the week with Monday=0, Sunday=6.
Series.dt.dayofyear	The ordinal day of the year.
Series.dt.quarter	The quarter of the date.
Series.dt.is_month_start	Indicates whether the date is the first day of the month.
Series.dt.is_month_end	Indicates whether the date is the last day of the month.
Series.dt.is_quarter_start	Indicator for whether the date is the first day of a quarter.
Series.dt.is_quarter_end	Indicator for whether the date is the last day of a quarter.
Series.dt.is_year_start	Indicate whether the date is the first day of a year.
Series.dt.is_year_end	Indicate whether the date is the last day of the year.
Series.dt.is_leap_year	Boolean indicator if the date belongs to a leap year.
Series.dt.days_in_month	The number of days in the month.
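
As a small illustration of the accessors listed above, the following sketch collects several of them into one table. The three dates are made up for the example and are not taken from the tutorial data.

```python
import pandas as pd

# Three illustrative dates (assumed for this example)
dates = pd.Series(pd.to_datetime(['2011-01-01', '2012-02-29', '2011-12-31']))

# Each .dt property returns a Series aligned with the original dates
parts = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'dayofweek': dates.dt.dayofweek,        # Monday=0, Sunday=6
    'quarter': dates.dt.quarter,
    'is_leap_year': dates.dt.is_leap_year,
    'is_year_end': dates.dt.is_year_end,
    'days_in_month': dates.dt.days_in_month,
})
print(parts)
```

Because every accessor returns a Series with the same index, the results can be combined column by column like this.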

DateTimeIndex

By specifying the number of periods and the frequency, the date_range() function creates a date sequence. The default frequency is daily.

import pandas as pd
# Daily frequency
datelist = pd.date_range('2019/08/21', periods=5)
print(datelist)
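# Monthly frequency (month end; pandas 2.2 and later prefers the alias 'ME')
datelist = pd.date_range('2019/08/21', periods=5,freq='M')
print(datelist)
# Construct a time series over an interval (pd.datetime is no longer available; pd.Timestamp works instead)
start = pd.Timestamp(2017, 11, 1)
end = pd.Timestamp(2017, 11, 5)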
dates = pd.date_range(start, end)
print(dates)

bdate_range() creates a range of business days. Unlike date_range(), it excludes Saturdays and Sundays.

import pandas as pd
datelist = pd.bdate_range('2011/11/03', periods=5)
print(datelist)
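
To make the difference concrete, here is a minimal sketch that builds both ranges from the same start date and prints the weekday names. Only the comparison itself is new; the start date is the one used above.

```python
import pandas as pd

# 2011-11-03 falls on a Thursday
all_days = pd.date_range('2011-11-03', periods=5)
business_days = pd.bdate_range('2011-11-03', periods=5)

print(all_days.day_name().tolist())       # includes Saturday and Sunday
print(business_days.day_name().tolist())  # skips the weekend
```

Both ranges contain five dates, but the business-day range ends two calendar days later because it steps over the weekend.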

DataFrame

A DataFrame is a table-like data type that can be understood as a two-dimensional array. It has two indexes, one for rows and one for columns, and both can be changed. A DataFrame has the following characteristics:

  • Columns can be of different types
  • Size is mutable
  • Axes (rows and columns) are labeled
  • Arithmetic operations can be performed on rows and columns

import pandas as pd

# Create an empty DataFrame
df = pd.DataFrame()
print(df)

# Create DataFrame from list
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
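# Cast only the numeric column to float (dtype=float on the whole frame can fail on the string column)
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})
print(df)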
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

# Create a DataFrame from a dictionary
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
print(df)
data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df)
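
The last example builds a DataFrame from Series with different indexes. The sketch below repeats it and inspects the result, to show that pandas aligns the columns on the union of the labels and fills the gaps with NaN.

```python
import pandas as pd

one = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
two = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# The row index becomes the union of both Series indexes
df = pd.DataFrame({'one': one, 'two': two})
print(df)

# 'one' has no value for label 'd', so it holds NaN there and becomes float
print(df.dtypes)
```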

Core data structure operation

Column access

A single column of a DataFrame is a Series. A DataFrame is a labeled two-dimensional array, and each column label serves as the name of that column.

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df['one'])
print(df[['one', 'two']])

Column addition

Adding a column to a DataFrame is simple: create a new column label and assign values to it.

import pandas as pd

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
df['score']=pd.Series([90, 80, 70, 60], index=['s1','s2','s3','s4'])
print(df)

Column deletion

To delete a column, use the del statement or the pop method provided by pandas:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
     'three' : pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print("dataframe is:")
print(df)

# Delete column 'one' with del
del df['one']
print(df)

# Call the pop method to delete column 'two' (pop also returns the removed column)
df.pop('two')
print(df)

Row access

To access only some rows of a DataFrame, slice it with the ":" operator:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
    'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df[2:4])

The loc indexer selects rows by index label. It is used as follows:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.loc['b'])
print(df.loc[['a', 'b']])

Unlike loc, iloc takes integer positions for rows and columns rather than labels. It is used as follows:

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df.iloc[2])
print(df.iloc[[2, 3]])
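
The following sketch contrasts the two indexers when both a row and a column selection are given. The small frame is made up for the example. One detail worth noting is that label slices in loc include the end label, while position slices in iloc exclude the end position.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40]},
                  index=['a', 'b', 'c', 'd'])

by_label = df.loc['a':'c', 'two']   # label slice: 'c' is included
by_position = df.iloc[0:2, 1]       # position slice: position 2 is excluded
cell = df.loc['b', 'one']           # a single cell by row and column label

print(by_label.tolist())
print(by_position.tolist())
print(cell)
```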

Row add

import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns = ['Name','Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns = ['Name','Age'])

df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
print(df)

Row deletion

Use index labels to delete rows from a DataFrame. If a label is duplicated, every row carrying that label is deleted.

import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns = ['Name','Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns = ['Name','Age'])
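df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
# Delete the rows labelled 0 (the label appears twice after combining the frames)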
df = df.drop(0)
print(df)
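
The effect of duplicate labels can be seen by comparing two ways of combining the same frames. This is a minimal sketch using pd.concat; the ignore_index option is an addition for illustration and is not part of the example above.

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])

kept = pd.concat([df, df2])                           # row labels 0, 1, 0, 1
renumbered = pd.concat([df, df2], ignore_index=True)  # row labels 0, 1, 2, 3

print(kept.drop(0))        # removes both rows labelled 0
print(renumbered.drop(0))  # removes only the first row
```

Renumbering the rows while combining is one way to make sure that drop removes exactly one row per label.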

Modify data in DataFrame

To change data in a DataFrame, select the part to change and assign new values to it.

import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns = ['Name','Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns = ['Name','Age'])
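df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
# A single .loc assignment avoids chained indexing; label 0 occurs twice, so both rows change
df.loc[0, 'Name'] = 'Tom'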
print(df)

DataFrame common properties

No.	Property or method	Description
1	axes	Returns a list of the row and column axis labels (indexes).
2	dtype	Returns the data type (dtype) of a Series; a DataFrame exposes dtypes instead.
3	empty	Returns True if the object is empty.
4	ndim	Returns the number of dimensions of the underlying data: 1 for a Series, 2 for a DataFrame.
5	size	Returns the number of elements in the underlying data.
6	values	Returns the data as an ndarray.
7	head(n)	Returns the first n rows.
8	tail(n)	Returns the last n rows.

Example code:

import pandas as pd

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
df['score']=pd.Series([90, 80, 70, 60], index=['s1','s2','s3','s4'])
print(df)
print(df.axes)
print(df['Age'].dtype)
print(df.empty)
print(df.ndim)
print(df.size)
print(df.values)
print(df.head(3)) # First three rows of df
print(df.tail(3)) # Last three rows of df

Jupyter notebook

Jupyter Notebook (formerly known as IPython Notebook) is an interactive notebook that supports more than 40 programming languages. It uses the browser as its interface, sending requests to an IPython server in the background and displaying the results. Jupyter Notebook is essentially a web application that makes it easy to create and share literate programming documents, with support for live code, mathematical equations, visualizations, and Markdown.

IPython is an interactive Python shell that is much easier to use than the default Python shell. It supports variable auto-completion and automatic indentation, supports bash shell commands, and has many useful built-in functions.

Install ipython

Windows (assuming NumPy, Matplotlib, and pandas are already installed):

pip install ipython

OS X: download and install Apple's development tool Xcode from the App Store.

Then install IPython with easy_install or pip, or build it from source.

Install Jupyter Notebook

pip3 install jupyter

Launch Jupyter Notebook

Find a working directory and run the command:
jupyter notebook

pandas core

pandas descriptive statistics

Descriptive statistics for numerical data mainly cover completeness (the count of non-missing values), minimum, mean, median, maximum, quartiles, range, standard deviation, variance, and covariance. Some commonly used statistical functions in NumPy can also be applied to data frames.

np.min	Minimum
np.max	Maximum
np.mean	Mean
np.ptp	Range (maximum minus minimum)
np.median	Median
np.std	Standard deviation
np.var	Variance
np.cov	Covariance

Example:

import pandas as pd
import numpy as np

# Create DF
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack', 'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
  'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

df = pd.DataFrame(d)
print(df)
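# Test descriptive statistical functions
print(df.sum())
# numeric_only=True skips the string column, which recent pandas no longer drops silently
print(df.sum(axis=1, numeric_only=True))
print(df.mean(numeric_only=True))
print(df.mean(axis=1, numeric_only=True))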
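
The NumPy functions listed earlier can be applied to the columns directly. The sketch below uses the first five Age and Rating values from the data above. One point to keep in mind is that np.std defaults to the population standard deviation (ddof=0), while the pandas std method defaults to the sample standard deviation (ddof=1).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, 26, 25, 23, 30],
                   'Rating': [4.23, 3.24, 3.98, 2.56, 3.20]})
age = df['Age'].to_numpy()
rating = df['Rating'].to_numpy()

age_range = np.ptp(age)          # range: maximum minus minimum
age_median = np.median(age)
cov = np.cov(age, rating)        # 2x2 covariance matrix of the two columns

print(age_range, age_median)
print(cov)
# Passing ddof=1 makes NumPy agree with the pandas default
print(np.std(age, ddof=1), df['Age'].std())
```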

pandas provides the following statistical functions:

No.	Function	Description
1	count()	Number of non-null observations
2	sum()	Sum of all values
3	mean()	Average of all values
4	median()	Median of all values
5	std()	Standard deviation of values
6	min()	Minimum of all values
7	max()	Maximum of all values
8	abs()	Absolute value
9	prod()	Product of array elements
10	cumsum()	Cumulative sum
11	cumprod()	Cumulative product
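
A short sketch of a few of these functions on a small Series follows. The values are made up for the example, and a missing value is included to show that count() and the reductions skip it by default.

```python
import pandas as pd

s = pd.Series([1, -2, 3, None])   # None becomes NaN, so the dtype is float

print(s.count())    # number of non-null observations
print(s.sum())      # NaN is skipped by default
print(s.prod())
print(s.abs())      # element-wise, NaN stays NaN
print(s.cumsum())   # running total, NaN stays NaN
print(s.cumprod())  # running product
```

The reductions (count, sum, prod) return a single value, while abs, cumsum, and cumprod return a Series of the same length as the input.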

pandas also provides a describe method, which reports the count of non-null values, the mean, the standard deviation, and other summary statistics of all numerical columns of a data frame at once.

import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())
print(df.describe(include=['object']))
print(df.describe(include=['number']))

pandas sort

Pandas has two sorting methods: by label and by actual value.

import pandas as pd
import numpy as np

unsorted_df=pd.DataFrame(np.random.randn(10,2),
                         index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print(unsorted_df)

Sort by row label

The sort_index() method sorts a DataFrame by its labels; the axis parameter selects rows or columns and the ascending parameter sets the sort order. By default, row labels are sorted in ascending order.

import pandas as pd
import numpy as np

# Sort by row header
sorted_df=unsorted_df.sort_index()
print (sorted_df)
# Control sort order
sorted_df = unsorted_df.sort_index(ascending=False)
print (sorted_df)

Sort by column label

import pandas as pd
import numpy as np

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
unsorted_df = pd.DataFrame(d)
# Sort by column label
sorted_df=unsorted_df.sort_index(axis=1)
print (sorted_df)

Sort by a column of values

Like index sorting, sort_values() sorts by value. Its by parameter takes the name, or a list of names, of the DataFrame columns to sort by.

import pandas as pd
import numpy as np

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
unsorted_df = pd.DataFrame(d)
# Sort by age
sorted_df = unsorted_df.sort_values(by='Age')
print (sorted_df)
# Sort by Age in ascending order and then by Rating in descending order
sorted_df = unsorted_df.sort_values(by=['Age', 'Rating'], ascending=[True, False])
print (sorted_df)

pandas grouping

In many cases, we split the data into groups and apply a function to each group. The applied function can do one of the following:

  • Aggregation - computes summary statistics
  • Transformation - performs some group-specific operation
  • Filtering - discards data under some condition

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df)

Split data into groups

# Group by Year field
print (df.groupby('Year'))
# View grouping results
print (df.groupby('Year').groups)

Iterative traversal grouping

groupby returns an iterable object, which can be traversed with a for loop:

grouped = df.groupby('Year')
# Traverse each group
for year,group in grouped:
    print (year)
    print (group)

Get a grouping detail

grouped = df.groupby('Year')
print (grouped.get_group(2014))

Grouping aggregation

An aggregate function returns a single aggregated value for each group. Once a GroupBy object has been created, operations such as sum and standard deviation can be performed on each group's data.

# Aggregate the average score of each year
# (function names given as strings avoid the warning recent pandas emits for np.mean and friends)
grouped = df.groupby('Year')
print (grouped['Points'].agg('mean'))
# Aggregate the sum, average and standard deviation of scores for each year
grouped = df.groupby('Year')
agg = grouped['Points'].agg(['sum', 'mean', 'std'])
print (agg)
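
The section introduction names three kinds of group operations, while the examples so far cover aggregation only. The sketch below, on a five-row subset of the team data, puts all three side by side. The choice of functions (centering on the team mean, keeping teams with at least two rows) is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings'],
    'Year': [2014, 2015, 2014, 2015, 2014],
    'Points': [876, 789, 863, 673, 741],
})

# Aggregation: one value per group
totals = df.groupby('Team')['Points'].sum()

# Transformation: result has the same length as the input;
# here each score is expressed relative to its team mean
centered = df.groupby('Team')['Points'].transform(lambda x: x - x.mean())

# Filtering: keep only the rows of teams that have at least two entries
multi = df.groupby('Team').filter(lambda g: len(g) >= 2)

print(totals)
print(centered)
print(multi)
```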

pandas data table association operation

pandas offers comprehensive, high-performance in-memory join operations that closely resemble those of relational databases such as SQL.
pandas provides a single merge() function as the entry point for all standard database join operations between DataFrame objects.

Merge two dataframes:

import pandas as pd
left = pd.DataFrame({
         'student_id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
         'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian', 'Bran', 'Bryce', 'Betty', 'Emma', 'Marry', 'Allen', 'Jean', 'Rose', 'David', 'Tom', 'Jack', 'Daniel', 'Andrew'],
         'class_id':[1,1,1,2,2,2,3,3,3,4,1,1,1,2,2,2,3,3,3,2], 
         'gender':['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F'], 
         'age':[20,21,22,20,21,22,23,20,21,22,20,21,22,23,20,21,22,20,21,22], 
         'score':[98,74,67,38,65,29,32,34,85,64,52,38,26,89,68,46,32,78,79,87]})
right = pd.DataFrame(
         {'class_id':[1,2,3,5],
         'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})
# Merge two dataframes
data = pd.merge(left,right)
print(data)

Merge dataframes using the "how" parameter:

# Merge two dataframes (left join)
rs = pd.merge(left, right, how='left')
print(rs)

The other merge methods correspond to their database equivalents:

Merge method	SQL equivalent	Description
left	LEFT OUTER JOIN	Use the keys from the left object
right	RIGHT OUTER JOIN	Use the keys from the right object
outer	FULL OUTER JOIN	Use the union of the keys
inner	INNER JOIN	Use the intersection of the keys

Test:

# Merge two dataframes (right join); the shared key column is class_id
rs = pd.merge(left,right,on='class_id', how='right')
print(rs)
# Merge two dataframes (outer join)
rs = pd.merge(left,right,on='class_id', how='outer')
print(rs)
# Merge two dataframes (inner join)
rs = pd.merge(left,right,on='class_id', how='inner')
print(rs)
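
To see how the four join types differ, the sketch below merges a four-row subset of the student table with the class table and counts the resulting rows. The indicator=True option is an addition for illustration; it adds a _merge column that records where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'student_name': ['Alex', 'Alice', 'Brian', 'Betty'],
                     'class_id': [1, 2, 3, 4]})
right = pd.DataFrame({'class_id': [1, 2, 3, 5],
                      'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})

counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    rs = pd.merge(left, right, on='class_id', how=how, indicator=True)
    counts[how] = len(rs)
    print(how)
    print(rs)

print(counts)
```

Class 4 exists only on the left and class 5 only on the right, so the inner join drops both, each one-sided join keeps one of them, and the outer join keeps both.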

pandas PivotTable and crosstab

The following data are available:

import pandas as pd
left = pd.DataFrame({
         'student_id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
         'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian', 'Bran', 'Bryce', 'Betty', 'Emma', 'Marry', 'Allen', 'Jean', 'Rose', 'David', 'Tom', 'Jack', 'Daniel', 'Andrew'],
         'class_id':[1,1,1,2,2,2,3,3,3,4,1,1,1,2,2,2,3,3,3,2], 
         'gender':['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F'], 
         'age':[20,21,22,20,21,22,23,20,21,22,20,21,22,23,20,21,22,20,21,22], 
         'score':[98,74,67,38,65,29,32,34,85,64,52,38,26,89,68,46,32,78,79,87]})
right = pd.DataFrame(
         {'class_id':[1,2,3,5],
         'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})
# Merge two dataframes
data = pd.merge(left,right)
print(data)

Pivot table

A pivot table is a common data summarization tool in spreadsheet programs and other data analysis software. It groups data by one or more keys and aggregates the data within each group.

# Group by class_id and gender. The default aggregation is the mean; the numeric columns are
# listed explicitly because recent pandas can raise an error when asked to average string columns
print(data.pivot_table(index=['class_id', 'gender'], values=['student_id', 'age', 'score']))

# Group by class_id and gender, aggregating only the score column
print(data.pivot_table(index=['class_id', 'gender'], values=['score']))

# Group by class_id and gender, aggregate the score column, and break the result down into one column per age value
print(data.pivot_table(index=['class_id', 'gender'], values=['score'], columns=['age']))

# Same as above, with margins=True adding row and column subtotals
print(data.pivot_table(index=['class_id', 'gender'], values=['score'], 
                       columns=['age'], margins=True))

# Same as above, but aggregating with the maximum instead of the default mean
print(data.pivot_table(index=['class_id', 'gender'], values=['score'], columns=['age'], margins=True, aggfunc='max'))

Cross table

A cross table (crosstab) is a special pivot table used to compute group frequencies:

# Group by class_id and count the number of students of each gender
print(pd.crosstab(data.class_id, data.gender, margins=True))
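
One way to read a crosstab is as a group count laid out as a table. The sketch below, on a small made-up frame, compares pd.crosstab with an equivalent groupby followed by size() and unstack().

```python
import pandas as pd

data = pd.DataFrame({'class_id': [1, 1, 1, 2, 2, 3],
                     'gender': ['M', 'M', 'F', 'F', 'M', 'F']})

# Frequency of each (class_id, gender) combination
ct = pd.crosstab(data.class_id, data.gender)

# The same counts via groupby; combinations that never occur are filled with 0
by_group = data.groupby(['class_id', 'gender']).size().unstack(fill_value=0)

print(ct)
print(by_group)
```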

pandas visualization

Basic plotting: plot

import pandas as pd
import numpy as np
import matplotlib.pyplot as mp 

df = pd.DataFrame(np.random.randn(10,4),index=pd.date_range('2018/12/18',
   periods=10), columns=list('ABCD'))
df.plot()
mp.show()

The plot method supports several plot styles besides the default line graph. They are selected through the kind keyword argument of plot(), or through the matching methods such as plot.bar(). These include:

  • bar or barh for bar charts
  • hist for histograms
  • scatter for scatter plots

Bar chart

df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar()
# df.plot.bar(stacked=True)
mp.show()

Histogram

df = pd.DataFrame()
df['a'] = pd.Series(np.random.normal(0, 1, 1000)-1)
df['b'] = pd.Series(np.random.normal(0, 1, 1000))
df['c'] = pd.Series(np.random.normal(0, 1, 1000)+1)
print(df)
df.plot.hist(bins=20)
mp.show()

Scatter diagram

df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
df.plot.scatter(x='a', y='b')
mp.show()

Pie chart

df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])
df.plot.pie(subplots=True)
mp.show()

Data reading and storage

Read and store csv:

# filepath_or_buffer: file path. The string can be a URL. Valid URL schemes include http, ftp and file.
# sep: separator. read_csv defaults to ',' and read_table defaults to the tab character '\t'.
# header: int or sequence. The row(s) to use as column names. The default 'infer' detects them automatically.
# names: array. The column names.
# index_col: the position of the index column. A sequence denotes a multi-level index.
# dtype: the data type(s) to read the columns as (a dict with column names as keys and types as values).
# engine: 'c' or 'python'. The data parsing engine. The default is c.
# nrows: int. The number of rows to read from the start of the file.

pd.read_table(
    filepath_or_buffer, sep='\t', header='infer', names=None, 
    index_col=None, dtype=None, engine=None, nrows=None) 
pd.read_csv(
    filepath_or_buffer, sep=',', header='infer', names=None, 
    index_col=None, dtype=None, engine=None, nrows=None)
DataFrame.to_csv(path_or_buf=None, sep=',', header=True, index=True, index_label=None, mode='w', encoding=None)
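
A minimal round trip, assuming an in-memory buffer in place of a real file, shows how some of these parameters interact. The index label 'id' is a name chosen for this example.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve'], 'Age': [28, 34, 29]},
                  index=['s1', 's2', 's3'])

# Write to an in-memory text buffer; index_label names the index column in the header
buffer = io.StringIO()
df.to_csv(buffer, index_label='id')
buffer.seek(0)

# Read back: use 'id' as the index, read Age as float, and stop after two rows
loaded = pd.read_csv(buffer, index_col='id', dtype={'Age': float}, nrows=2)
print(loaded)
```

With a real file, the buffer would simply be replaced by a path string in both calls.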

Reading and storing excel:

# io: the file path.
# sheet_name: the sheet to read, by position or name. The default is 0 (the first sheet).
# header: int or sequence. The row(s) to use as column names. The default is 0.
# names: array. The column names.
# index_col: the position of the index column. A sequence denotes a multi-level index.
# dtype: dict. The data type of each column.
pandas.read_excel(io, sheet_name=0, header=0, index_col=None, names=None, dtype=None)
DataFrame.to_excel(excel_writer, sheet_name='Sheet1', header=True, index=True, index_label=None)

Read and store JSON:

# The JSON content is parsed into Python objects and then converted into a DataFrame
pd.read_json('../ratings.json')
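
Since the ratings file itself is not shown here, the sketch below reads a small JSON document from an in-memory buffer instead. The field names user_id, movie_id, and rating are assumptions made for the example and may not match the real file.

```python
import io
import pandas as pd

# Hypothetical records; the real ratings.json may use different fields
text = ('[{"user_id": 1, "movie_id": 10, "rating": 5},'
        ' {"user_id": 2, "movie_id": 10, "rating": 3}]')

# read_json accepts a path, a URL, or a file-like object
ratings = pd.read_json(io.StringIO(text))
print(ratings)

# Writing back: one JSON object per row
print(ratings.to_json(orient='records'))
```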

Analysis of movielens film scoring data

The requirements are as follows:

  1. Read the data: read the user information from the user table, and import the movie rating table and the movie data table in the same way.

  2. Merge the data tables

  3. Preliminary description and analysis of the data

  4. Compute the average rating of each film by gender, calculate the difference between the genders, and then sort

  5. Calculate the average rating of each film and sort it

  6. View and sort movies with high ratings

  7. Filter out movies with fewer than 250 ratings

  8. Top ten films

  9. View the distribution of different ages and visualize it with a histogram

  10. Mark the age group of each user in the original data

  11. Visualize the frequency of different film genres in movies_ratings
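
Several of the steps above can be sketched on a toy data set. Everything in the block below is hypothetical: the three miniature tables, their column names, the rating threshold of 3 (the task uses 250), and the age bins are stand-ins chosen so the example runs without the MovieLens files.

```python
import pandas as pd

# Hypothetical miniature tables; the real MovieLens files and column names may differ
users = pd.DataFrame({'user_id': [1, 2, 3, 4],
                      'gender': ['F', 'M', 'F', 'M'],
                      'age': [18, 25, 35, 50]})
movies = pd.DataFrame({'movie_id': [10, 20],
                       'title': ['Movie A', 'Movie B']})
ratings = pd.DataFrame({'user_id': [1, 2, 3, 4, 1, 2],
                        'movie_id': [10, 10, 10, 10, 20, 20],
                        'rating': [5, 3, 4, 2, 2, 4]})

# Step 2: merge the three tables on their shared keys
data = ratings.merge(users, on='user_id').merge(movies, on='movie_id')

# Step 4: mean rating per film and gender, then the gender gap, sorted
by_gender = data.pivot_table(index='title', columns='gender',
                             values='rating', aggfunc='mean')
by_gender['diff'] = by_gender['F'] - by_gender['M']
by_gender = by_gender.sort_values(by='diff')

# Steps 5 to 7: mean rating per film, restricted to films with enough ratings
counts = data.groupby('title').size()
popular = counts[counts >= 3].index          # toy threshold instead of 250
mean_rating = data.groupby('title')['rating'].mean().loc[popular]
mean_rating = mean_rating.sort_values(ascending=False)

# Step 10: label each row with the age group of its user
data['age_group'] = pd.cut(data['age'], bins=[0, 20, 40, 100],
                           labels=['<=20', '21-40', '41+'])

print(by_gender)
print(mean_rating)
print(data[['user_id', 'age', 'age_group']])
```

The remaining steps (top ten, histograms, genre frequencies) would follow the same pattern with head(10) and the plot methods shown earlier.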

"""
demo01_series.py   Series Basics
"""
import numpy as np
import pandas as pd

# Empty Series (an explicit dtype avoids a warning on some pandas versions)
s = pd.Series(dtype=float)
print(s)

# Building Series from data
data = np.array(['zs', 'ls', 'ww', 'zl'])
s = pd.Series(data)
print(s)

# Specify the index name (take the student number as the index) to build the Series
data = np.array(['zs', 'ls', 'ww', 'zl'])
s = pd.Series(
	data, index=['1001','1002','1003','1004'])
print(s)

# Create Series from dictionary
s = pd.Series({'zs':80, 'ls':90, 'ww':70})
print('---\n', s)

# Create Series from scalar
s = pd.Series(5, index=['a', 'b', 'c', 'd'])
print('---\n', s)

# Build a Series and access its elements
data = np.array(['zs', 'ls', 'ww', 'zl'])
s = pd.Series(
	data, index=['1001','1002','1003','1004'])
print('---\n', s)
print('s.iloc[2]: ', s.iloc[2])	# Access an element by position
print('s.iloc[2:]: \n', s.iloc[2:]) # Access by position slice
print('s["1003"]:', s['1003']) # Access an element by index label
print(s[['1001', '1002', '1004']])


# Test pandas date-related processing
print('-' * 45)
dates = pd.Series(
	['2011', '2011-02', '2011-03-01', '2011/04/01', 
     '2011/05/01 01:01:01', '01 Jun 2011'])
dates = pd.to_datetime(dates)
print(dates)
print('dates months:')
print(dates.dt.month)

# Date operation
print('-' * 45)
delta = dates - pd.to_datetime('1970')
print(delta)
print(delta.dt.days)

# Generate time series
print('-' * 45)
dates = pd.date_range(
	'2019-10-01', periods=7, freq='M')
print(dates)

dates = pd.bdate_range(
	'2019-11-01', periods=7)
print(dates)


"""
demo02_dateframe.py   Data frame operation
"""
import numpy as np
import pandas as pd

# Create an empty df
df = pd.DataFrame()
print(df)

# Create df from list data source
print('-' * 45)
data = [80, 90, 70, 77]
df = pd.DataFrame(data)
print(df)

# Building df from 2D datasets
data = [['Tom', 80], ['Jerry', 70], ['Alex', 25]]
df = pd.DataFrame(data, 
	index=['101','102','103'], 
	columns=['name', 'score'])
print('-' * 45)
print(df)

# Build df from a list of dictionaries
print('-' * 45)
df = pd.DataFrame(
	[{'name':'zs', 'age':10}, 
	 {'name':'ls', 'age':12, 'score':90}])
print(df)

# Building df from dictionary
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['01','02','03','04'])
print('-' * 45)
print(df)

# Column access
print('-' * 45)
print(df['Name'])
print(df[['Name', 'Age']])

# Column addition
print('-' * 45)
df['score'] = pd.Series(
	[90, 80, 77, 36], index=df.index)
print(df)

# Column deletion
print('-' * 45)
df.pop('score')
print(df)

# Row access
print('-' * 45)
print(df[-3:])

print('-' * 45)
print(df.loc[['02','04']])  # Access rows by index label

print('-' * 45)
print(df.iloc[[0, 2]])  # Access rows by position

# Row add
data = {'Name':['Lily', 'Lucy'],
        'Age':[28, 34]}
df2 = pd.DataFrame(data, index=['05','06'])
df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
print('-' * 45)
print(df)

# Row deletion
df = df.drop(['05', '03'])
print('-' * 45)
print(df)

# Modify element
df['Age'] = 40
print(df)
df.loc['04', 'Age'] = 23  # a single .loc assignment avoids chained indexing
print(df)

print(df.values)
print(df.head(2))
print(df.tail(2))

Topics: Python Data Analysis