Data analysis DAY08
pandas Foundation
Introduction to pandas
Python Data Analysis Library
pandas is a tool built on NumPy for data analysis tasks. It bundles a number of libraries and standard data models, and provides the tools needed to work efficiently with large structured data sets.
pandas core data structure
A data structure is the way a computer stores and organizes data. A well-chosen data structure usually brings better run-time or storage efficiency, and is often closely tied to efficient retrieval algorithms and indexing techniques.
Series
A Series can be understood as a one-dimensional array whose index labels you can define yourself. It behaves like a fixed-length ordered dictionary, with an index and values.

    import pandas as pd
    import numpy as np

    # Create an empty Series (an explicit dtype avoids a warning in some pandas versions)
    s = pd.Series(dtype=float)
    # Create a Series from an ndarray
    data = np.array(['a', 'b', 'c', 'd'])
    s = pd.Series(data)
    s = pd.Series(data, index=[100, 101, 102, 103])
    # Create a Series from a dictionary
    data = {'a': 0., 'b': 1., 'c': 2.}
    s = pd.Series(data)
    # Create a Series from a scalar
    s = pd.Series(5, index=[0, 1, 2, 3])

To access data in a Series:

    # Retrieve elements by position (iloc makes positional access explicit)
    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
    print(s.iloc[0], s.iloc[:3], s.iloc[-3:])
    # Retrieve elements by label
    print(s['a'], s[['a', 'c', 'd']])

pandas date processing

    # Date string formats recognized by pandas
    dates = pd.Series(['2011', '2011-02', '2011-03-01', '2011/04/01',
                       '2011/05/01 01:01:01', '01 Jun 2011'])
    # to_datetime() converts the strings to datetime values.
    # Note: pandas 2.0 and later expect format='mixed' here,
    # because the strings do not share one format.
    dates = pd.to_datetime(dates)
    print(dates, dates.dtype, type(dates))
    print(dates.dt.day)
    # Datetime data supports date arithmetic
    delta = dates - pd.to_datetime('1970-01-01')
    # Get the number of days
    print(delta.dt.days)

Series.dt provides many date-related properties, as follows:
Property | Description |
---|---|
Series.dt.year | The year of the datetime. |
Series.dt.month | The month as January=1, December=12. |
Series.dt.day | The day of the datetime. |
Series.dt.hour | The hours of the datetime. |
Series.dt.minute | The minutes of the datetime. |
Series.dt.second | The seconds of the datetime. |
Series.dt.microsecond | The microseconds of the datetime. |
Series.dt.week | The week ordinal of the year (removed in pandas 2.0; use Series.dt.isocalendar().week). |
Series.dt.weekofyear | The week ordinal of the year (removed in pandas 2.0; use Series.dt.isocalendar().week). |
Series.dt.dayofweek | The day of the week with Monday=0, Sunday=6. |
Series.dt.weekday | The day of the week with Monday=0, Sunday=6. |
Series.dt.dayofyear | The ordinal day of the year. |
Series.dt.quarter | The quarter of the date. |
Series.dt.is_month_start | Whether the date is the first day of the month. |
Series.dt.is_month_end | Whether the date is the last day of the month. |
Series.dt.is_quarter_start | Whether the date is the first day of a quarter. |
Series.dt.is_quarter_end | Whether the date is the last day of a quarter. |
Series.dt.is_year_start | Whether the date is the first day of a year. |
Series.dt.is_year_end | Whether the date is the last day of a year. |
Series.dt.is_leap_year | Whether the date belongs to a leap year. |
Series.dt.days_in_month | The number of days in the month. |
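As a small illustration of the Series.dt properties, the sketch below reads a few of them from three sample dates. The dates are made up for this example, and the week number is read through Series.dt.isocalendar(), which is how recent pandas versions expose it.

```python
import pandas as pd

# Hypothetical sample dates, all written in one unambiguous ISO format.
dates = pd.to_datetime(pd.Series(['2011-01-31', '2011-02-01', '2012-02-29']))

print(dates.dt.year.tolist())          # [2011, 2011, 2012]
print(dates.dt.dayofweek.tolist())     # [0, 1, 2]  (Monday=0)
print(dates.dt.is_month_end.tolist())  # [True, False, True]
print(dates.dt.is_leap_year.tolist())  # [False, False, True]

# ISO week number of each date.
iso_week = [int(w) for w in dates.dt.isocalendar().week]
print(iso_week)                        # [5, 5, 9]
```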
DatetimeIndex
Given a number of periods and a frequency, the date_range() function creates a sequence of dates. By default the frequency is one day.

    import pandas as pd

    # Daily frequency
    datelist = pd.date_range('2019/08/21', periods=5)
    print(datelist)
    # Month-end frequency ('M'; pandas 2.2 and later prefer the alias 'ME')
    datelist = pd.date_range('2019/08/21', periods=5, freq='M')
    print(datelist)
    # Build a date sequence over an interval
    # (pd.datetime was removed from pandas; pd.Timestamp is used instead)
    start = pd.Timestamp(2017, 11, 1)
    end = pd.Timestamp(2017, 11, 5)
    dates = pd.date_range(start, end)
    print(dates)

bdate_range() creates a range of business days. Unlike date_range(), it excludes Saturdays and Sundays.

    import pandas as pd

    datelist = pd.bdate_range('2011/11/03', periods=5)
    print(datelist)

DataFrame
A DataFrame is a table-like data type. It can be understood as a two-dimensional array with two indexes, one for rows and one for columns, both of which can be changed. A DataFrame has the following characteristics:
- Columns can be of different types
- Its size can change
- Both axes (rows and columns) are labeled
- Arithmetic operations can be performed on rows and columns

    import pandas as pd

    # Create an empty DataFrame
    df = pd.DataFrame()
    print(df)
    # Create a DataFrame from a list
    data = [1, 2, 3, 4, 5]
    df = pd.DataFrame(data)
    print(df)
    # Create a DataFrame from a nested list
    data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    print(df)
    # Convert the numeric column to float
    # (dtype=float on the whole frame fails on the Name strings in recent pandas)
    df['Age'] = df['Age'].astype(float)
    print(df)
    # Create a DataFrame from a list of dictionaries; missing keys become NaN
    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(data)
    print(df)
    # Create a DataFrame from a dictionary of lists
    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])
    print(df)
    # Create a DataFrame from a dictionary of Series; rows are aligned by label
    data = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
            'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(data)
    print(df)

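The last characteristic in the list above, arithmetic on rows and columns, can be sketched as follows. The column names and scores are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical scores for two students.
df = pd.DataFrame({'math': [80, 90], 'art': [70, 60]}, index=['s1', 's2'])

# Column arithmetic is element-wise and aligned on the row labels.
df['total'] = df['math'] + df['art']

# Reductions along each axis: axis=1 works across a row, axis=0 down a column.
row_mean = df[['math', 'art']].mean(axis=1)
col_sum = df[['math', 'art']].sum(axis=0)

print(df['total'].tolist())  # [150, 150]
print(row_mean.tolist())     # [75.0, 75.0]
print(col_sum.to_dict())     # {'math': 170, 'art': 130}
```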
Core data structure operation
Column access
A single column of a DataFrame is a Series. A DataFrame is a labeled two-dimensional array, and each column label serves as the name of that column.

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df['one'])           # a single column is a Series
    print(df[['one', 'two']])  # a list of labels returns a DataFrame

Column addition
Adding a column to a DataFrame is simple: assign values to a new column label.

    import pandas as pd

    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])
    # The new Series is aligned with df by its index labels
    df['score'] = pd.Series([90, 80, 70, 60], index=['s1', 's2', 's3', 's4'])
    print(df)

Column deletion
To delete a column, use either the del statement or the pop method. Both are shown below:

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
         'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
    df = pd.DataFrame(d)
    print("dataframe is:")
    print(df)
    # Delete column 'one' with del
    del df['one']
    print(df)
    # Delete column 'two' with pop (pop also returns the removed column)
    df.pop('two')
    print(df)

Row access
To access only some rows of a DataFrame, slice it with ":":

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    # Rows at positions 2 and 3
    print(df[2:4])

The loc indexer selects rows by index label. It is used as follows:

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df.loc['b'])          # one row, returned as a Series
    print(df.loc[['a', 'b']])   # several rows, returned as a DataFrame

The difference between iloc and loc is that iloc takes integer positions of rows and columns instead of labels. It is used as follows:

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df.iloc[2])        # the row at position 2
    print(df.iloc[[2, 3]])   # the rows at positions 2 and 3

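Both indexers also accept a column selector as a second argument. The sketch below, on a small frame made up for this example, contrasts the two: a loc slice is by label and includes the end label, while an iloc slice is by position and excludes the end position.

```python
import pandas as pd

df = pd.DataFrame(
    {'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40]},
    index=['a', 'b', 'c', 'd'])

# loc slices by label and includes the end label 'c'.
by_label = df.loc['a':'c', 'two']

# iloc slices by position and excludes the end position 2.
by_position = df.iloc[0:2, 1]

print(by_label.tolist())     # [10, 20, 30]
print(by_position.tolist())  # [10, 20]

# One cell addressed both ways: row label 'b' is row position 1,
# column label 'one' is column position 0.
print(df.loc['b', 'one'], df.iloc[1, 0])  # 2 2
```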
Row addition

    import pandas as pd

    df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
    df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df2])
    print(df)

Row deletion
Use index labels to drop rows from a DataFrame. If a label is duplicated, every row carrying that label is dropped.

    import pandas as pd

    df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
    df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df2])
    # Drop the rows labeled 0 (there are two of them after the concatenation)
    df = df.drop(0)
    print(df)

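The duplicate-label behavior is easy to miss, so here is a minimal sketch of it using pd.concat. It also shows ignore_index=True, which renumbers the rows so that every label is unique.

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])

# Default: each frame keeps its own 0, 1 labels, so the labels repeat.
stacked = pd.concat([df, df2])
print(stacked.index.tolist())            # [0, 1, 0, 1]
print(stacked.drop(0)['Name'].tolist())  # ['ls', 'zl']

# ignore_index=True renumbers the rows 0..3, so drop(0) removes one row only.
renumbered = pd.concat([df, df2], ignore_index=True)
print(renumbered.drop(0)['Name'].tolist())  # ['ls', 'ww', 'zl']
```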
Modify data in DataFrame
To change data in a DataFrame, select the part you want to modify and assign new values to it.

    import pandas as pd

    df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
    df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])
    df = pd.concat([df, df2])
    # Select with loc and assign in one step; chained indexing such as
    # df['Name'][0] = ... is unreliable. Both rows labeled 0 are updated.
    df.loc[0, 'Name'] = 'Tom'
    print(df)

DataFrame common properties
No. | Property or method | Description |
---|---|---|
1 | axes | Returns a list of the row and column labels (indexes). |
2 | dtypes | Returns the data type (dtype) of each column; a single column has dtype. |
3 | empty | Returns True if the object is empty. |
4 | ndim | Returns the number of dimensions of the underlying data: 1 for a Series, 2 for a DataFrame. |
5 | size | Returns the number of elements in the underlying data. |
6 | values | Returns the data as an ndarray. |
7 | head(n) | Returns the first n rows. |
8 | tail(n) | Returns the last n rows. |
Example code:

    import pandas as pd

    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])
    df['score'] = pd.Series([90, 80, 70, 60], index=['s1', 's2', 's3', 's4'])
    print(df)
    print(df.axes)
    print(df['Age'].dtype)
    print(df.empty)
    print(df.ndim)
    print(df.size)
    print(df.values)
    print(df.head(3))  # first three rows of df
    print(df.tail(3))  # last three rows of df

Jupyter notebook
Jupyter Notebook (formerly known as IPython notebook) is an interactive notebook that supports more than 40 programming languages. It uses the browser as its interface, sending requests to a background IPython server and displaying the results. In essence, Jupyter Notebook is a web application that makes it easy to create and share literate programming documents, with support for live code, mathematical equations, visualizations and Markdown.
IPython is an interactive Python shell that is much easier to use than the default Python shell. It supports tab completion of variables and automatic indentation, supports bash shell commands, and has many useful built-in functions.
Install IPython
Windows (assuming NumPy, Matplotlib and pandas are already installed):
pip install ipython
OS X: download and install Apple's development tool Xcode from the App Store.
Then install IPython with pip (or the older easy_install), or install it from source.
Install Jupyter notebook
pip3 install jupyter
Launch Jupyter notebook
Change to a working directory and run the command: jupyter notebook
pandas core
pandas descriptive statistics
Descriptive statistics of numerical data mainly cover the count of non-missing values, minimum, mean, median, maximum, quartiles, range, standard deviation, variance, covariance and so on. Some commonly used statistical functions in the NumPy library can also be applied to data frames:
Function | Description |
---|---|
np.min | Minimum |
np.max | Maximum |
np.mean | Mean |
np.ptp | Range (maximum minus minimum) |
np.median | Median |
np.std | Standard deviation |
np.var | Variance |
np.cov | Covariance |
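Example:

    import pandas as pd
    import numpy as np

    # Create the DataFrame
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}
    df = pd.DataFrame(d)
    print(df)
    # Try the descriptive statistics functions.
    # numeric_only=True skips the Name column, which recent pandas versions
    # no longer drop silently.
    print(df.sum(numeric_only=True))           # per column
    print(df.sum(axis=1, numeric_only=True))   # per row
    print(df.mean(numeric_only=True))          # per column
    print(df.mean(axis=1, numeric_only=True))  # per row
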
pandas provides the following statistics-related functions:
No. | Function | Description |
---|---|---|
1 | count() | Number of non-null observations |
2 | sum() | Sum of all values |
3 | mean() | Average of all values |
4 | median() | Median of all values |
5 | std() | Standard deviation of values |
6 | min() | Minimum of all values |
7 | max() | Maximum of all values |
8 | abs() | absolute value |
9 | prod() | Product of array elements |
10 | cumsum() | Cumulative sum |
11 | cumprod() | Cumulative product |
pandas also provides a describe() method, which reports the count of non-null values, mean, standard deviation, minimum, quartiles and maximum of every numeric column of the data frame in a single call.

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}
    # Create a DataFrame
    df = pd.DataFrame(d)
    print(df.describe())                    # numeric columns only
    print(df.describe(include=['object']))  # string columns
    print(df.describe(include=['number']))  # numeric columns, stated explicitly

pandas sort
pandas offers two kinds of sorting: by label and by actual value.

    import pandas as pd
    import numpy as np

    unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                               index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                               columns=['col2', 'col1'])
    print(unsorted_df)

Sort by row label
The sort_index() method sorts a DataFrame by its labels; the axis parameter chooses rows or columns and the ascending parameter chooses the order. By default, row labels are sorted in ascending order.

    import pandas as pd
    import numpy as np

    # Sort by row label (unsorted_df is the frame created above)
    sorted_df = unsorted_df.sort_index()
    print(sorted_df)
    # Control the sort order
    sorted_df = unsorted_df.sort_index(ascending=False)
    print(sorted_df)

Sort by column label

    import pandas as pd
    import numpy as np

    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}
    unsorted_df = pd.DataFrame(d)
    # Sort by column label
    sorted_df = unsorted_df.sort_index(axis=1)
    print(sorted_df)

Sort by a column of values
Just as sort_index() sorts by label, sort_values() sorts by value. Its by parameter takes the name of the column (or a list of column names) whose values determine the order.

    import pandas as pd
    import numpy as np

    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}
    unsorted_df = pd.DataFrame(d)
    # Sort by Age
    sorted_df = unsorted_df.sort_values(by='Age')
    print(sorted_df)
    # Sort by Age in ascending order, then by Rating in descending order
    sorted_df = unsorted_df.sort_values(by=['Age', 'Rating'], ascending=[True, False])
    print(sorted_df)

pandas grouping
In many cases we split the data into groups and apply a function to each group. The applied function can do one of the following:
- Aggregation: compute summary statistics
- Transformation: perform some group-specific operation
- Filtering: discard data under some condition

    import pandas as pd

    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
                         'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
                'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
                'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
                'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
    df = pd.DataFrame(ipl_data)
    print(df)

Split data into groups

    # Group by the Year field
    print(df.groupby('Year'))
    # View the grouping result
    print(df.groupby('Year').groups)

Iterative traversal grouping
groupby returns an iterable object. You can traverse it with a for loop:

    grouped = df.groupby('Year')
    # Traverse each group
    for year, group in grouped:
        print(year)
        print(group)

Get the details of one group

    grouped = df.groupby('Year')
    print(grouped.get_group(2014))

Grouping aggregation
An aggregate function returns one aggregated value per group. Once a groupby object has been created, operations such as sum and standard deviation can be run on the data of each group.

    # Average points for each year
    # (string names such as 'mean' avoid the NumPy import and the deprecation
    # warning that np.mean triggers in recent pandas versions)
    grouped = df.groupby('Year')
    print(grouped['Points'].agg('mean'))
    # Sum, mean and standard deviation of the points for each year
    grouped = df.groupby('Year')
    agg = grouped['Points'].agg(['sum', 'mean', 'std'])
    print(agg)

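The other two operations listed at the start of this section, transformation and filtering, can be sketched in the same way. The table below is a hypothetical, shortened version of the team data, and the threshold of two rows is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical data in the same shape as the team table used in this section.
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Kings', 'Kings', 'Kings'],
    'Points': [876, 789, 863, 741, 756, 788]})

# Transformation: the result has one value per original row,
# here the mean of the team that the row belongs to.
df['TeamMean'] = df.groupby('Team')['Points'].transform('mean')
df['Diff'] = df['Points'] - df['TeamMean']
print(df)

# Filtering: keep only the rows of teams that have at least two rows.
multi = df.groupby('Team').filter(lambda g: len(g) >= 2)
print(multi['Team'].unique().tolist())  # ['Riders', 'Kings']
```

Unlike agg, which returns one row per group, transform keeps the shape of the input, which is why its result can be assigned straight back as a new column.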
pandas data table association operation
pandas has full-featured, high-performance in-memory join operations that closely resemble those of relational databases such as SQL.
pandas provides a single merge() function as the entry point for all standard database join operations between DataFrame objects.
Merge two DataFrames:

    import pandas as pd

    left = pd.DataFrame({
        'student_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
        'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian',
                         'Bran', 'Bryce', 'Betty', 'Emma', 'Marry', 'Allen', 'Jean',
                         'Rose', 'David', 'Tom', 'Jack', 'Daniel', 'Andrew'],
        'class_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 1, 1, 1, 2, 2, 2, 3, 3, 3, 2],
        'gender': ['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M',
                   'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F'],
        'age': [20, 21, 22, 20, 21, 22, 23, 20, 21, 22, 20, 21, 22, 23, 20, 21, 22, 20, 21, 22],
        'score': [98, 74, 67, 38, 65, 29, 32, 34, 85, 64, 52, 38, 26, 89, 68, 46, 32, 78, 79, 87]})
    right = pd.DataFrame({
        'class_id': [1, 2, 3, 5],
        'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})
    # Merge the two DataFrames on their shared column class_id (inner join by default)
    data = pd.merge(left, right)
    print(data)

Merge dataframes using the "how" parameter:

    # Merge the two DataFrames (left join)
    rs = pd.merge(left, right, how='left')
    print(rs)

The other join types work the same way as in a database:
Merge method | SQL equivalent | Description |
---|---|---|
left | LEFT OUTER JOIN | Use the key of the object on the left |
right | RIGHT OUTER JOIN | Use the key of the object on the right |
outer | FULL OUTER JOIN | Union using key |
inner | INNER JOIN | Intersection using keys |
Test:

    # Merge the two DataFrames (right join); class_id is the shared key column
    rs = pd.merge(left, right, on='class_id', how='right')
    print(rs)
    # Merge the two DataFrames (outer join)
    rs = pd.merge(left, right, on='class_id', how='outer')
    print(rs)
    # Merge the two DataFrames (inner join)
    rs = pd.merge(left, right, on='class_id', how='inner')
    print(rs)

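To see how the four join types differ, it can help to run them on two very small frames. The sketch below uses hypothetical three-row tables and merge's indicator=True option, which adds a _merge column recording where each row came from.

```python
import pandas as pd

# Hypothetical miniature tables: class 4 exists only on the left,
# class 5 only on the right.
left = pd.DataFrame({'class_id': [1, 2, 4],
                     'student_name': ['Alex', 'Alice', 'Betty']})
right = pd.DataFrame({'class_id': [1, 2, 5],
                      'class_name': ['ClassA', 'ClassB', 'ClassE']})

keys = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, on='class_id', how=how, indicator=True)
    keys[how] = sorted(merged['class_id'].tolist())
    print(how, keys[how], merged['_merge'].astype(str).tolist())

# inner -> [1, 2], left -> [1, 2, 4], right -> [1, 2, 5], outer -> [1, 2, 4, 5]
```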
pandas pivot table and crosstab
The following data are available:

    import pandas as pd

    left = pd.DataFrame({
        'student_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
        'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian',
                         'Bran', 'Bryce', 'Betty', 'Emma', 'Marry', 'Allen', 'Jean',
                         'Rose', 'David', 'Tom', 'Jack', 'Daniel', 'Andrew'],
        'class_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 1, 1, 1, 2, 2, 2, 3, 3, 3, 2],
        'gender': ['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M',
                   'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F'],
        'age': [20, 21, 22, 20, 21, 22, 23, 20, 21, 22, 20, 21, 22, 23, 20, 21, 22, 20, 21, 22],
        'score': [98, 74, 67, 38, 65, 29, 32, 34, 85, 64, 52, 38, 26, 89, 68, 46, 32, 78, 79, 87]})
    right = pd.DataFrame({
        'class_id': [1, 2, 3, 5],
        'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})
    # Merge the two DataFrames
    data = pd.merge(left, right)
    print(data)

Pivot table
A pivot table is a common data summary tool in spreadsheet programs and other data analysis software. It groups data by one or more keys and aggregates the data within each group.

    # Group by class_id and gender. By default the remaining columns are
    # aggregated with the mean; recent pandas versions raise an error on
    # string columns, so they are dropped first.
    numeric = data.drop(columns=['student_name', 'class_name'])
    print(numeric.pivot_table(index=['class_id', 'gender']))
    # Group by class_id and gender, aggregating only the score column
    print(data.pivot_table(index=['class_id', 'gender'], values=['score']))
    # As above, and additionally split the result into one column per value of age
    print(data.pivot_table(index=['class_id', 'gender'], values=['score'], columns=['age']))
    # As above, and add row and column subtotals (margins)
    print(data.pivot_table(index=['class_id', 'gender'], values=['score'],
                           columns=['age'], margins=True))
    # As above, but aggregate with the maximum instead of the mean
    print(data.pivot_table(index=['class_id', 'gender'], values=['score'],
                           columns=['age'], margins=True, aggfunc='max'))

Cross table
A cross table (crosstab) is a special pivot table used to count how often each combination of groups occurs:

    # Group by class_id and count the students of each gender
    print(pd.crosstab(data.class_id, data.gender, margins=True))

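One way to see what crosstab computes is to rebuild the same counts with groupby. The sketch below uses a hypothetical five-row table; it is meant as an illustration of the idea, not as a description of how pandas implements crosstab internally.

```python
import pandas as pd

# Hypothetical miniature table.
data = pd.DataFrame({
    'class_id': [1, 1, 1, 2, 2],
    'gender': ['M', 'M', 'F', 'F', 'F']})

# Frequency of every (class_id, gender) combination.
ct = pd.crosstab(data.class_id, data.gender)

# The same counts by hand: group sizes, then gender moved to the columns.
by_group = data.groupby(['class_id', 'gender']).size().unstack(fill_value=0)

print(ct)
print(ct.values.tolist())        # [[1, 2], [2, 0]]
print(by_group.values.tolist())  # [[1, 2], [2, 0]]
```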
pandas visualization
Basic plotting: plot()

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as mp

    df = pd.DataFrame(np.random.randn(10, 4),
                      index=pd.date_range('2018/12/18', periods=10),
                      columns=list('ABCD'))
    df.plot()   # one line per column
    mp.show()

The plot method supports several plot styles besides the default line graph. Pass the style as the kind keyword argument of plot(), or call the matching method such as df.plot.bar(). The styles include:
- bar or barh: bar chart
- hist: histogram
- scatter: scatter plot
- pie: pie chart
Bar chart

    df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
    df.plot.bar()
    # df.plot.bar(stacked=True)
    mp.show()

Histogram

    df = pd.DataFrame()
    df['a'] = pd.Series(np.random.normal(0, 1, 1000) - 1)
    df['b'] = pd.Series(np.random.normal(0, 1, 1000))
    df['c'] = pd.Series(np.random.normal(0, 1, 1000) + 1)
    print(df)
    df.plot.hist(bins=20)
    mp.show()

Scatter diagram

    df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
    df.plot.scatter(x='a', y='b')
    mp.show()

Pie chart

    df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])
    df.plot.pie(subplots=True)
    mp.show()

Data reading and storage
Read and store CSV (the main parameters are shown; both functions accept more):

    # filepath_or_buffer: the file path. The string can be a URL; valid URL schemes include http, ftp and file.
    # sep: the separator. read_csv defaults to ',' and read_table defaults to the tab character '\t'.
    # header: int or sequence. The row to use as column names. The default 'infer' detects it automatically.
    # names: an array of column names.
    # index_col: the position of the index column. A sequence means a multi-level index.
    # dtype: the data type to read each column as (a dict with column names as keys and types as values).
    # engine: 'c' or 'python'. The parsing engine. The default is the C engine.
    # nrows: int. Read only the first n rows.
    pd.read_table(
        filepath_or_buffer, sep='\t', header='infer', names=None,
        index_col=None, dtype=None, engine=None, nrows=None)
    pd.read_csv(
        filepath_or_buffer, sep=',', header='infer', names=None,
        index_col=None, dtype=None, engine=None, nrows=None)

    DataFrame.to_csv(path_or_buf=None, sep=',', header=True, index=True,
                     index_label=None, mode='w', encoding=None)

Read and store Excel (the main parameters are shown):

    # io: the file path.
    # sheet_name: the sheet to read, by position or name. The default is 0, the first sheet.
    # header: int or sequence. The row to use as column names. The default is 0, the first row.
    # names: an array of column names.
    # index_col: the position of the index column. A sequence means a multi-level index.
    # dtype: dict. The data type of each column.
    pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, dtype=None)

    DataFrame.to_excel(excel_writer, sheet_name='Sheet1', header=True,
                       index=True, index_label=None)

Read and store JSON:

    # Parse the JSON file and build a DataFrame from its contents
    pd.read_json('../ratings.json')

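The read and write functions can be tried without touching the file system. The sketch below is a minimal round trip through CSV and JSON text held in memory; the two-row table is made up for the example, and io.StringIO stands in for a file.

```python
import io
import pandas as pd

# Hypothetical two-row table.
df = pd.DataFrame({'name': ['Tom', 'Jack'], 'age': [28, 34]})

# to_csv with no path returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# orient='records' writes a list of row objects.
json_text = df.to_json(orient='records')
from_json = pd.read_json(io.StringIO(json_text))

print(csv_text.splitlines()[0])    # name,age
print(from_csv['age'].tolist())    # [28, 34]
print(from_json['name'].tolist())  # ['Tom', 'Jack']
```

Passing index=False keeps the row labels out of the CSV text, so the frame read back has the same two columns as the original.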
Analysis of MovieLens film rating data
The requirements are as follows:
- Read the data: read the user information from the user table, and import the movie rating table and the movie data table in the same way.
- Merge the data tables.
- Give a preliminary description and analysis of the data.
- Compute the average rating of each film by gender, calculate the difference between the genders, and then sort.
- Calculate the average rating of each film and sort.
- View and sort the films with high ratings.
- Filter out films with fewer than 250 ratings.
- List the top ten films.
- View the distribution of ages and visualize it with a histogram.
- Mark the age group of each user in the original data.
- Visualize the frequency of the different film genres in movies_ratings.
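Several of these requirements can be sketched on a toy data set. Everything below is an assumption made for illustration: the real MovieLens files are not shown here, so the table layout, the column names (user_id, movie_id, title, rating), the ratings themselves and the small min_count threshold are all hypothetical.

```python
import pandas as pd

# Hypothetical miniature tables standing in for the real files.
users = pd.DataFrame({'user_id': [1, 2, 3, 4],
                      'gender': ['F', 'M', 'F', 'M'],
                      'age': [18, 25, 35, 50]})
movies = pd.DataFrame({'movie_id': [10, 20],
                       'title': ['Movie A', 'Movie B']})
ratings = pd.DataFrame({'user_id': [1, 2, 3, 4, 1, 2],
                        'movie_id': [10, 10, 10, 10, 20, 20],
                        'rating': [5, 3, 4, 2, 1, 4]})

# Merge the three tables on their shared keys.
data = ratings.merge(users, on='user_id').merge(movies, on='movie_id')

# Mean rating of each film by gender, then the gap between the genders, sorted.
by_gender = data.pivot_table(index='title', columns='gender',
                             values='rating', aggfunc='mean')
by_gender['diff'] = by_gender['M'] - by_gender['F']
by_gender = by_gender.sort_values(by='diff')
print(by_gender)

# Keep only films with at least min_count ratings (the requirement uses 250).
min_count = 3
counts = data.groupby('title').size()
popular = counts.index[counts >= min_count]
mean_rating = data.groupby('title')['rating'].mean().loc[popular]
print(mean_rating.sort_values(ascending=False))

# Mark the age group of each row; the bin edges are arbitrary.
data['age_group'] = pd.cut(data['age'], bins=[0, 20, 40, 60],
                           labels=['0-20', '21-40', '41-60'])
print(data[['user_id', 'age', 'age_group']].drop_duplicates())
```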

    """
    demo01_series.py  Series basics
    """
    import numpy as np
    import pandas as pd

    # Empty Series (an explicit dtype avoids a warning in some pandas versions)
    s = pd.Series(dtype=object)
    print(s)
    # Build a Series from an ndarray
    data = np.array(['zs', 'ls', 'ww', 'zl'])
    s = pd.Series(data)
    print(s)
    # Build a Series with explicit index labels (student numbers as the index)
    data = np.array(['zs', 'ls', 'ww', 'zl'])
    s = pd.Series(data, index=['1001', '1002', '1003', '1004'])
    print(s)
    # Create a Series from a dictionary
    s = pd.Series({'zs': 80, 'ls': 90, 'ww': 70})
    print('---\n', s)
    # Create a Series from a scalar
    s = pd.Series(5, index=['a', 'b', 'c', 'd'])
    print('---\n', s)
    # Build a Series and access its elements
    data = np.array(['zs', 'ls', 'ww', 'zl'])
    s = pd.Series(data, index=['1001', '1002', '1003', '1004'])
    print('---\n', s)
    print('s.iloc[2]: ', s.iloc[2])       # access an element by position
    print('s.iloc[2:]: \n', s.iloc[2:])   # access a slice by position
    print('s["1003"]:', s['1003'])        # access an element by index label
    print(s[['1001', '1002', '1004']])    # access several elements by label
    # Date handling in pandas
    print('-' * 45)
    dates = pd.Series(
        ['2011', '2011-02', '2011-03-01', '2011/04/01',
         '2011/05/01 01:01:01', '01 Jun 2011'])
    # pandas 2.0 and later expect format='mixed' for strings in different formats
    dates = pd.to_datetime(dates)
    print(dates)
    print('dates months:')
    print(dates.dt.month)
    # Date arithmetic
    print('-' * 45)
    delta = dates - pd.to_datetime('1970')
    print(delta)
    print(delta.dt.days)
    # Generate date sequences
    print('-' * 45)
    # 'M' means month end; pandas 2.2 and later prefer the alias 'ME'
    dates = pd.date_range('2019-10-01', periods=7, freq='M')
    print(dates)
    dates = pd.bdate_range('2019-11-01', periods=7)
    print(dates)


    """
    demo02_dateframe.py  Data frame operations
    """
    import numpy as np
    import pandas as pd

    # Create an empty df
    df = pd.DataFrame()
    print(df)
    # Create a df from a list
    print('-' * 45)
    data = [80, 90, 70, 77]
    df = pd.DataFrame(data)
    print(df)
    # Build a df from a two-dimensional data set
    data = [['Tom', 80], ['Jerry', 70], ['Alex', 25]]
    df = pd.DataFrame(data, index=['101', '102', '103'],
                      columns=['name', 'score'])
    print('-' * 45)
    print(df)
    # Build a df from a list of dictionaries
    print('-' * 45)
    df = pd.DataFrame(
        [{'name': 'zs', 'age': 10}, {'name': 'ls', 'age': 12, 'score': 90}])
    print(df)
    # Build a df from a dictionary
    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['01', '02', '03', '04'])
    print('-' * 45)
    print(df)
    # Column access
    print('-' * 45)
    print(df['Name'])
    print(df[['Name', 'Age']])
    # Column addition
    print('-' * 45)
    df['score'] = pd.Series([90, 80, 77, 36], index=df.index)
    print(df)
    # Column deletion
    print('-' * 45)
    df.pop('score')
    print(df)
    # Row access
    print('-' * 45)
    print(df[-3:])
    print('-' * 45)
    print(df.loc[['02', '04']])   # access rows by index label
    print('-' * 45)
    print(df.iloc[[0, 2]])        # access rows by position
    # Row addition (DataFrame.append was removed in pandas 2.0)
    data = {'Name': ['Lily', 'Lucy'], 'Age': [28, 34]}
    df2 = pd.DataFrame(data, index=['05', '06'])
    df = pd.concat([df, df2])
    print('-' * 45)
    print(df)
    # Row deletion
    df = df.drop(['05', '03'])
    print('-' * 45)
    print(df)
    # Modify elements
    df['Age'] = 40
    print(df)
    # Assign through loc rather than chained indexing df['Age']['04']
    df.loc['04', 'Age'] = 23
    print(df)
    print(df.values)
    print(df.head(2))
    print(df.tail(2))