Getting started with data analysis, task01: data loading and exploratory data analysis

Posted by Jtech on Tue, 11 Jan 2022 17:19:13 +0100

Chapter II

Section I data loading and preliminary observation

1.1 loading data

1.1.1 task 1: import numpy and pandas

import numpy as np
import pandas as pd

1.1.2 task 2: load data

(1) Load data using relative paths

#Import data using a relative path
df=pd.read_csv('train.csv')

(2) Load data using absolute path

#Import data using an absolute path
import os
path=os.path.abspath('train.csv')
df = pd.read_csv(path)

Extensions

(1) Displays the number of table rows and columns

#Number of rows and columns
df.shape

(2) Transpose table

#Transpose
df.T

Before transpose

After transposition

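(3) read_table uses the tab character as its default separator

#Without sep, a comma-separated file is read as a single column
pd.read_table(path)

read_csv separates on commas by default, while read_table separates on tabs. To read a comma-separated file with read_table, you need to set the sep parameter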

#Comma separator
pd.read_table(path,sep=',')
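The difference between the two default separators can be checked without the real file. The sketch below assumes a tiny made-up three-column sample in place of train.csv and reads it from memory with io.StringIO.

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for train.csv
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"

# read_table splits on tabs by default, so each comma-separated line stays in one column
as_table = pd.read_table(io.StringIO(csv_text))

# Passing sep=',' makes read_table split the same way read_csv does
as_table_comma = pd.read_table(io.StringIO(csv_text), sep=',')
as_csv = pd.read_csv(io.StringIO(csv_text))

print(as_table.shape)        # a single column
print(as_table_comma.shape)  # three columns
```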

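1.1.3 task 3: read the data block by block, 10 rows per block

#Read block by block; with chunksize set, read_csv returns a TextFileReader, not a DataFrame
chunker = pd.read_csv('train.csv',chunksize=10)
#get_chunk() returns the next 10 rows as a DataFrame
chunker.get_chunk()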


Block-by-block reading lets you process a long file in pieces instead of loading it all at once.

Comparison of the block-by-block and ordinary read types:

A block-by-block read returns a TextFileReader rather than a DataFrame, so head() cannot be called on it directly.
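Each block that comes out of the reader is an ordinary DataFrame, so the usual way to work with it is a loop. A minimal sketch, assuming a made-up 25-row sample instead of train.csv:

```python
import io
import pandas as pd

# Hypothetical 25-row sample standing in for train.csv
rows = "\n".join(f"{i},{i % 2}" for i in range(1, 26))
csv_text = "PassengerId,Survived\n" + rows + "\n"

# With chunksize set, read_csv returns an iterator that yields DataFrames
reader = pd.read_csv(io.StringIO(csv_text), chunksize=10)

chunk_sizes = []
survived_total = 0
for chunk in reader:
    # Inside the loop, head(), sum() and so on work as usual
    chunk_sizes.append(len(chunk))
    survived_total += int(chunk["Survived"].sum())

print(chunk_sizes, survived_total)
```

The last block is simply shorter when the row count is not a multiple of chunksize.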

1.1.4 task 4: change the header to Chinese and the index to passenger ID [for some English materials, translating the header helps us get familiar with the data more intuitively]

PassengerId => passenger ID
Survived => survived
Pclass => passenger class (Class 1 / 2 / 3)
Name => passenger name
Sex => gender
Age => age
SibSp => number of siblings / spouses (labelled "cousins / sisters" in the header used in the code)
Parch => number of parents and children
Ticket => ticket information
Fare => fare
Cabin => cabin
Embarked => port of embarkation

The first method:

#Change header to Chinese
df.columns = ['passenger ID','Survive','Passenger class(1/2/3 Class space)','Passenger name','Gender','Age','male cousins/Number of sisters','Number of parents and children','Ticket information','Ticket Price','passenger cabin','Boarding port']


The second method:

#Alternative: pass the new names when reading the file (this adds one extra row)
df = pd.read_csv('train.csv',names=['passenger ID','Survive','Passenger class(1/2/3 Class space)','Passenger name','Gender','Age','male cousins/Number of sisters','Number of parents and children','Ticket information','Ticket Price','passenger cabin','Boarding port'])
df


Passing names when reading the file renames the columns, but the original header line is then read as an extra first row of data. This method is generally not used because it mixes the header into the data.
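If you do want to rename while reading, one possible workaround is to pass header=0 together with names, so the original header line is replaced instead of kept as data. The sketch below assumes a made-up three-column sample and shortened column names; set_index is one way to make the passenger ID the index, which the task title asks for.

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for train.csv
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
new_names = ['passenger ID', 'Survive', 'Passenger class']

# names alone: the original header line is kept as the first data row
with_extra_row = pd.read_csv(io.StringIO(csv_text), names=new_names)

# names plus header=0: the original header line is replaced by the new names
replaced = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

# Use the passenger ID column as the index
indexed = replaced.set_index('passenger ID')

print(with_extra_row.shape, replaced.shape)
```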

1.2 preliminary observation

1.2.1 task 1: View basic information of data

For example:

#1. View basic data information (column types and non-null counts)
df.info()

#2. View summary statistics
df.describe()
#The output includes count, mean, std (standard deviation), min, quartiles and max

1.2.2 task 2: observe the data in the first 10 rows and the data in the last 15 rows of the table

head() and tail() display 5 rows by default.

#Observe the data in the first 10 rows of the table
df.head(10)


#Observe the data in the last 15 rows of the table
df.tail(15)

1.2.3 task 3: judge whether the data is empty, return True where it is empty, and return False in other places

#Judge whether the data is empty, return True where it is empty, and return False in other places
df.isnull()
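A full table of True and False is hard to read on a large dataset, so it is common to sum it up per column. A small sketch, assuming a made-up three-row sample with the same kind of gaps as the Age and Cabin columns:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values
sample = pd.DataFrame({'Age': [22.0, np.nan, 26.0],
                       'Cabin': [np.nan, 'C85', np.nan]})

# True counts as 1, so summing gives the number of missing values per column
missing_per_column = sample.isnull().sum()

# Number of rows that have at least one missing value
rows_with_missing = int(sample.isnull().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```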

1.3 saving data

1.3.1 task 1: save the data you loaded and changed as a new file train_chinese.csv in the working directory

df.to_csv('train_chinese.csv')
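One thing to watch for: to_csv also writes the row index by default, so reading the file back gives an extra unnamed column. The sketch below assumes a made-up two-row frame and writes to an in-memory buffer instead of a real file; index=False is one way to avoid the extra column.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the renamed data
frame = pd.DataFrame({'passenger ID': [1, 2], 'Survive': [0, 1]})

# Default: the index is written as the first, unnamed column
buffer_default = io.StringIO()
frame.to_csv(buffer_default)
buffer_default.seek(0)
reloaded_default = pd.read_csv(buffer_default)

# index=False: only the data columns are written
buffer_no_index = io.StringIO()
frame.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
reloaded_clean = pd.read_csv(buffer_no_index)

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```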

Section II pandas Foundation

1.4 know what your data is

1.4.1 task 1: there are two data types in pandas, DataFrame and Series. A quick search is enough to get a basic understanding of them. Then write a small example of each [open question]

Series

First example:

#Series: a one-dimensional array with an index; np.random.randn generates the random numbers
s=pd.Series(np.random.randn(5),index=['a','b','c','d','e'])


Second example:

The third example:

#Dictionary form
s=pd.Series({'b':1,'a':0,'e':2})
s

DataFrame

The fourth example:

#DataFrame two-dimensional array, which can be generated by Series
d = {'one' : pd.Series([1.,2.,3.],index=['a','b','c']),'two': pd.Series([4.,5.,6.,7.],index=['a','b','c','d'])}
df = pd.DataFrame(d)
df

1.4.2 task 2: load the "train.csv" file according to the method of the previous lesson

A relative path is used here.

df=pd.read_csv('train.csv')

1.4.3 task 3: view the name of each column of DataFrame data

df.columns

1.4.4 task 4: view all values in the "Cabin" column [there are many ways]

The first method:

df.Cabin


The second method:

df['Cabin']


Checking the type shows that both methods return a Series.

Double brackets return a DataFrame instead:

#Select as a DataFrame
df[['Cabin']]
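The type difference between single and double brackets is easy to confirm. A minimal sketch, assuming a made-up three-row sample rather than the real data:

```python
import pandas as pd

# Hypothetical sample with a Cabin column
sample = pd.DataFrame({'Cabin': ['C85', None, 'E46'],
                       'Fare': [71.28, 7.25, 51.86]})

as_series = sample['Cabin']    # single brackets: one-dimensional Series
as_frame = sample[['Cabin']]   # double brackets: DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```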

1.4.5 task 5: load the file "test_1.csv", compare it with "train.csv" to find the extra columns, and then delete them

Load the file "test_1.csv"

test_1 = pd.read_csv('test_1.csv')
test_1.head()


You can find that column a is redundant. Delete the extra column a:
The first way:

#Delete extra columns
del test_1['a']
test_1.head()

The second way:

#pop removes the column in place and returns it; use it instead of del, not after it
test_1.pop('a')
test_1.head()

The third way:

#drop returns a new DataFrame without the column; test_1 itself is not changed
test_1.drop(['a'],axis=1)

The fourth way:

#With inplace=True nothing is returned; the column is removed from test_1 itself
test_1.drop(['a'],axis=1,inplace = True)
test_1.head()
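Since the four ways are alternatives, running them one after another fails once column a is gone. The sketch below assumes a made-up miniature of test_1.csv and shows which calls change the original; errors='ignore' is an option of drop that skips labels that no longer exist.

```python
import pandas as pd

# Hypothetical miniature of test_1.csv with a redundant column 'a'
test_sample = pd.DataFrame({'a': [0, 1],
                            'PassengerId': [1, 2],
                            'Fare': [7.25, 71.28]})

# drop without inplace returns a new DataFrame; test_sample still has 'a'
dropped = test_sample.drop(columns=['a'])
still_there = 'a' in test_sample.columns

# pop removes 'a' from test_sample itself and returns it as a Series
popped = test_sample.pop('a')

# Dropping 'a' again would raise KeyError; errors='ignore' skips missing labels
safe = test_sample.drop(columns=['a'], errors='ignore')

print(still_there, list(test_sample.columns), list(popped))
```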

1.4.6 task 6: hide the ['PassengerId', 'Name', 'Age', 'Ticket'] column elements, and only observe the other column elements

#Column element hiding
test_1.drop(['PassengerId','Name','Age','Ticket'],axis=1)

1.5 logic of screening

1.5.1 task 1: use "Age" as the screening condition to display the information of passengers under the age of 10.

#Displays information about passengers under the age of 10
df[df["Age"]<10]

1.5.2 task 2: use "Age" as the condition to display the information of passengers aged over 10 and under 50, and name the data midage

#Display the information of passengers aged over 10 and under 50, and name the data midage
midage = df[(df["Age"]>10)& (df["Age"]<50)]
midage.head()

midage1 = df[(df["Age"]>10)|(df["Age"]<50)]
midage1.head()


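Difference between & and | when filtering data: & keeps only the rows that satisfy both conditions, while | keeps the rows that satisfy either one, which here is every row with a known age.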
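The effect of the two operators is easiest to see on a handful of ages. A small sketch, assuming a made-up Age column that includes the boundary values and one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the boundaries 10 and 50 and one missing value
ages = pd.DataFrame({'Age': [5.0, 10.0, 30.0, 49.0, 50.0, 70.0, np.nan]})

# &: both conditions must hold, so only ages strictly between 10 and 50 remain
both = ages[(ages['Age'] > 10) & (ages['Age'] < 50)]

# |: one condition is enough, so every known age passes; only NaN is excluded
either = ages[(ages['Age'] > 10) | (ages['Age'] < 50)]

print(both['Age'].tolist())
print(len(either))
```

The parentheses around each condition are required, because & and | bind more tightly than the comparison operators.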

1.5.3 task 3: display the "Pclass" and "Sex" data in the 100th row of the midage data

#Without reset_index, loc[100] looks up the original index label 100 rather than the 100th row of midage
midage.loc[[100],['Pclass','Sex','Name']]

midage1 = midage.reset_index(drop=True)
midage1.head()
midage1.loc[[100],['Pclass','Sex','Name']]


We can compare the two indexes to see the problem:

As we can see, without reset_index the lookup uses the index labels of the original data rather than row positions within midage, so a different row is returned.
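The same effect shows up on a five-row table, which makes it easier to follow. A minimal sketch, assuming made-up ages; the names people, adults and renumbered are only for illustration.

```python
import pandas as pd

# Hypothetical five-row sample
people = pd.DataFrame({'Age': [5, 20, 8, 35, 40]})

# Filtering keeps the original index labels: 1, 3 and 4
adults = people[people['Age'] > 10]

# loc looks up the label 1, which is the first row of adults
age_by_old_label = adults.loc[1, 'Age']

# After reset_index(drop=True) the labels are 0, 1, 2, so label 1 is the second row
renumbered = adults.reset_index(drop=True)
age_by_new_label = renumbered.loc[1, 'Age']

print(list(adults.index), age_by_old_label, age_by_new_label)
```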

1.5.4 task 4: use the loc method to display the "Pclass", "Name" and "Sex" data in rows 100, 105 and 108 of the midage data

#Use the loc method to display "Pclass", "Name" and "Sex" for rows 100, 105 and 108 of the midage data
midage1.loc[[100,105,108],['Pclass','Name','Sex']] 

1.5.5 task 5: use the iloc method to display the "Pclass", "Name" and "Sex" data in rows 100, 105 and 108 of the midage data

#iloc index
#Use the iloc method to display "Pclass", "Name" and "Sex" for rows 100, 105 and 108 of the midage data
midage1.iloc[[100,105,108],[2,3,4]]


The comparison shows that iloc takes integer positions for both rows and columns, whereas loc takes labels.
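Counting column positions by hand is easy to get wrong. One option is to let pandas look them up with columns.get_indexer. The sketch below assumes a made-up three-row sample with the first five Titanic columns.

```python
import pandas as pd

# Hypothetical sample with the first five columns of the Titanic data
sample = pd.DataFrame({'PassengerId': [1, 2, 3],
                       'Survived': [0, 1, 1],
                       'Pclass': [3, 1, 3],
                       'Name': ['A', 'B', 'C'],
                       'Sex': ['male', 'female', 'female']})

wanted = ['Pclass', 'Name', 'Sex']

# Translate column labels into integer positions for iloc
positions = sample.columns.get_indexer(wanted)

by_position = sample.iloc[[0, 2], positions]
by_label = sample.loc[[0, 2], wanted]

print(positions.tolist())
print(by_position.equals(by_label))
```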

Section III exploratory data analysis

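Import the required packages and data first (the code in this section assumes the data has been loaded into a DataFrame named text that uses the translated column names):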

1.6 do you know your data?

1.6.1 task 1: use Pandas to sort the sample data in ascending order

#Ascending sort
sample= pd.DataFrame(np.random.randn(3, 3), 
                     index=list('213'), 
                     columns=list('bca'))

#Sort a column from small to large
sample.sort_values('b')

# Sort the row index in ascending order
sample.sort_index()

# Sort column index in ascending order
sample.sort_index(axis=1)

# Sort column index in descending order
sample.sort_index(axis=1, ascending=False)

# Sort by two columns in descending order. Rows are ordered by the first column in by, and ties are broken by the second
sample.sort_values(by=['c', 'a'], ascending=False)

1.6.2 task 2: sort the Titanic data (train.csv) by fare and age together (in descending order). What can you find from the data

#The Titanic data (train.csv) is sorted by ticket price and age together (in descending order)
text.sort_values(by=['Ticket Price','Age'], ascending=False)


The rows are ordered by ticket price first, and rows with the same ticket price are then ordered by age. The data also suggests that the higher the ticket price, the lower the mortality rate: none of the passengers with the highest ticket price died.
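How the second column only breaks ties can be seen on four rows. A small sketch, assuming made-up fares and ages with English column names; passing a list to ascending sets the direction per column.

```python
import pandas as pd

# Hypothetical sample with two distinct fares, each shared by two passengers
sample = pd.DataFrame({'Fare': [10.0, 10.0, 50.0, 50.0],
                       'Age': [30, 20, 40, 25]})

# Both columns descending: fare decides first, age orders rows with equal fare
both_desc = sample.sort_values(by=['Fare', 'Age'], ascending=False)

# Fare descending, age ascending within the same fare
mixed = sample.sort_values(by=['Fare', 'Age'], ascending=[False, True])

print(both_desc['Age'].tolist())
print(mixed['Age'].tolist())
```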

1.6.3 task 3: use Pandas to perform arithmetic calculation and calculate the data addition result of two dataframes

#Use Pandas for arithmetic calculation and calculate the data addition result of two dataframes
x= pd.DataFrame(np.random.randn(3, 3), 
                     index=list('213'), 
                     columns=list('bca'))
x

y= pd.DataFrame(np.random.randn(3, 3), 
                     index=list('413'), 
                     columns=list('bcd'))
y

#Only cells whose row and column labels exist in both DataFrames get a value; all other cells become NaN
x+y
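With random numbers it is hard to see which cells were actually added, so the sketch below uses frames filled with ones, keeping the same index and column labels as above. It also shows add with fill_value=0, which treats a value missing on one side as 0.

```python
import numpy as np
import pandas as pd

# Same labels as above, but filled with ones so the sums are easy to read
x = pd.DataFrame(np.ones((3, 3)), index=list('213'), columns=list('bca'))
y = pd.DataFrame(np.ones((3, 3)), index=list('413'), columns=list('bcd'))

# Plain addition: only rows 1 and 3 with columns b and c exist in both frames
total = x + y

# fill_value=0: a cell missing in one frame counts as 0;
# a cell missing in both frames is still NaN
filled = x.add(y, fill_value=0)

print(total)
print(filled)
```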

1.6.4 task 4: how to calculate the size of the largest family on board through the Titanic data?

#Calculate the size of the largest family on board from the Titanic data
max(text['male cousins/Number of sisters'] + text['Number of parents and children'])
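Note that this sum counts the relatives travelling with a passenger, not the passenger themselves. The sketch below assumes a made-up three-row sample with the original English column names SibSp and Parch, and uses idxmax to find which row the maximum belongs to.

```python
import pandas as pd

# Hypothetical sample using the original English column names
sample = pd.DataFrame({'Name': ['A', 'B', 'C'],
                       'SibSp': [1, 8, 0],
                       'Parch': [0, 2, 0]})

# Relatives on board for each passenger
relatives = sample['SibSp'] + sample['Parch']

largest = int(relatives.max())           # relatives only
largest_with_self = largest + 1          # family size including the passenger
who = sample.loc[relatives.idxmax(), 'Name']

print(largest, largest_with_self, who)
```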

1.6.5 task 5: learn to use the Pandas describe() function to view basic statistics of data

#Use the Pandas describe() function to view the basic statistics of data. The first section has been run
text.describe()


The table of numbers is not very intuitive. We can use a histogram to take a look:

#Draw histogram
from matplotlib import pyplot as plt
#Check for missing values first: a column with blanks, such as age, may raise an error unless they are dropped with dropna()
plt.hist(text['Survive'])


Using the histogram, we can clearly see the number of survivors and the number of deaths.
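To get the exact numbers behind the two bars, value_counts is a handy companion to the plot. A minimal sketch, assuming a made-up five-passenger Survive column:

```python
import pandas as pd

# Hypothetical Survive column: 0 = died, 1 = survived
survive = pd.Series([0, 1, 0, 0, 1], name='Survive')

# Number of passengers per value, ordered by the value itself
counts = survive.value_counts().sort_index()

# Because the values are 0 and 1, the mean is the survival rate
survival_rate = survive.mean()

print(counts.to_dict(), survival_rate)
```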

1.6.6 task 6: look at the basic statistics of ticket prices, parents and children in the Titanic data set. What can you find?

First, let's analyze the ticket price:

text['Ticket Price'].describe()


The mean ticket price is about 32, but the maximum is about 512. We can again use a chart to take a look.

plt.hist(text['Ticket Price'])


We can find that the ticket price distribution has a long tail.
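A long right tail also shows up in the numbers: the mean is pulled well above the median and the skewness is positive. The sketch below assumes seven made-up fares, not the real column.

```python
import pandas as pd

# Hypothetical fares: mostly small values plus one very large one
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 26.0, 512.0])

mean_fare = fares.mean()
median_fare = fares.median()
skewness = fares.skew()   # positive for a right (long) tail

print(round(mean_fare, 2), median_fare, skewness > 0)
```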

The number of parents and children is analyzed below:

text['Number of parents and children'].describe()


We can see that this distribution also has a long tail: the mean is 0.38, but the maximum is 6. The histogram below shows this visually:

plt.hist(text['Number of parents and children'])


We can clearly see that most passengers travelled without parents or children.

[summary]

In this chapter, we took a first statistical look at the data using basic pandas functions, and gradually built up a way of thinking about how to analyze data.

Topics: Python Data Analysis Data Mining