Data Analysis Day 4 - Data Extraction, Data Merging and Data Computing

Posted by slightlyeskew on Wed, 24 Jul 2019 12:39:15 +0200

Start

After cleaning and transforming the imported data, we need to further extract, merge and calculate the data we need. Today, let's talk about these three steps, and then the data processing part is almost finished. We can start data analysis.

1. Data extraction

Data extraction, also known as data splitting, refers to retaining, extracting some data from the original data table to form a new data table. The main methods are field splitting, record extraction and random extraction.

1.1 field splitting

Extracting part of information from a field to form a new field

1.1.1 Split by Position

In pandas, we can extract and split a piece of data by location using str.slice function (only character data can be processed)
It has only two parameters, starting position and terminating position, excluding the value of terminating position.

import pandas as pd

#Import data
data=pd.read_csv('D:/Field Splitting.csv',encoding='GBK')
#View data
print(data)
#Converting Number Column Data to Character Type after Confirming that there is no missing
data['number']=data['number'].astype(str)
#The first three numbers are coded by operators
data['Operator Coding']=data['number'].str.slice(0,3)
#The next four bits are the encoding of the location.
data['Area coding']=data['number'].str.slice(3,7)
#The last four bits are user codes
data['User Coding']=data['number'].str.slice(7,11)
#View the data after adding sharding encoding
print(data)


1.1.2 split by separator

Splitting our commonly used split function by separator
It has three main parameters.
pat: Separator, default space
n: Divided into n+1 columns, default is - 1 to return all delimited columns
Expand: Whether to expand as a data box, default to False

import pandas as pd

#Import data
data=pd.read_csv('D:/Field Splitting.csv',encoding='GBK')
#View data
print(data)
newdata=data['Commodity Name'].str.split(' ',1,True)
newdata.columns=['Company name','Commodity Name']


1.1.3 Time Attribute Separation

Time attributes are often needed to be processed in data processing. How to extract the time attributes from the original data into the style we need more is to split the time attributes.
Generally, to_datetime function is used to transform the corresponding data into time-dependent variables, and then extract the corresponding data information.
The to_datetime function looks at the end of the third day

import pandas as pd 
time=pd.read_csv('D:/Time processing.csv',encoding='GBK')
time['time']=pd.to_datetime(time.Registration time,format='%Y/%m/%d  %H:%M:%S')

time['year']=time['time'].dt.year
time['month']=time['time'].dt.month
time['week']=time['time'].dt.weekday
time['day']=time['time'].dt.day
time['time']=time['time'].dt.hour
time['branch']=time['time'].dt.minute
time['second']=time['time'].dt.second

1.2 Record Extraction

Extracting data according to specified conditions is similar to filtering

1.2.1 Keyword Extraction

Keyword extraction from data using contains function
There are three common parameters
pat: The keyword to be extracted
Case: Ignore case, default to True
na: The missing value is populated with, by default, NaN

1.2.2 Null Value Extraction

isnull function can be used to extract row information with null column values

1.2.3 Data Range Extraction

In general, <,>,<=,>=,!=, you can also use the between function.
between(left,right,inclusive=True)
left is the minimum, right is the maximum, inclusive contains the minimum and maximum, default to True

# -*- coding: utf-8 -*-
import pandas as pd
#Import data
data=pd.read_csv('D:/data processing.csv',encoding='GBK')

#Keyword extraction, using contains function
#Find all men's information
male=data[data.Gender.str.contains('male',na=False)]

#isnull function is used to extract the null value of age
null=data[data.Age.isnull()]

#Data range extraction to find information about people older than 25 years old
older=data[data.Age>25]
#Find information about people aged between 25 and 28
mid=data[data.Age.between(25,28)]

We can also combine multiple conditions, using logical operators &, |, ~join




1.3 Random Sampling

Generally, they are divided into playback sampling and non-playback sampling.
Saple function can be used in general.
n: Sampling by Number
frac: Sampling by percentage
replace: False has no playback sampling, True has playback sampling, default is False.

#Sampling data by number
data1=data.sample(3)
#Sampling data by percentage
data2=data.sample(frac=0.3)


2. Data merging

Data merging refers to merging different data in a table into a new table.

2.1 Record Merging

We will get a lot of new data in data extraction, and merging them will require record merging.
Using concat function, the data boxes to be merged are passed in as lists, and new tables can be obtained.

import pandas as pd
#Import data
data=pd.read_csv('D:/data processing.csv',encoding='GBK')
#Sampling data by number
data1=data.sample(3)
#Sampling data by percentage
data2=data.sample(frac=0.3)

#For dat1,data2 is merged into new data
newdata=pd.concat([data1,data2])

2.2 field matching

It can also be called horizontal merging, which connects two data boxes.
In general, merge function is used. It has five parameters.
Left: The data box on the left
Right: The data box on the right
left_on: The column name of the left data box used in the connection
right_on: Column name of the right data box used in the connection
how: Default is inner (inner connection) and left (left connection), right (right connection), outer (outer connection)

3. Data Computation

Do some calculations with existing data to get new data

3.1 Simple calculation

You can refer to the use of DataFrame operations in the next day, which is similar here.

3.2 Time calculation

Usually import the datetime module to calculate the corresponding data

import pandas as pd
from datetime import datetime
#Import data
time=pd.read_csv('D:/Time processing.csv',encoding='GBK')
time['time']=pd.to_datetime(time.Registration time,format='%Y/%m/%d  %H:%M:%S')
time['Days of registration']=(datetime.now()-time['time']).dt.days

3.3 Data standardization

The most common way to scale up data is 0-1 standardization.
Using the round function, round(number,ndigits)
number is the data to be processed, ndigits are the decimal digits to be preserved

#0-1 standardization of registration days
time['Standardization of registration days']=round(
        (time.Days of registration-time.Days of registration.min()
        )/(time.Days of registration.max()-time.Days of registration.min()),2)

3.4 Data Grouping

Grouping data in equidistant or non-equidistant ways is also called data discretization.
Grouping data using cut function
cut(x,bins,right,labels)
x: Columns that need to be grouped
bins: A list of custom groupings
Right: Is the right side of the grouping interval closed by default?
Labels: Custom labels for groups that can be undefined

#Grouping the registration time
bins=[80,130,180,230]
time['Registration time grouping']=pd.cut(time.Days of registration,bins)

summary

At this point, the data processing part is over, familiar with the basic processing, then you can go to the most interesting data analysis link. Then I have a small crawler project, which I can share after writing in a few days. Now I'm going to play ball.

Topics: encoding Attribute