# Python data analysis tools

Posted by MissiCoola on Mon, 07 Mar 2022 12:37:01 +0100

Python's built-in data analysis capabilities are limited, so we need to install third-party extension libraries to strengthen them.

The main extension libraries for data analysis and mining in Python are:

| Extension library | Brief introduction |
| --- | --- |
| NumPy | Provides array support and efficient functions for processing arrays |
| SciPy | Provides matrix support and numerical computation modules for matrices |
| Matplotlib | A powerful data visualization and plotting library |
| pandas | A powerful, flexible tool for data analysis and exploration |
| StatsModels | Statistical modeling and econometrics, including descriptive statistics and the estimation and inference of statistical models |
| scikit-learn | A powerful machine learning library supporting regression, classification, clustering and more |
| Keras | A deep learning library for building neural networks and deep learning models |
| Gensim | A library for topic modeling of text, which may be useful in text mining |

Of course, there are other libraries. For example, the Pillow (PIL) library can be used for image processing, OpenCV for image and video processing, and GMPY2 for high-precision arithmetic. When a specific problem comes up, it is worth searching online for a suitable library.

If you use the Anaconda distribution, many of these libraries are already included, such as NumPy, SciPy, Matplotlib, pandas and scikit-learn.

If you use another Python distribution, you need to install the relevant libraries yourself.
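
Before installing anything, it can help to check which of these libraries are already present. The following is a minimal sketch using only the standard library; the `PACKAGES` list and the `installed_versions` helper are names made up for this illustration, and the distribution names are assumed to match the ones published on PyPI.

```python
from importlib import metadata

# Distribution names as assumed to be published on PyPI.
PACKAGES = ["numpy", "scipy", "matplotlib", "pandas",
            "statsmodels", "scikit-learn", "keras", "gensim"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(PACKAGES)
for name, version in report.items():
    print(f"{name:15s} {version or 'not installed'}")
```

Anything reported as not installed can then be added with the package manager of your distribution.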

# NumPy

• Python itself does not provide an array type suited to numerical work. A list can cover basic array functionality, but it becomes very slow when the amount of data is large;
• NumPy provides true array support and functions for processing data quickly;
• NumPy is a dependency of many higher-level libraries;
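
To get a feel for the speed difference mentioned above, here is a rough sketch that squares the same numbers once with a list comprehension and once with a NumPy array. The size `n` and the helper names are arbitrary choices for this illustration, and the timings depend on the machine, so no expected numbers are given.

```python
import timeit

import numpy as np

n = 100_000
values_list = list(range(n))
values_array = np.arange(n, dtype=np.int64)  # explicit 64-bit integers to avoid overflow


def square_list():
    # Pure Python: one multiplication per loop iteration
    return [v * v for v in values_list]


def square_array():
    # NumPy: a single vectorized multiplication over the whole array
    return values_array * values_array


# Both approaches produce the same values
assert square_list() == square_array().tolist()

list_time = timeit.timeit(square_list, number=20)
array_time = timeit.timeit(square_array, number=20)
print(f"list:  {list_time:.4f} s")
print(f"array: {array_time:.4f} s")
```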

Using NumPy to manipulate arrays

import numpy as np

a = np.array([2, 0, 1, 5])  #Create array
print(a)
print(a[:3])
print(a.min())
a.sort()
print(a)
b = np.array([[1, 2, 3], [4, 5, 6]])  # Create a two-dimensional array
print(b*b)  # Square each element

Output result:
[2 0 1 5]
[2 0 1]
0
[0 1 2 5]
[[ 1  4  9]
 [16 25 36]]
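
Note that `b*b` in the output above multiplies element by element; it is not a matrix product. A short sketch of the difference, reusing the same array `b` (the variable names `elementwise` and `matrix_product` are only for illustration):

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])

# Element-wise product: same shape as b, each entry squared
elementwise = b * b

# Matrix product: (2, 3) @ (3, 2) gives a (2, 2) result
matrix_product = b @ b.T

print(elementwise)
print(matrix_product)
```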


# SciPy

SciPy covers optimization, linear algebra, integration, interpolation, fitting, special functions, fast Fourier transforms, signal processing and image processing, the solution of ordinary differential equations, and other computations commonly used in science and engineering.

from scipy.optimize import fsolve  # Import the function for solving systems of equations

def f(x):  # Define the system of equations to solve
    x1 = x[0]
    x2 = x[1]
    return [2*x1 - x2**2 - 1, x1**2 - x2 - 2]

result = fsolve(f, [1, 1])  # Solve, starting from the initial guess (1, 1)
print(result)

from scipy import integrate  # Import the integration module

def g(x):  # Define the integrand
    return (1 - x**2)**0.5

pi_2, err = integrate.quad(g, -1, 1)  # Integral value and error estimate
print(pi_2 * 2)  # The integral equals pi/2, so this prints an approximation of pi
print(err)

Output result:
[1.91963957 1.68501606]
3.1415926535897967
1.0002354500215915e-09
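
Two of the other areas listed above, linear algebra and interpolation, can be sketched in the same way. The matrix, right-hand side and sample points below are made-up values for illustration; `scipy.linalg.solve` and `scipy.interpolate.CubicSpline` are the SciPy functions used.

```python
import numpy as np
from scipy import interpolate, linalg

# Linear algebra: solve the system 3x + y = 9, x + 2y = 8
A = np.array([[3.0, 1.0], [1.0, 2.0]])
rhs = np.array([9.0, 8.0])
solution = linalg.solve(A, rhs)
print(solution)

# Interpolation: fit a cubic spline through 8 samples of sin(x)
xs = np.linspace(0, np.pi, 8)
ys = np.sin(xs)
spline = interpolate.CubicSpline(xs, ys)
estimate = float(spline(np.pi / 2))  # pi/2 lies between two sample points
print(estimate)
```

The spline estimate should be close to `sin(pi/2) = 1`, although not exactly equal to it.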


# Matplotlib

Matplotlib is the best-known plotting library for Python. It is mainly used for two-dimensional plots, and it can also produce simple three-dimensional plots.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000) #Independent variable of drawing
y = np.sin(x) + 1 #Dependent variable y
z = np.cos(x**2) + 1 #Dependent variable z

plt.figure(figsize=(8, 4))  # Set the figure size
plt.plot(x, y, label=r'$\sin x+1$', color='red', linewidth=2)  # Plot, setting the label, line color and line width
plt.plot(x, z, 'b--', label=r'$\cos x^2+1$')  # Plot, setting the label and line style
plt.xlabel('Time(s)')  # x-axis label
plt.ylabel('Volt')  # y-axis label
plt.title('A Simple Example')
plt.ylim(0, 2.2)  # y-axis range
plt.legend()  # Show the legend
plt.show()


To display Chinese text, a Chinese font needs to be specified manually, because the default font is not a Chinese font.
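
One common way to specify the font is through `rcParams`. The sketch below assumes that a font named SimHei is installed on the system; that font name is an assumption for illustration, so substitute any Chinese font you actually have. If the font is missing, Matplotlib falls back to its default and warns about missing glyphs.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Assumed font name: put SimHei first in the sans-serif fallback list
plt.rcParams["font.sans-serif"] = ["SimHei"] + list(plt.rcParams["font.sans-serif"])
# Render the minus sign as an ASCII hyphen, which CJK fonts are more likely to contain
plt.rcParams["axes.unicode_minus"] = False

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2], [-1, 0, 1])
ax.set_title("简单示例")  # a Chinese title ("simple example")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # write the image to memory instead of a file
plt.close(fig)
```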

# pandas

pandas is the most powerful data analysis and exploration tool in Python:

• It supports SQL-like operations for adding, deleting, querying and modifying data, and has a rich set of data processing functions;
• It supports time series analysis and flexible data processing;
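
The SQL-like operations and the time series support can be sketched on a small table. The `sales` data below is invented purely for illustration.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-10", "2022-01-11"]),
    "region": ["north", "south", "north", "south"],
    "amount": [100, 150, 120, 90],
})

# Query, like SQL WHERE: rows with amount above 100
large = sales[sales["amount"] > 100]

# Aggregate, like SQL GROUP BY: total amount per region
by_region = sales.groupby("region")["amount"].sum()

# Modify, like SQL UPDATE: add 10 to every amount in the south region
sales.loc[sales["region"] == "south", "amount"] += 10

# Time series: weekly totals, computed after the update above
weekly = sales.set_index("date")["amount"].resample("W").sum()

print(large)
print(by_region)
print(weekly)
```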

The basic data structures of pandas are Series and DataFrame.

import numpy as np
import pandas as pd

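s = pd.Series([1, 2, 3], index=list('abc'))  # Create a Series with an explicit index

d = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])  # Create a DataFrame with named columns

d2 = pd.DataFrame(s)  # A DataFrame can also be created from an existing Series

print(d.head())  # Preview the first rows
print(d.describe())  # Basic statistics for each column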

Output result:
   a  b  c
0  1  2  3
1  4  5  6
             a        b        c
count  2.00000  2.00000  2.00000
mean   2.50000  3.50000  4.50000
std    2.12132  2.12132  2.12132
min    1.00000  2.00000  3.00000
25%    1.75000  2.75000  3.75000
50%    2.50000  3.50000  4.50000
75%    3.25000  4.25000  5.25000
max    4.00000  5.00000  6.00000

# When reading a file, avoid Chinese characters in its storage path, otherwise the read may fail
pd.read_csv('data.csv', encoding='utf-8')  # For text-format data, use encoding to specify the file's encoding
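
Why the `encoding` argument matters can be shown with a round trip. This sketch writes a small table in GBK (an encoding often used for Chinese text) to a temporary file and then reads it back twice; the file name and the data are made up for illustration.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame({"city": ["北京", "上海"], "value": [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_gbk.csv")  # hypothetical file name
    original.to_csv(path, index=False, encoding="gbk")

    # Reading with the matching encoding recovers the data
    loaded = pd.read_csv(path, encoding="gbk")

    # Reading the same bytes as UTF-8 fails, because they are not valid UTF-8
    try:
        pd.read_csv(path, encoding="utf-8")
        wrong_encoding_failed = False
    except UnicodeDecodeError:
        wrong_encoding_failed = True

print(loaded)
print(wrong_encoding_failed)
```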



# StatsModels

pandas focuses on reading, processing and exploring data, while StatsModels focuses on statistical modeling and analysis, which gives Python something of the flavor of the R language. StatsModels can exchange data with pandas, so the two together form a powerful data mining combination in Python.

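from statsmodels.tsa.stattools import adfuller as ADF  # Import the ADF (augmented Dickey-Fuller) test
import pandas as pd
import numpy as np

# Assumed call, missing from the source: test 100 random values, which matches the 1 lag and 98 observations in the output below
print(ADF(np.random.rand(100)))
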

Output:
(-8.103123291388002, 1.2838791095546032e-12, 1, 98, {'1%': -3.4989097606014496, '5%': -2.891516256916761, '10%': -2.5827604414827157}, 30.91636795599902)


# scikit-learn

scikit-learn is a powerful machine learning toolkit for Python. It provides a comprehensive set of tools, including data preprocessing, classification, regression, clustering, prediction and model analysis.

scikit-learn depends on NumPy and SciPy; Matplotlib is needed only for its plotting features.

from sklearn.linear_model import LinearRegression #Import linear regression model

model = LinearRegression() #Establish linear regression model
print(model)


1) All models provide the interface model.fit() for training: fit(X, y) for supervised models and fit(X) for unsupervised models.

2) Supervised models provide the following interfaces:

• model.predict(X_new): predict new samples;
• model.predict_proba(X_new): predicted probabilities, available only for some models (such as logistic regression);
• model.score(): the higher the score, the better the fit;

3) Unsupervised models provide the following interfaces:

• model.transform(): convert data into the new basis learned during fitting;
• model.fit_transform(): learn a new basis from the data and convert the data according to this basis;
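
The interfaces listed above can be tried in a few lines. This is a minimal sketch that uses logistic regression as the supervised model and PCA as the unsupervised one on the built-in iris data; both model choices and the `max_iter` value are picked only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised model: fit(X, y), then predict, predict_proba and score
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
labels = clf.predict(X[:3])               # predicted class for three samples
probabilities = clf.predict_proba(X[:3])  # one probability per class and sample
accuracy = clf.score(X, y)                # mean accuracy on the training data

# Unsupervised model: fit_transform(X) learns a new basis and projects the data onto it
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(labels)
print(probabilities.shape)
print(accuracy)
print(X_2d.shape)
```

Scoring on the training data keeps the sketch short; a held-out test set gives a more honest estimate.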

scikit-learn itself ships with some example data sets for learning. The more common ones are Anderson's iris data set and the handwritten digits data set.

from sklearn import datasets  # Import the data sets module

iris = datasets.load_iris()  # Load the iris data set
print(iris.data.shape)  # Check the size of the data

from sklearn import svm  #Import SVM model

clf = svm.LinearSVC()  # Create a linear SVM classifier
clf.fit(iris.data, iris.target)  # Train the model on the data
clf.predict([[5.0, 3.6, 1.3, 0.25]])  # After training, pass in new data to get a prediction
print(clf.coef_)  # View the parameters of the trained model

Output result:
(150, 4)
[[ 0.18423149  0.45122757 -0.8079383  -0.45071932]
 [ 0.05554602 -0.9001544   0.40811885 -0.96012405]
 [-0.85077276 -0.98663003  1.3810384   1.86530666]]


# Keras

Artificial neural networks are powerful yet simple models that play an important role in fields such as language processing and image recognition. The Keras library can be used to build them. Keras is more than a simple neural network library: it is a powerful deep learning library that was originally built on Theano. It can be used to build not only ordinary neural networks but also a variety of deep learning models, such as autoencoders, recurrent neural networks, recursive neural networks and convolutional neural networks.
To use Keras, you need to install a backend package such as TensorFlow.

from keras.models import Sequential
from keras.layers import Input, Dense, Activation
from keras.optimizers import SGD

model = Sequential()  # Initialize the model
model.add(Input(shape=(20,)))  # Input layer (20 nodes)
model.add(Dense(64))  # Connect the input layer to the first hidden layer (64 nodes)
model.add(Activation('tanh'))  # The first hidden layer uses tanh as its activation function
model.add(Dense(64))  # Connect the first hidden layer to the second hidden layer (64 nodes)
model.add(Dense(1))  # Connect the second hidden layer to the output layer (1 node)
model.add(Activation('sigmoid'))  # The output layer uses sigmoid as its activation function

# Define the optimizer; the decay argument of older Keras versions is omitted here
sgd = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)  # Compile the model with mean squared error as the loss function

# X_train, y_train, X_test and y_test are not defined here and must be supplied by the reader
model.fit(X_train, y_train, epochs=20, batch_size=16)  # Train the model
score = model.evaluate(X_test, y_test, batch_size=16)  # Test the model
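
The training code above uses `X_train`, `y_train`, `X_test` and `y_test` without defining them. As a stand-in for real data, the following sketch builds random arrays with the shapes that network expects: 20 input features and one output between 0 and 1. The sample count, the seed, the labelling rule and the 80/20 split are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 20  # 20 features to match the 20-node input layer

# Synthetic inputs drawn from a standard normal distribution
X = rng.normal(size=(n_samples, n_features)).astype("float32")

# Hypothetical labelling rule: class 1 when the feature sum is positive
y = (X.sum(axis=1) > 0).astype("float32")

# Hold out the last 20% of the samples for testing
split = int(0.8 * n_samples)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```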



# Gensim

Gensim is used for language-processing tasks such as text similarity calculation, LDA and Word2Vec. Tasks in this field often require more background knowledge.

import gensim, logging

logging.basicConfig(format='%(asctime)s: %(levelname)s : %(message)s', level=logging.INFO)
# logging is used to output the training log
# Split each sentence into words and pass it in as a list of words
sentences = [['first', 'sentence'], ['second', 'sentence']]
# Train a word vector model on the sentences above
model = gensim.models.Word2Vec(sentences, min_count=1)
print(model.wv['sentence'])  # Output the word vector of the word 'sentence'
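
Similarity between word vectors is usually measured with cosine similarity. The sketch below computes it by hand with NumPy on three made-up, three-dimensional vectors; these are not vectors produced by Gensim (real Word2Vec vectors are much longer), and the `cosine_similarity` helper is written only for this illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical word vectors, invented for this example
vectors = {
    "first": np.array([1.0, 0.0, 1.0]),
    "second": np.array([1.0, 0.1, 0.9]),
    "sentence": np.array([-1.0, 1.0, 0.0]),
}

close_pair = cosine_similarity(vectors["first"], vectors["second"])
far_pair = cosine_similarity(vectors["first"], vectors["sentence"])

print(close_pair)  # near 1: the two vectors point in almost the same direction
print(far_pair)    # negative: the two vectors point in rather different directions
```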

