Graduation Project - Big Data Analysis: Stock Forecasting and Analysis Based on Time Series

Posted by oskom on Sat, 18 Dec 2021 13:08:23 +0100

1 Introduction

Hi everyone, this is senior student Dan Cheng. Today I'd like to introduce a big data project:

Big data analysis: Stock Forecasting and Analysis Based on Time Series

You can use it for your graduation project.

For technical solutions, thesis proposal guidance, and help completing the design:
 Design help: QQ 746876041

2 Origin of Time Series Models

When it comes to time series analysis, we have to talk about the AR/MA/ARMA/ARIMA family of models. What these four methods have in common is that, instead of decomposing a series into components, they try to extract a quantitative relationship between earlier and later values from the series itself, building a model in which the earlier data are the independent variables and the later data the dependent variable, so that it can be used for prediction. A popular metaphor: the you of the day before yesterday and the you of yesterday created the you of today.

2.1 Names of the Four Models

  • AR model: Autoregressive model;
  • MA model: Moving Average model;
  • ARMA model: Autoregressive Moving Average model;
  • ARIMA model: Autoregressive Integrated Moving Average model.

AR model:

If any value of a time series can be expressed by the following regression equation, the time series follows an autoregressive process of order p, denoted AR(p):

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + \varepsilon_t$$

where $\varepsilon_t$ is white noise.
The AR model uses the correlation between earlier and later values (autocorrelation) to build a regression equation relating them, which is why it is called an autoregressive process. White noise deserves a word of explanation here: it can be understood as random fluctuation around the series' values, and these random fluctuations sum to zero. For example, an automated biscuit production line is supposed to put 500 grams in every package, but because of random factors in production the actual weight is never exactly 500 g; it fluctuates around 500 g, and the fluctuations cancel each other out, summing to zero.
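The original post has no code at this point; below is a minimal sketch, assuming statsmodels is available, that simulates an AR(2) process and recovers its coefficients. The values 0.6 and 0.3 are illustrative only.

import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.ar_model import AutoReg

#Simulate an AR(2) process: x_t = 0.6*x_{t-1} + 0.3*x_{t-2} + white noise
#ArmaProcess expects the lag polynomial 1 - 0.6L - 0.3L^2, hence the signs
np.random.seed(0)
x = ArmaProcess(np.array([1, -0.6, -0.3]), np.array([1])).generate_sample(nsample=500)

#Fit an AR(2) model; the estimated coefficients should land near 0.6 and 0.3
print(AutoReg(x, lags=2).fit().params)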

3 Data Preview

import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline

#Prepare two lists
list1 = [6,4,8]
list2 = [8,6,10]

#Convert list1 and list2 to a Series each
list1_series = pd.Series(list1) 
print(list1_series)
list2_series = pd.Series(list2) 
print(list2_series)

#Combine the two Series into a DataFrame with columns 'Col A' and 'Col B'
frame = { 'Col A': list1_series, 'Col B': list2_series } 
result = pd.DataFrame(frame)

result.plot()
plt.show()

4 Theoretical Formulas

4.1 Covariance

Let's first look at the covariance formula:

$$\mathrm{Cov}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big]$$

4.2 Correlation Coefficient

Once the covariance is computed, the correlation coefficient can be calculated. Its value lies between -1 and 1: the closer to 1, the stronger the positive correlation; the closer to -1, the stronger the negative correlation; 0 means no correlation.
The formula is as follows:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
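As a small check (not in the original post), both formulas can be verified with NumPy, reusing the list1/list2 values from section 3 (there y = x + 2, so the two series are perfectly correlated):

import numpy as np
x = np.array([6, 4, 8], dtype=float)
y = np.array([8, 6, 10], dtype=float)

#Sample covariance: average product of the deviations from the means
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
print(cov_xy, np.cov(x, y)[0, 1])            #the two values agree: 4.0

#Correlation coefficient: covariance scaled by both standard deviations
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
print(r, np.corrcoef(x, y)[0, 1])            #both are 1.0 here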

4.3 Visualizing Correlation with seaborn

import seaborn as sns
iris = sns.load_dataset('iris')   #load the classic iris dataset

#Matrix plot of the pairwise relationships between features
sns.pairplot(iris, hue='species', height=3, aspect=1)

Andrews curves are a way to visualize multidimensional data by mapping each observation to a function: each multivariate observation becomes a curve whose shape is determined by using the feature values as coefficients of a Fourier series. This is very useful for detecting outliers in time series data.

plt.subplots(figsize = (10,8))
pd.plotting.andrews_curves(iris, 'species', colormap='cool')


Taking the classic iris dataset as an example: setosa, versicolor, and virginica are three species of iris. You can see that the features overlap in places, yet still separate the species to a certain degree.

#Finally, use a heat map to show the correlation between features in the dataset;
#high positive or negative values indicate that the features are highly correlated:

fig = plt.gcf()
fig.set_size_inches(10, 6)
ax = sns.heatmap(iris.drop(columns='species').corr(), annot=True, cmap='GnBu',
                 linewidths=1, linecolor='k', square=True, mask=False,
                 vmin=-1, vmax=1, cbar=True,
                 cbar_kws={"orientation": "vertical"})

5 Time Series Analysis of Financial Data

This section mainly covers: calculating changes in a time series, resampling, and window functions.

5.1 Data Overview

import pandas as pd

tm = pd.read_csv('/home/kesci/input/gupiao_us9955/Close.csv')
tm.head()

Meaning of each indicator in the data:

  • AAPL.O | Apple Stock
  • MSFT.O | Microsoft Stock
  • INTC.O | Intel Stock
  • AMZN.O | Amazon Stock
  • GS.N | Goldman Sachs Stock
  • SPY | SPDR S&P 500 ETF Trust
  • .SPX | S&P 500 Index
  • .VIX | VIX Volatility Index
  • EUR= | EUR/USD Exchange Rate
  • XAU= | Gold Price
  • GDX | VanEck Vectors Gold Miners ETF
  • GLD | SPDR Gold Trust

Trend of each price (or index) over the eight-year period:
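The original shows this chart as an image; a minimal sketch that reproduces it (each series keeps its own scale):

#Plot every column in its own subplot
tm.plot(subplots=True, figsize=(10, 12))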

5.2 Calculating Changes in the Series

  • Calculate the day-over-day difference of each indicator (each day's value minus the previous day's)
  • Calculate pct_change, the growth rate, i.e. (current value - previous value) / previous value
  • Calculate the average pct_change for each indicator
  • Plot it to see which indicator has the highest average growth rate
  • Calculate the growth rate over a continuous period (this requires the difference between today's and yesterday's price)

Calculate the day-over-day difference of each indicator (each day's value minus the previous day's):
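The output appears as an image in the original; pandas computes this directly with diff():

#Day-over-day difference: current value minus the previous day's value
tm.diff().head()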

Calculate pct_change, the growth rate, i.e. (current value - previous value) / previous value:
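Again shown as an image in the original; pct_change() is built into pandas:

#Daily growth rate: (current - previous) / previous
tm.pct_change().head().round(3)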

Calculate the average pct_change for each indicator and plot it to see which has the highest average growth rate:
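A one-liner reproduces the bar chart from the original (shown there as an image):

#Average daily growth rate per indicator, as a bar chart for comparison
tm.pct_change().mean().plot(kind='bar', figsize=(10, 6))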


Apart from the volatility index (.VIX), which has the highest growth rate of all, Amazon's stock price grew the fastest! Bezos is the strongest bald guy in the universe.

Calculate the growth rate over a continuous period (this requires the difference between today's and yesterday's price):

import numpy as np

#Shift the data down one row, so each row holds the previous day's values
tm.shift(1).head()

#Calculate the growth rate as log returns
rets = np.log(tm / tm.shift(1))
print(rets.tail().round(3))

#A small cumsum example:
print('Small cumsum example:', np.cumsum([1,2,3,4]))

#To accumulate the log growth rates, take cumsum and restore with e^x
rets.cumsum().apply(np.exp).plot(figsize=(10,6))


The plot above shows the growth over a continuous period: one yuan in 2010 had become more than ten yuan by 2018 (taking Amazon as an example).
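The resampling and window functions promised at the start of this section don't appear in the demo above; here is a minimal sketch, assuming the CSV's first column holds the trading dates:

#Re-read the data with a DatetimeIndex so time-based operations work
tm = pd.read_csv('/home/kesci/input/gupiao_us9955/Close.csv', index_col=0, parse_dates=True)

#Resampling: month-end closing values
print(tm.resample('M').last().head())

#Window function: 20-day rolling mean as a simple smoothing of Apple's close
tm['AAPL.O'].rolling(window=20).mean().plot(figsize=(10, 6))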

(To be continued. This project's prediction part is a demo; students who need it can contact the senior to complete it.)

Finally

Technical solutions
 Design help: QQ 746876041


Follow the senior's official account to get more material on completing the project:

Topics: Big Data