Before forecasting a time series, we need to run a few tests on the data, mainly to check its stationarity and its randomness (the white noise test). This article introduces two of them: the ADF test and the Ljung-Box test.
ADF test
The ADF (Augmented Dickey-Fuller) test is a unit root test: it checks whether the series contains a unit root. The null hypothesis is that a unit root is present. A series with a unit root is non-stationary, and using such a series in regression analysis can lead to spurious regression.
The Python code for the ADF test is given below:
from statsmodels.tsa.stattools import adfuller
import pandas as pd
import numpy as np

# Yearly observations for 1978-2010 (33 values)
data = pd.Series(
    [151.0, 188.46, 199.38, 219.75, 241.55, 262.58, 328.22, 396.26,
     442.04, 517.77, 626.52, 717.08, 824.38, 913.38, 1088.39, 1325.83,
     1700.92, 2109.38, 2499.77, 2856.47, 3114.02, 3229.29, 3545.39,
     3880.53, 4212.82, 4757.45, 5633.24, 6590.19, 7617.47, 9333.4,
     11328.92, 12961.1, 15967.61],
    index=np.arange(1978, 2011),
)

# Run the ADF test with the default settings and print the result tuple
result = adfuller(data)
print(result)
(-0.04391111656553118, 0.9547464774274733, 10, 22, {'1%': -3.769732625845229, '5%': -3.005425537190083, '10%': -2.6425009917355373}, 291.54354258641223)
The results are analyzed as follows:
-0.04391111656553118 is the ADF test statistic (the t value for short).
0.9547464774274733 is the p-value corresponding to the test statistic.
10 is the number of lags used in the test.
22 is the number of observations used for the ADF regression and for calculating the critical values.
The fifth element holds the critical values of the test statistic at the 1%, 5% and 10% significance levels.
291.54354258641223 is the best (maximized) information criterion value found when the lag length is selected automatically.
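To keep the six fields apart, the tuple can be given names. The following is only a small sketch using the standard library; the field names (`statistic`, `used_lags`, and so on) are labels chosen here for illustration and are not identifiers from statsmodels.

```python
from collections import namedtuple

# Illustrative field names (an assumption of this sketch, not statsmodels names)
ADFResult = namedtuple(
    "ADFResult",
    ["statistic", "pvalue", "used_lags", "n_obs", "critical_values", "ic_best"],
)

# The tuple printed above, copied verbatim
raw = (-0.04391111656553118, 0.9547464774274733, 10, 22,
       {'1%': -3.769732625845229, '5%': -3.005425537190083,
        '10%': -2.6425009917355373}, 291.54354258641223)

adf = ADFResult(*raw)
print(f"statistic = {adf.statistic:.4f}, p-value = {adf.pvalue:.4f}")
print(f"lags used = {adf.used_lags}, observations used = {adf.n_obs}")
for level, value in adf.critical_values.items():
    print(f"critical value at {level}: {value:.4f}")
```

As a sanity check, 10 lags + 22 observations + 1 = 33, which is the length of the series above. This is what one would expect if each lag, plus the initial differencing step, uses up one observation.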
First, -0.04391111656553118 is greater than the critical values at all three significance levels, so the null hypothesis of a unit root cannot be rejected.
Second, to reject the unit root the p-value must be below the chosen significance level (usually 0.05), and the smaller it is, the stronger the evidence. Here the p-value is 0.9547464774274733, far above 0.05, so again the unit root cannot be rejected.
In summary, this series cannot be regarded as stationary.
For comparison, the result for a stationary series is given below:
(-4.924087490679005, 3.129856642757301e-05, 19, 636, {'1%': -3.4406737255613256, '5%': -2.866095119842903, '10%': -2.5691958123689727}, 14356.744057311003)
Here the t value is below the critical values at all three significance levels, and the p-value is far below 0.05 (close to 0), so the null hypothesis of a unit root is rejected and the series is treated as stationary.
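The decision rule used in both examples can be written as a short function. This is a minimal sketch, not part of statsmodels: the function name `adf_is_stationary` and its defaults (the 5% critical value and a 0.05 significance level) are choices made here for illustration.

```python
def adf_is_stationary(result, level="5%", alpha=0.05):
    """Apply the two checks from the text to an adfuller-style result tuple.

    Hypothetical helper: returns True when the unit-root null hypothesis
    is rejected, i.e. the statistic is below the critical value at `level`
    and the p-value is below `alpha`.
    """
    statistic, pvalue, _, _, critical_values, _ = result
    below_critical = statistic < critical_values[level]
    small_pvalue = pvalue < alpha
    return below_critical and small_pvalue


# The two result tuples shown in the text, copied verbatim
non_stationary = (-0.04391111656553118, 0.9547464774274733, 10, 22,
                  {'1%': -3.769732625845229, '5%': -3.005425537190083,
                   '10%': -2.6425009917355373}, 291.54354258641223)
stationary = (-4.924087490679005, 3.129856642757301e-05, 19, 636,
              {'1%': -3.4406737255613256, '5%': -2.866095119842903,
               '10%': -2.5691958123689727}, 14356.744057311003)

print(adf_is_stationary(non_stationary))  # False: unit root not rejected
print(adf_is_stationary(stationary))      # True: unit root rejected
```

The two checks normally agree, since the p-value is derived from the same statistic; testing both simply mirrors the reasoning in the text.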
Ljung-Box test
The Ljung-Box test (LB test for short) is a randomness test. It checks whether the autocorrelations of the series up to lag m are jointly significant, in other words whether the series is white noise. Under the null hypothesis of white noise, the Q statistic approximately follows a chi-square distribution with m degrees of freedom. If the data are white noise, there is no information left to extract and no need to continue the analysis.
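To make the Q statistic concrete, here is a minimal pure-Python sketch. It assumes the standard definition Q = n(n+2) * sum over k = 1..m of r_k^2 / (n - k), where r_k is the sample autocorrelation at lag k. The function name `ljung_box_q` and the toy series are made up for illustration; in practice the statsmodels function should be used.

```python
def ljung_box_q(x, m):
    """Ljung-Box Q statistic for lags 1..m (illustrative implementation)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squared deviations
    total = 0.0
    for k in range(1, m + 1):
        # Sample autocorrelation at lag k
        ck = sum(dev[t] * dev[t - k] for t in range(k, n))
        rho_k = ck / c0
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total


# Toy series with strong negative lag-1 autocorrelation
x = [1, -1, 1, -1, 1, -1]
print(ljung_box_q(x, 1))  # about 6.667 (exactly 20/3)
```

For this toy series the lag-1 autocorrelation is -5/6, so Q = 6 * 8 * (25/36) / 5 = 20/3. That is above 3.841, the 5% critical value of the chi-square distribution with 1 degree of freedom, so even this tiny series would not look like white noise.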
The Python code for the Ljung-Box test is given below:
from statsmodels.stats.diagnostic import acorr_ljungbox as lb_test

# `data` here is the author's own series
result = lb_test(data, lags=20)
print(result)
        lb_stat      lb_pvalue
1    471.099659  1.847036e-104
2    899.481638  4.786785e-196
3   1347.384204  7.695651e-292
4   1791.734228   0.000000e+00
5   2207.199800   0.000000e+00
6   2674.155719   0.000000e+00
7   3242.923906   0.000000e+00
8   3686.776794   0.000000e+00
9   4069.902008   0.000000e+00
10  4474.462678   0.000000e+00
11  4865.867510   0.000000e+00
12  5234.470249   0.000000e+00
13  5641.097308   0.000000e+00
14  6133.124076   0.000000e+00
15  6518.637784   0.000000e+00
16  6846.243758   0.000000e+00
17  7193.271970   0.000000e+00
18  7526.968985   0.000000e+00
19  7836.234889   0.000000e+00
20  8179.147428   0.000000e+00
The results are analyzed as follows:
We mainly look at the p-values in the second column. lags is the number of lags tested; it is commonly set to a moderate fixed value such as 20. Every p-value here is below 0.05 (most are so small that they print as 0), so the null hypothesis of white noise is rejected. The data contain structure worth modelling and can be analyzed further.
Conversely, if the p-values are greater than 0.05, the null hypothesis cannot be rejected, and the series can be treated as white noise, i.e. a purely random sequence.
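The p-values in the table can be reproduced by hand from the Q statistics for the first two lags, where the chi-square tail probability has a simple closed form. The sketch below uses only the standard library; the helper names `chi2_sf_df1` and `chi2_sf_df2` are made up here, and for general degrees of freedom one would use a statistics library instead.

```python
import math


def chi2_sf_df1(q):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(q / 2.0))


def chi2_sf_df2(q):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-q / 2.0)


# Q statistics for lags 1 and 2, copied from the table above
q1, q2 = 471.099659, 899.481638
p1, p2 = chi2_sf_df1(q1), chi2_sf_df2(q2)
print(p1)  # about 1.847e-104, matching lb_pvalue at lag 1
print(p2)  # about 4.787e-196, matching lb_pvalue at lag 2

alpha = 0.05
print(p1 < alpha and p2 < alpha)  # True: white noise is rejected at both lags
```

This also explains the zeros further down the table: from lag 4 onward the tail probability is smaller than the smallest number a double-precision float can represent, so it prints as 0.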