Kaggle time series tutorial 3 - seasonality

Posted by roach on Tue, 15 Feb 2022 04:39:48 +0100

What is seasonality?

We say that a time series exhibits seasonality whenever there is a regular, periodic change in the mean of the series. Seasonal changes generally follow the clock and calendar - repetitions over a day, a week, or a year are common. Seasonality is often driven by the cycles of the natural world over days and years, or by conventions of social behavior surrounding dates and times.

Seasonality in four time series.

We will learn two kinds of features that model seasonality. The first kind, indicators, is best for a season with few observations, like a weekly season of daily observations. The second kind, Fourier features, is best for a season with many observations, like an annual season of daily observations.

Seasonal plots and seasonal indicators

Just as we used a moving average plot to discover the trend in a series, we can use a seasonal plot to discover seasonal patterns.

A seasonal plot shows segments of the time series plotted against some common period, the period being the "season" you want to observe. The figure shows a seasonal plot of the daily views of Wikipedia's article on Trigonometry: the article's daily views are plotted over a common weekly period.

This series shows a clear weekly seasonal pattern: views are higher on weekdays and fall off towards the weekend.
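
The idea behind a seasonal plot can also be checked numerically. As a minimal sketch, assuming a made-up daily series (the values, the dates, and the names views, segments and mean_by_day are illustrative, not the Wikipedia data), the following cuts a series into weekly segments and averages them by day of the week:

```python
import numpy as np
import pandas as pd

# Hypothetical data: four full weeks of daily values, starting on a Monday
index = pd.date_range("2016-01-04", periods=28, freq="D")
day = np.asarray(index.dayofweek)          # Monday=0, ..., Sunday=6
values = np.where(day < 5, 100.0, 60.0)    # made-up weekday and weekend levels
views = pd.Series(values, index=index, name="views")

frame = views.to_frame()
frame["day"] = day                            # position within the season (x-axis)
frame["week"] = np.arange(len(frame)) // 7    # which segment the row belongs to

# One column per weekly segment, one row per day of the week
segments = frame.pivot(index="day", columns="week", values="views")

# Averaging across segments gives the typical level for each day
mean_by_day = segments.mean(axis=1)
```

Each column of segments corresponds to one line in a seasonal plot, and mean_by_day summarizes the pattern those lines share.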

Seasonal indicators

Seasonal indicators are binary features that represent seasonal differences in the level of a time series. Seasonal indicators are what you get if you treat a seasonal period as a categorical feature and apply one-hot encoding.

By one-hot encoding the days of the week, we get weekly seasonal indicators. Creating weekly indicators for the Trigonometry series gives us six new "dummy" features.
(Linear regression works best if you drop one of the indicators; we chose to drop Monday in the table below.)

Date	Tuesday	Wednesday	Thursday	Friday	Saturday	Sunday
2016-01-04	0.0	0.0	0.0	0.0	0.0	0.0
2016-01-05	1.0	0.0	0.0	0.0	0.0	0.0
2016-01-06	0.0	1.0	0.0	0.0	0.0	0.0
2016-01-07	0.0	0.0	1.0	0.0	0.0	0.0
2016-01-08	0.0	0.0	0.0	1.0	0.0	0.0
2016-01-09	0.0	0.0	0.0	0.0	1.0	0.0
2016-01-10	0.0	0.0	0.0	0.0	0.0	1.0
2016-01-11	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...
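
A table like this one can be produced with pandas. The following is a sketch only, assuming a daily DatetimeIndex; the helper name weekly_indicators is hypothetical and not part of the original lesson:

```python
import pandas as pd

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekly_indicators(index):
    """One-hot encode the day of the week, dropping Monday as the baseline."""
    day_names = pd.Series(index.day_name(), index=index)
    dummies = pd.get_dummies(day_names, dtype=float)
    # Keep Tuesday..Sunday in calendar order; a day that never occurs
    # in the index becomes a column of zeros.
    return dummies.reindex(columns=DAY_NAMES[1:], fill_value=0.0)


dates = pd.date_range("2016-01-04", periods=8, freq="D")
indicators = weekly_indicators(dates)
```

Mondays end up as rows of all zeros, which is what makes Monday the baseline level for the regression.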

Adding seasonal indicators to the training data helps the model distinguish the means within a seasonal period:

Ordinary linear regression learns the mean value at each time in the season.

Indicators act as on/off switches. At any time, at most one of these indicators can have a value of 1 (on). Linear regression learns a baseline value of 2379 for Monday and then adjusts it by the coefficient of whichever indicator is on for that day; the other indicators are 0 and contribute nothing.
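
To make the "baseline plus adjustment" reading concrete, here is a small sketch on made-up numbers (the levels below are invented for illustration and are unrelated to the 2379 in the text). It uses plain least squares from NumPy rather than scikit-learn, which is an assumption of this example:

```python
import numpy as np

# Hypothetical level for each day of the week, Monday first
level_by_day = np.array([100.0, 110.0, 120.0, 115.0, 105.0, 60.0, 50.0])
y = np.tile(level_by_day, 4)       # four identical weeks
day = np.arange(len(y)) % 7        # 0 = Monday

# Design matrix: a constant column plus indicators for Tuesday..Sunday
columns = [np.ones(len(y))] + [(day == d).astype(float) for d in range(1, 7)]
X = np.column_stack(columns)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
baseline = coef[0]    # the level learned for Monday
offsets = coef[1:]    # adjustments for Tuesday..Sunday relative to Monday
```

Here baseline recovers the Monday level, and baseline + offsets recovers the levels of the other six days.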

Fourier features and the periodogram

The kind of feature we discuss now is better suited to long seasons with many observations, where indicators would be impractical. (Recall that the weekly indicators above already added six features for a seven-day season; a longer season would need many more.) Instead of creating a feature for each date, Fourier features try to capture the overall shape of the seasonal curve with just a few features.

Let's take a look at a plot of the annual season in Trigonometry. Notice the repetitions at various frequencies: a long up-and-down movement three times a year, short weekly movements 52 times a year, and perhaps others.

Annual seasonality in the Trigonometry series.

It is these frequencies within a season that we attempt to capture with Fourier features. The idea is to include in our training data periodic curves having the same frequencies as the season we are trying to model. The curves we use are those of the trigonometric functions sine and cosine.

Fourier features are pairs of sine and cosine curves, one pair for each potential frequency in the season, starting with the longest. Fourier pairs modeling annual seasonality would have frequencies of once per year, twice per year, three times per year, and so on.

The first two Fourier pairs for annual seasonality. Top: frequency of once per year. Bottom: frequency of twice per year.

If we add a set of these sine / cosine curves to our training data, the linear regression algorithm will figure out the weights that fit the seasonal component in the target series. The figure illustrates how linear regression used four Fourier pairs to model the annual seasonality in the Trigonometry series.

Top: curves for four Fourier pairs, each a sum of sine and cosine weighted by regression coefficients. Each curve models a different frequency. Bottom: the sum of these curves approximates the seasonal pattern.

Notice that we only needed eight features (four sine / cosine pairs) to get a good estimate of the annual seasonality. Compare this to the seasonal indicator method, which would have required hundreds of features (one for each day of the year). By modeling only the "main effect" of the seasonality with Fourier features, we usually need to add far fewer features to the training data, which means reduced computation time and less risk of overfitting.

Choosing Fourier features with the periodogram

How many Fourier pairs should we actually include in our feature set? We can answer this question with the periodogram. The periodogram tells you the strength of the frequencies in a time series. Specifically, the value on the y-axis of the graph is (a ** 2 + b ** 2) / 2, where a and b are the coefficients of the sine and cosine at that frequency (as in the Fourier components figure above).

Periodogram for the Trigonometry series.

From left to right, the periodogram drops off after Quarterly, four times a year. That is why we chose four Fourier pairs to model the annual season. We ignore the Weekly frequency because it is better modeled with indicators.
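
The y-axis value (a ** 2 + b ** 2) / 2 can be reproduced by hand for a single frequency. The sketch below is an illustration on a synthetic signal, not the Trigonometry data: it assumes exactly one year of 365 daily observations and made-up coefficients 3 and 4, fits one sine / cosine pair by least squares, and compares the result with the variance of the signal:

```python
import numpy as np

n = 365                     # one hypothetical year of daily observations
t = np.arange(n)
k = 2 * np.pi * t / n       # angle for a once-per-year frequency

# Synthetic seasonal signal with made-up coefficients a=3, b=4
y = 3.0 * np.sin(k) + 4.0 * np.cos(k)

# Regress the signal on the sine / cosine pair to recover a and b
X = np.column_stack([np.sin(k), np.cos(k)])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

strength = (a ** 2 + b ** 2) / 2   # periodogram value at this frequency
variance = np.var(y)               # variance contributed by this frequency
```

For this signal both strength and variance come out to 12.5, which is why the periodogram's y-axis can be read as the variance explained by each frequency.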

Computing Fourier features (optional)

Knowing how Fourier features are computed isn't essential to using them, but seeing the details can clarify things. The code below illustrates how a set of Fourier features could be derived from the index of a time series. (In our applications, however, we will use a library function from statsmodels.)

import numpy as np
import pandas as pd


def fourier_features(index, freq, order):
    time = np.arange(len(index), dtype=np.float32)
    k = 2 * np.pi * (1 / freq) * time
    features = {}
    for i in range(1, order + 1):
        features.update({
            f"sin_{freq}_{i}": np.sin(i * k),
            f"cos_{freq}_{i}": np.cos(i * k),
        })
    return pd.DataFrame(features, index=index)


# Compute Fourier features to the 4th order (8 new features) for a
# series y with daily observations and annual seasonality. The function
# expects the index of the series, not the series itself:
#
# fourier_features(y.index, freq=365.25, order=4)

Example - Tunnel Traffic

We will continue to use the Tunnel Traffic dataset. This hidden cell loads data and defines two functions: seasonal_plot and plot_periodogram.

from pathlib import Path
from warnings import simplefilter

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

simplefilter("ignore")

# Set Matplotlib defaults
# (Matplotlib 3.6 and later name this style "seaborn-v0_8-whitegrid")
plt.style.use("seaborn-whitegrid")
plt.rc("figure", autolayout=True, figsize=(11, 5))
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)
%config InlineBackend.figure_format = 'retina'


# annotations: https://stackoverflow.com/a/49238256/5769929
def seasonal_plot(X, y, period, freq, ax=None):
    if ax is None:
        _, ax = plt.subplots()
    palette = sns.color_palette("husl", n_colors=X[period].nunique(),)
    ax = sns.lineplot(
        x=freq,
        y=y,
        hue=period,
        data=X,
        ci=False,
        ax=ax,
        palette=palette,
        legend=False,
    )
    ax.set_title(f"Seasonal Plot ({period}/{freq})")
    for line, name in zip(ax.lines, X[period].unique()):
        y_ = line.get_ydata()[-1]
        ax.annotate(
            name,
            xy=(1, y_),
            xytext=(6, 0),
            color=line.get_color(),
            xycoords=ax.get_yaxis_transform(),
            textcoords="offset points",
            size=14,
            va="center",
        )
    return ax


def plot_periodogram(ts, detrend='linear', ax=None):
    from scipy.signal import periodogram
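    # Sampling frequency in observations per year. Recent pandas versions
    # reject the ambiguous "1Y" unit, so the year is spelled as 365 days.
    fs = pd.Timedelta("365D") / pd.Timedelta("1D")
    frequencies, spectrum = periodogram(
        ts,
        fs=fs,
        detrend=detrend,
        window="boxcar",
        scaling='spectrum',
    )
    if ax is None:
        _, ax = plt.subplots()
    ax.step(frequencies, spectrum, color="purple")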
    ax.set_xscale("log")
    ax.set_xticks([1, 2, 4, 6, 12, 26, 52, 104])
    ax.set_xticklabels(
        [
            "Annual (1)",
            "Semiannual (2)",
            "Quarterly (4)",
            "Bimonthly (6)",
            "Monthly (12)",
            "Biweekly (26)",
            "Weekly (52)",
            "Semiweekly (104)",
        ],
        rotation=30,
    )
    ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))
    ax.set_ylabel("Variance")
    ax.set_title("Periodogram")
    return ax


data_dir = Path("../input/ts-course-data")
tunnel = pd.read_csv(data_dir / "tunnel.csv", parse_dates=["Day"])
tunnel = tunnel.set_index("Day").to_period("D")

Let's take a look at seasonal plots over a week and over a year.

X = tunnel.copy()

# days within a week
X["day"] = X.index.dayofweek  # the x-axis (freq)
X["week"] = X.index.week  # the seasonal period (period)

# days within a year
X["dayofyear"] = X.index.dayofyear
X["year"] = X.index.year
fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(11, 6))
seasonal_plot(X, y="NumVehicles", period="week", freq="day", ax=ax0)
seasonal_plot(X, y="NumVehicles", period="year", freq="dayofyear", ax=ax1);

Now let's look at the periodogram:

plot_periodogram(tunnel.NumVehicles);

The periodogram agrees with the seasonal plots above: a strong weekly season and a weaker annual season. We'll model the weekly season with indicators and the annual season with Fourier features. Reading from left to right, the periodogram falls off between Bimonthly (6) and Monthly (12), so let's use 10 Fourier pairs.

We'll create our seasonal features using DeterministicProcess, the same utility we used in Lesson 2 to create trend features. To use two seasonal periods (weekly and annual), we'll need to instantiate one of them as an "additional term":

from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

fourier = CalendarFourier(freq="A", order=10)  # 10 sin/cos pairs for "A"nnual seasonality

dp = DeterministicProcess(
    index=tunnel.index,
    constant=True,               # dummy feature for bias (y-intercept)
    order=1,                     # trend (order 1 means linear)
    seasonal=True,               # weekly seasonality (indicators)
    additional_terms=[fourier],  # annual seasonality (fourier)
    drop=True,                   # drop terms to avoid collinearity
)

X = dp.in_sample()  # create features for dates in tunnel.index

With our feature set created, we're ready to fit the model and make predictions. We'll add a 90-day forecast to see how our model extrapolates beyond the training data. The code here is the same as that in earlier lessons.

y = tunnel["NumVehicles"]

model = LinearRegression(fit_intercept=False)
_ = model.fit(X, y)

y_pred = pd.Series(model.predict(X), index=y.index)
X_fore = dp.out_of_sample(steps=90)
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)

ax = y.plot(color='0.25', style='.', title="Tunnel Traffic - Seasonal Forecast")
ax = y_pred.plot(ax=ax, label="Seasonal")
ax = y_fore.plot(ax=ax, label="Seasonal Forecast", color='C3')
_ = ax.legend()
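
Forecasting beyond the training data works because every feature here is a deterministic function of the date, so it can be computed for dates the model has never seen. The following is a minimal sketch of that idea using NumPy only; the helper make_features and the toy series are hypothetical, and the lesson itself relies on DeterministicProcess.out_of_sample instead:

```python
import numpy as np


def make_features(t):
    """Constant, linear trend, and Tuesday..Sunday indicators for time steps t."""
    day = t % 7  # assumes time step 0 falls on a Monday
    cols = [np.ones(len(t)), t.astype(float)]
    cols += [(day == d).astype(float) for d in range(1, 7)]
    return np.column_stack(cols)


# Made-up training series: an upward trend with a dip on weekends
t_train = np.arange(56)
y_train = 100.0 + 0.5 * t_train - 30.0 * (t_train % 7 >= 5)

coef, *_ = np.linalg.lstsq(make_features(t_train), y_train, rcond=None)

# Build the same features for the 14 steps after the training data
t_fore = np.arange(56, 70)
y_fore = make_features(t_fore) @ coef
```

Because the toy series is exactly a trend plus a weekly pattern, the forecast continues both components past the end of the training window.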

There is still more we can do with time series to improve our forecasts. In the next lesson, we will learn how to use the time series itself as a feature. Using a time series as an input to a forecast lets us model another component often found in series: cycles.

It's your turn

Create seasonal features for Store Sales and extend these techniques to capture holiday effects.
