Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Path: blob/main/examples/autoformer-transformers-are-effective.ipynb
Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)
Introduction
A few months ago, we introduced the Informer model (Zhou, Haoyi, et al., 2021), which is a Time Series Transformer that won the AAAI 2021 best paper award. We also provided an example for multivariate probabilistic forecasting with Informer. In this post, we discuss the question: Are Transformers Effective for Time Series Forecasting? (AAAI 2023). As we will see, they are.
Firstly, we will provide empirical evidence that Transformers are indeed effective for time series forecasting. Our comparison shows that the simple linear model, known as DLinear, is not better than Transformers as claimed. When compared against equivalent-sized models in the same setting as the linear models, the Transformer-based models perform better on the test set metrics we consider. Afterwards, we will introduce the Autoformer model (Wu, Haixu, et al., 2021), which was published at NeurIPS 2021 after the Informer model. The Autoformer model is now available in 🤗 Transformers. Finally, we will discuss the DLinear model, which is a simple feedforward network that uses the decomposition layer from Autoformer. The DLinear model was first introduced in Are Transformers Effective for Time Series Forecasting? and was claimed to outperform Transformer-based models in time-series forecasting.
Let's go!
Benchmarking - Transformers vs. DLinear
In the paper Are Transformers Effective for Time Series Forecasting?, published recently in AAAI 2023, the authors claim that Transformers are not effective for time series forecasting. They compare the Transformer-based models against a simple linear model, which they call DLinear. The DLinear model uses the decomposition layer from the Autoformer model, which we will introduce later in this post. The authors claim that the DLinear model outperforms the Transformer-based models in time-series forecasting. Is that so? Let's find out.
Dataset | Autoformer (uni.) MASE | DLinear MASE |
---|---|---|
Traffic | 0.910 | 0.965 |
Exchange-Rate | 1.087 | 1.690 |
Electricity | 0.751 | 0.831 |
The table above shows the results of the comparison between the Autoformer and DLinear models on the three datasets used in the paper. The results show that the Autoformer model outperforms the DLinear model on all three datasets.
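For readers unfamiliar with the metric, MASE (mean absolute scaled error) divides the forecast's mean absolute error by the in-sample mean absolute error of a naive forecast, so values below 1 indicate a forecast that beats the naive baseline. The following is a minimal sketch assuming that standard definition; the function name `mase`, the `periodicity` argument and the toy numbers are illustrative and not taken from the benchmark.

```python
def mase(training, actual, forecast, periodicity=1):
    """Mean absolute scaled error of `forecast` against `actual`.

    Assumption: the scale is the in-sample MAE of the (seasonal) naive
    forecast y[t] = y[t - periodicity] computed on `training`.
    """
    naive_errors = [
        abs(training[t] - training[t - periodicity])
        for t in range(periodicity, len(training))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return forecast_mae / scale


# Toy example (made-up numbers, only to show the mechanics).
training = [1.0, 3.0, 2.0, 4.0, 3.0]  # in-sample history
actual = [5.0, 4.0]                   # held-out ground truth
forecast = [4.0, 4.0]                 # model prediction
score = mase(training, actual, forecast)
```

Here the naive in-sample errors are 2, 1, 2, 1 (mean 1.5) and the forecast MAE is 0.5, so the score is about 0.33.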
Next, we will present the new Autoformer model along with the DLinear model. We will showcase how to compare them on the Traffic dataset from the table above, and provide explanations for the results we obtained.
TL;DR: A simple linear model, while advantageous in certain cases, has no capacity to incorporate covariates, unlike more complex models such as Transformers in the univariate setting.
Autoformer - Under The Hood
Autoformer builds upon the traditional method of decomposing time series into seasonality and trend-cycle components. This is achieved through the incorporation of a Decomposition Layer, which enhances the model's ability to capture these components accurately. Moreover, Autoformer introduces an innovative auto-correlation mechanism that replaces the standard self-attention used in the vanilla transformer. This mechanism enables the model to utilize period-based dependencies in the attention, thus improving the overall performance.
In the upcoming sections, we will delve into the two key contributions of Autoformer: the Decomposition Layer and the Attention (Autocorrelation) Mechanism. We will also provide code examples to illustrate how these components function within the Autoformer architecture.
Decomposition Layer
Decomposition has long been a popular method in time series analysis, but it had not been extensively incorporated into deep learning models until the introduction of the Autoformer paper. Following a brief explanation of the concept, we will demonstrate how the idea is applied in Autoformer using PyTorch code.
Decomposition of Time Series
In time series analysis, decomposition is a method of breaking down a time series into three systematic components: trend-cycle, seasonal variation, and random fluctuations. The trend component represents the long-term direction of the time series, which can be increasing, decreasing, or stable over time. The seasonal component represents the recurring patterns that occur within the time series, such as yearly or quarterly cycles. Finally, the random (sometimes called "irregular") component represents the random noise in the data that cannot be explained by the trend or seasonal components.
Two main types of decomposition are additive and multiplicative decomposition, which are implemented in the great statsmodels library. By decomposing a time series into these components, we can better understand and model the underlying patterns in the data.
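For concreteness, writing the value at time \(t\) as \(y_t\) and the trend-cycle, seasonal and random components as \(T_t\), \(S_t\) and \(R_t\) (symbols introduced here only for illustration), the two forms can be written as:

\[
y_t = T_t + S_t + R_t \quad \textrm{(additive)}, \qquad y_t = T_t \times S_t \times R_t \quad \textrm{(multiplicative)}
\]

For strictly positive series, taking logarithms turns the multiplicative form into an additive one, since \(\log y_t = \log T_t + \log S_t + \log R_t\).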
But how can we incorporate decomposition into the Transformer architecture? Let's see how Autoformer does it.
Decomposition in Autoformer
Figure: Autoformer architecture, from the paper
Autoformer incorporates a decomposition block as an inner operation of the model, as presented in the Autoformer's architecture above. As can be seen, the encoder and decoder use a decomposition block to aggregate the trend-cyclical part and extract the seasonal part from the series progressively. The concept of inner decomposition has demonstrated its usefulness since the publication of Autoformer. Subsequently, it has been adopted in several other time series papers, such as FEDformer (Zhou, Tian, et al., ICML 2022) and DLinear (Zeng, Ailing, et al., AAAI 2023), highlighting its significance in time series modeling.
Now, let's define the decomposition layer formally:
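For an input series \(\mathcal{X} \in \mathbb{R}^{L \times d}\) with length \(L\), the decomposition layer returns \(\mathcal{X}_\textrm{trend}, \mathcal{X}_\textrm{seasonal}\) defined as:
\[
\mathcal{X}_\textrm{trend} = \textrm{AvgPool}(\textrm{Padding}(\mathcal{X})), \qquad \mathcal{X}_\textrm{seasonal} = \mathcal{X} - \mathcal{X}_\textrm{trend}
\]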
And the implementation in PyTorch:
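A minimal sketch of such a layer is given below. It assumes an odd `kernel_size` (so that the padded moving average keeps the series length) and an input of shape (batch, time, features); the toy ramp input at the end is only for illustration.

```python
import torch
from torch import nn


class DecompositionLayer(nn.Module):
    """Splits a series into seasonal and trend parts with a moving average."""

    def __init__(self, kernel_size):
        super().__init__()
        # Assumption: kernel_size is odd, so the output keeps the input length.
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=1, padding=0)

    def forward(self, x):
        """x has shape (batch, time, features)."""
        # Pad both ends by repeating the first and the last time step.
        num_pads = (self.kernel_size - 1) // 2
        front = x[:, :1, :].repeat(1, num_pads, 1)
        end = x[:, -1:, :].repeat(1, num_pads, 1)
        x_padded = torch.cat([front, x, end], dim=1)

        # AvgPool1d pools over the last axis, so move time there and back.
        x_trend = self.avg(x_padded.permute(0, 2, 1)).permute(0, 2, 1)
        x_seasonal = x - x_trend
        return x_seasonal, x_trend


# Toy input: a linear ramp of 12 steps with a single feature.
x = torch.arange(12, dtype=torch.float32).reshape(1, 12, 1)
seasonal, trend = DecompositionLayer(kernel_size=5)(x)
```

By construction the two outputs sum back to the input, and away from the padded edges the trend of a linear ramp is the ramp itself.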
As you can see, the implementation is quite simple and can be used in other models, as we will see with DLinear. Now, let's explain the second contribution - Attention (Autocorrelation) Mechanism.
Attention (Autocorrelation) Mechanism
Figure: Vanilla self-attention vs. the autocorrelation mechanism, from the paper
In addition to the decomposition layer, Autoformer employs a novel auto-correlation mechanism which seamlessly replaces the self-attention. In the vanilla Time Series Transformer, attention weights are computed in the time domain and aggregated point-wise. On the other hand, as can be seen in the figure above, Autoformer computes them in the frequency domain (using the fast Fourier transform) and aggregates them by time delay.
In the following sections, we will dive into these topics in detail and explain them with code examples.
Frequency Domain Attention
Figure: Attention weights computation in the frequency domain using FFT, from the paper
In theory, given a time lag \(\tau\), the autocorrelation of a single discrete variable \(y\) measures the "relationship" (Pearson correlation) between the variable's current value at time \(t\) and its past value at time \(t-\tau\):
\[
\textrm{Autocorrelation}(\tau) = \textrm{Corr}(y_t, y_{t-\tau})
\]
Using autocorrelation, Autoformer extracts frequency-based dependencies from the queries and keys, instead of the standard dot-product between them. You can think about it as a replacement for the \(QK^T\) term in the self-attention.
In practice, autocorrelation of the queries and keys for all lags is calculated at once by FFT. By doing so, the autocorrelation mechanism achieves \(O(L \log L)\) time complexity (\(L\) is the input time length), similar to Informer's ProbSparse attention. Note that the theory behind computing autocorrelation using FFT is based on the Wiener–Khinchin theorem, which is outside the scope of this blog post.
Now, we are ready to see the code in PyTorch:
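The following is a minimal sketch of the FFT-based computation, assuming the queries and keys have already been brought to the same shape (batch, time, dim). The function name and the random toy inputs are illustrative.

```python
import torch


def autocorrelation(query_states, key_states):
    """Correlation of queries and keys for all lags at once, via FFT.

    Assumption: both inputs have shape (batch, time, dim). The result has
    the same shape, with index tau on the time axis holding the (circular)
    correlation at lag tau. It plays the role of QK^T in self-attention.
    """
    time_length = query_states.size(1)
    query_fft = torch.fft.rfft(query_states, dim=1)
    key_fft = torch.fft.rfft(key_states, dim=1)
    # Multiplying by the complex conjugate in the frequency domain
    # corresponds to a circular cross-correlation in the time domain.
    attn_weights = torch.fft.irfft(
        query_fft * torch.conj(key_fft), n=time_length, dim=1
    )
    return attn_weights


torch.manual_seed(0)
queries = torch.randn(2, 8, 4)
keys = torch.randn(2, 8, 4)
attn_weights = autocorrelation(queries, keys)
```

Passing `n=time_length` to `irfft` keeps the output length equal to the input length, including for odd lengths.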
Quite simple! 😎 Please be aware that this is only a partial implementation of autocorrelation(Q,K), and the full implementation can be found in 🤗 Transformers.
Next, we will see how to aggregate our attn_weights with the values by time delay, a process termed Time Delay Aggregation.
Time Delay Aggregation
Figure: Aggregation by time delay, from the Autoformer paper
Let's consider the autocorrelations (referred to as attn_weights) as \(\mathcal{R_{Q,K}}\). The question arises: how do we aggregate these \(\mathcal{R_{Q,K}}(\tau_1), \mathcal{R_{Q,K}}(\tau_2), ..., \mathcal{R_{Q,K}}(\tau_k)\) with \(\mathcal{V}\)? In the standard self-attention mechanism, this aggregation is accomplished through dot-product. However, in Autoformer, we employ a different approach. Firstly, we align \(\mathcal{V}\) by calculating its value for each time delay \(\tau_1, \tau_2, ... \tau_k\), which is also known as Rolling. Subsequently, we conduct element-wise multiplication between the aligned \(\mathcal{V}\) and the autocorrelations. In the provided figure, the left side shows the rolling of \(\mathcal{V}\) by time delay, while the right side illustrates the element-wise multiplication with the autocorrelations.
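It can be summarized with the following equations:
\[
\tau_1, \tau_2, \ldots, \tau_k = \textrm{arg Top-k}(\mathcal{R_{Q,K}}(\tau))
\]
\[
\hat{\mathcal{R}}_{\mathcal{Q,K}}(\tau_1), \ldots, \hat{\mathcal{R}}_{\mathcal{Q,K}}(\tau_k) = \textrm{Softmax}(\mathcal{R_{Q,K}}(\tau_1), \ldots, \mathcal{R_{Q,K}}(\tau_k))
\]
\[
\textrm{Autocorrelation-Attention} = \sum_{i=1}^{k} \textrm{Roll}(\mathcal{V}, \tau_i) \cdot \hat{\mathcal{R}}_{\mathcal{Q,K}}(\tau_i)
\]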
And that's it! Note that \(k\) is controlled by a hyperparameter called autocorrelation_factor (similar to sampling_factor in Informer), and softmax is applied to the autocorrelations before the multiplication.
Now, we are ready to see the final code:
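The sketch below is a simplified, single-head illustration of this aggregation and not the library implementation. It assumes `attn_weights` and `value_states` both have shape (batch, time, dim), picks the top \(k\) delays per batch element from the channel-averaged autocorrelations, and sets \(k\) to `autocorrelation_factor * log(time_length)`. The sign of the roll is a convention choice and should be treated as an assumption here.

```python
import math

import torch


def time_delay_aggregation(attn_weights, value_states, autocorrelation_factor=2):
    """Weighted sum of rolled values over the top-k delays (simplified sketch)."""
    batch_size, time_length, _ = value_states.shape

    # Number of delays to keep, clamped to a valid range.
    top_k = int(autocorrelation_factor * math.log(time_length))
    top_k = max(1, min(top_k, time_length))

    # Average the autocorrelations over channels: (batch, time).
    mean_corr = attn_weights.mean(dim=-1)
    top_k_corr, top_k_delays = torch.topk(mean_corr, top_k, dim=1)

    # Softmax over the selected delays gives the aggregation weights.
    weights = torch.softmax(top_k_corr, dim=-1)  # (batch, top_k)

    output = torch.zeros_like(value_states)
    for b in range(batch_size):
        for i in range(top_k):
            delay = int(top_k_delays[b, i])
            # Roll the values of this batch element along the time axis.
            rolled = torch.roll(value_states[b], shifts=-delay, dims=0)
            output[b] += weights[b, i] * rolled
    return output


torch.manual_seed(0)
toy_weights = torch.randn(2, 24, 4)
toy_values = torch.randn(2, 24, 4)
attn_output = time_delay_aggregation(toy_weights, toy_values)
```

Because the softmax weights sum to one, a constant value sequence passes through unchanged, which is a quick sanity check for the aggregation.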
We did it! The Autoformer model is now available in the 🤗 Transformers library, and is simply called AutoformerModel.
Our strategy with this model is to show the performance of the univariate Transformer models in comparison to the DLinear model, which is inherently univariate, as will be shown next. We will also present the results from two multivariate Transformer models trained on the same data.
DLinear - Under The Hood
Actually, DLinear is conceptually simple: it's just a fully connected layer combined with Autoformer's DecompositionLayer. It uses the DecompositionLayer above to decompose the input time series into the residual (the seasonality) and trend part. In the forward pass, each part is passed through its own linear layer, which projects the signal to an appropriate prediction_length-sized output. The final output is the sum of the two corresponding outputs in the point-forecasting model:
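A minimal point-forecasting sketch of this idea follows. The class name `DLinearSketch`, the default `kernel_size=25` and the `context_length=48` used in the example are illustrative assumptions, and the decomposition is re-implemented inline (moving average with edge replication) so that the block is self-contained.

```python
import torch
import torch.nn.functional as F
from torch import nn


class DLinearSketch(nn.Module):
    """Decomposition followed by one linear layer per component."""

    def __init__(self, context_length, prediction_length, kernel_size=25):
        super().__init__()
        self.kernel_size = kernel_size  # assumed odd
        self.linear_seasonal = nn.Linear(context_length, prediction_length)
        self.linear_trend = nn.Linear(context_length, prediction_length)

    def decomposition(self, context):
        """context has shape (batch, context_length)."""
        pad = (self.kernel_size - 1) // 2
        # Replicate the edge values, then take a stride-1 moving average.
        padded = F.pad(context.unsqueeze(1), (pad, pad), mode="replicate")
        trend = F.avg_pool1d(padded, kernel_size=self.kernel_size, stride=1)
        trend = trend.squeeze(1)
        return context - trend, trend

    def forward(self, context):
        seasonal, trend = self.decomposition(context)
        seasonal_output = self.linear_seasonal(seasonal)
        trend_output = self.linear_trend(trend)
        return seasonal_output + trend_output


torch.manual_seed(0)
model = DLinearSketch(context_length=48, prediction_length=24)
context = torch.randn(8, 48)
forecast = model(context)
```

Note that the model only ever sees the past target values: there is no input through which covariates such as date-time features could enter.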
In the probabilistic setting one can project the context length arrays to prediction_length * hidden dimensions via the linear_seasonal and linear_trend layers. The resulting outputs are added and reshaped to (prediction_length, hidden). Finally, a probabilistic head maps the latent representations of size hidden to the parameters of some distribution.
In our benchmark, we use the implementation of DLinear from GluonTS.
Example: Traffic Dataset
We want to show empirically the performance of Transformer-based models in the library, by benchmarking on the traffic dataset, a dataset with 862 time series. We will train a shared model on each of the individual time series (i.e. univariate setting). Each time series represents the occupancy value of a sensor and is in the range [0, 1]. We will keep the following hyperparameters fixed for all the models:
The transformers models are all relatively small with:
Instead of showing how to train a model using Autoformer, one can just replace the model in the previous two blog posts (TimeSeriesTransformer and Informer) with the new Autoformer model and train it on the traffic dataset. In order to not repeat ourselves, we have already trained the models and pushed them to the HuggingFace Hub. We will use those models for evaluation.
Load Dataset
Let's first install the necessary libraries:
The traffic dataset, used by Lai et al. (2017), contains San Francisco traffic data: 862 hourly time series showing the road occupancy rates in the range \([0, 1]\) on the San Francisco Bay Area freeways from 2015 to 2016.
Let's visualize a time series in the dataset and plot the train/test split:
Let's define the train/test splits:
Define Transformations
Next, we define the transformations for the data, in particular for the creation of the time features (based on the dataset or universal ones).
We define a Chain of transformations from GluonTS (which is a bit comparable to torchvision.transforms.Compose for images). It allows us to combine several transformations into a single pipeline.
The transformations below are annotated with comments to explain what they do. At a high level, we will iterate over the individual time series of our dataset and add/remove fields or features:
Define InstanceSplitter
For training/validation/testing we next create an InstanceSplitter, which is used to sample windows from the dataset (as, remember, we can't pass the entire history of values to the model due to time and memory constraints).
The instance splitter samples random context_length-sized and subsequent prediction_length-sized windows from the data, and appends a past_ or future_ key to any temporal keys in time_series_fields for the respective windows. The instance splitter can be configured into three different modes:
- mode="train": Here we sample the context and prediction length windows randomly from the dataset given to it (the training dataset)
- mode="validation": Here we sample the very last context length window and prediction window from the dataset given to it (for the back-testing or validation likelihood calculations)
- mode="test": Here we sample the very last context length window only (for the prediction use case)
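The three modes can be illustrated on a plain Python list. This is a simplified sketch of the windowing logic only, not the GluonTS API: the helper `split_instance`, its arguments and the returned dictionary keys are hypothetical names chosen for this example.

```python
import random


def split_instance(series, context_length, prediction_length, mode, rng=None):
    """Hypothetical helper mimicking the three sampling modes on a list."""
    rng = rng or random.Random(0)
    n = len(series)
    if mode == "train":
        # Random split point that leaves room for both windows.
        split = rng.randint(context_length, n - prediction_length)
    elif mode == "validation":
        # Last context window followed by the last prediction window.
        split = n - prediction_length
    elif mode == "test":
        # Last context window only; the future is unknown.
        split = n
    else:
        raise ValueError(f"unknown mode: {mode}")
    past = series[split - context_length:split]
    future = series[split:split + prediction_length]  # empty in test mode
    return {"past_values": past, "future_values": future}


series = list(range(10))
train_window = split_instance(series, 4, 2, mode="train")
validation_window = split_instance(series, 4, 2, mode="validation")
test_window = split_instance(series, 4, 2, mode="test")
```

With this toy series, the validation window uses values 4 to 7 as context and 8 to 9 as targets, while the test window uses values 6 to 9 as context and has no targets.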
Create PyTorch DataLoaders
Next, it's time to create PyTorch DataLoaders, which allow us to have batches of (input, output) pairs, or in other words (past_values, future_values).
Evaluate on Autoformer
We have already pre-trained an Autoformer model on this dataset, so we can just fetch the model and evaluate it on the test set:
At inference time, we will use the model's generate() method for predicting prediction_length steps into the future from the very last context window of each time series in the training set.
The model outputs a tensor of shape (batch_size, number of samples, prediction length, input_size).
In this case, we get 100 possible values for the next 24 hours for each of the time series in the test dataloader batch, which, if you recall from above, is 64:
We'll stack them vertically to get forecasts for all time series in the test dataset. We have 7 rolling windows in the test set, which is why we end up with a total of 7 * 862 = 6034 predictions:
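The bookkeeping behind this stacking can be sketched with NumPy. The zero-filled arrays below are placeholders standing in for per-batch model outputs, the trailing input_size axis is assumed to be squeezed away for the univariate case, and the number of samples is reduced from 100 to 4 only to keep the example light. Taking the median over the sample axis is one common way to obtain a point forecast.

```python
import numpy as np

num_series, num_windows = 862, 7
batch_size, prediction_length = 64, 24
num_samples = 4  # the post draws 100 samples; kept small here

total = num_series * num_windows  # 7 * 862 = 6034 forecasts in total

# Placeholder batches standing in for the per-batch model outputs.
batches = []
for start in range(0, total, batch_size):
    size = min(batch_size, total - start)  # the last batch is smaller
    batches.append(
        np.zeros((size, num_samples, prediction_length), dtype=np.float32)
    )

# Stack along the first axis: (6034, num_samples, prediction_length).
forecasts = np.vstack(batches)

# Collapse the sample axis into a point forecast: (6034, prediction_length).
point_forecast = np.median(forecasts, axis=1)
```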
So the result for the Autoformer model is:
To plot the prediction for any time series with respect to the ground truth test data, we define the following helper:
For example, for the time series in the test set with index 4:
Evaluate on DLinear
A probabilistic DLinear is implemented in gluonts and thus we can train and evaluate it relatively quickly here:
Train the model:
And evaluate it on the test set:
So the result for the DLinear model is:
As before, we plot the predictions from our trained DLinear model via this helper:
The traffic dataset has a distributional shift in the sensor patterns between weekdays and weekends. So what is going on here? Since the DLinear model has no capacity to incorporate covariates, in particular any date-time features, the context window we give it does not have enough information to figure out if the prediction is for the weekend or weekday. Thus, the model will predict the more common of the patterns, namely the weekdays, leading to poorer performance on weekends. Of course, by giving it a larger context window, a linear model will figure out the weekly pattern, but perhaps there is a monthly or quarterly pattern in the data which would require bigger and bigger contexts.
Conclusion
How do Transformer-based models compare against the above linear baseline? The test set MASE metrics from the different models we have are below:
Dataset | Transformer (uni.) | Transformer (mv.) | Informer (uni.) | Informer (mv.) | Autoformer (uni.) | DLinear |
---|---|---|---|---|---|---|
Traffic | 0.876 | 1.046 | 0.924 | 1.131 | 0.910 | 0.965 |
As one can observe, the vanilla Transformer which we introduced last year gets the best results here. Secondly, multivariate models are typically worse than the univariate ones, the reason being the difficulty in estimating the cross-series correlations/relationships. The additional variance added by the estimates often harms the resulting forecasts, or the model learns spurious correlations. Recent papers like CrossFormer (ICLR 23) and CARD try to address this problem in Transformer models. Multivariate models usually perform well when trained on large amounts of data. However, when compared to univariate models, especially on smaller open datasets, the univariate models tend to provide better metrics. When the linear model is compared with equivalent-sized univariate Transformers, or in fact any other neural univariate model, the neural models will typically perform better.
To summarize, Transformers are definitely far from being outdated when it comes to time-series forecasting! Yet the availability of large-scale datasets is crucial for maximizing their potential. Unlike in CV and NLP, the field of time series lacks publicly accessible large-scale datasets. Most existing pre-trained models for time series are trained on small sample sizes from archives like UCR and UEA, which contain only a few thousand or even hundreds of samples. Although these benchmark datasets have been instrumental in the progress of the time series community, their limited sample sizes and lack of generality pose challenges for pre-training deep learning models.
Therefore, the development of large-scale, generic time series datasets (like ImageNet in CV) is of the utmost importance. Creating such datasets will greatly facilitate further research on pre-trained models specifically designed for time series analysis, and it will improve the applicability of pre-trained models in time series forecasting.
Acknowledgements
We express our appreciation to Lysandre Debut and Pedro Cuenca for their insightful comments and help during this project ❤️.