Are Transformers Effective for Time Series Forecasting?
How well does Machine Learning’s Favorite Architecture stack up against simple linear models?
2022 was a breakout year for Transformers in AI Research. ChatGPT has been all the rage recently. People have been using it for all kinds of tasks- from writing sales emails and finishing college assignments, to even treating it as a possible alternative to Google Search. Combine this with other Large Language Models like BERT, AI Art generators like Stable Diffusion and DALL-E, and hits like DeepMind’s Gato for robotics and Google’s Minerva for Math, and huge Transformers trained on a lot of data and compute start to look like God’s Gift to Machine Learning.
Transformers were created as upgrades to Recurrent Neural Networks for handling sequential data. While they have undeniably been much better at processing text, RNNs have kept a strong presence in tasks involving Time Series Forecasting. This leads to the question, “Are Transformers Effective for Time Series Forecasting?” The authors of a paper with the same name set out to answer exactly that. To do so, they compared Transformers with a simple model that they refer to as DLinear. The architecture of DLinear can be seen below.

By comparing the performance of DLinear against Transformer-based solutions, the authors provided their answer to the utility of Transformers in Time Series Forecasting (TSF). In this article, I will break down their findings and walk through their results to help you understand what these researchers discovered.
Spoiler Alert: The results were not pretty for Transformers.

Benefits of Using Simple Linear Systems
In a world of exceedingly complex and non-linear architectures, DLinear might look extremely out of place. However, there are several benefits to using a simpler model like DLinear. To quote the authors- “Although DLinear is simple, it has some compelling characteristics:
- An O(1) maximum signal traversing path length: The shorter the path, the better the dependencies are captured [18], making DLinear capable of capturing both short-range and long-range temporal relations.
- High-efficiency: As each branch has only one linear layer, it costs much lower memory and fewer parameters and has a faster inference speed than existing Transformers (see Table 8).
- Interpretability: After training, we can visualize weights from the seasonality and trend branches to have some insights on the predicted values [8].
- Easy-to-use: DLinear can be obtained easily without tuning model hyper-parameters.”
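To make that architecture concrete, here is a minimal PyTorch sketch of a DLinear-style model, written from the paper’s description rather than the authors’ code: split the input window into a trend component (a moving average) and a seasonal remainder, then map each branch to the forecast horizon with a single linear layer. The class name, kernel size, and padding choice below are my own assumptions.

```python
# Minimal DLinear-style sketch (my own, based on the paper's description).
import torch
import torch.nn as nn

class DLinearSketch(nn.Module):
    def __init__(self, input_len, horizon, kernel_size=25):
        super().__init__()
        # Moving average extracts the trend; the remainder is the "seasonal" part.
        self.moving_avg = nn.AvgPool1d(kernel_size, stride=1,
                                       padding=kernel_size // 2,
                                       count_include_pad=False)
        # One linear layer per branch: look-back window -> forecast horizon.
        self.trend_linear = nn.Linear(input_len, horizon)
        self.seasonal_linear = nn.Linear(input_len, horizon)

    def forward(self, x):
        # x: (batch, input_len) for a single series
        trend = self.moving_avg(x.unsqueeze(1)).squeeze(1)
        seasonal = x - trend
        return self.trend_linear(trend) + self.seasonal_linear(seasonal)

model = DLinearSketch(input_len=96, horizon=336)
forecast = model(torch.randn(32, 96))  # -> shape (32, 336)
```

That really is the whole model- a moving average and two linear layers- which is why it is cheap to train, fast at inference, and easy to inspect after training.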
The success of using linear models also hints at the possibility that many TSF tasks have a strong linearity in them. The key differentiator of Deep Learning has been the introduction of non-linearity, which makes it possible to model very complex relationships. The performance of DLinear and other simpler architectures might imply that TSF is mostly a linear problem. This is just conjecture on my part, based on my reading and experiences. If you have different experiences, let me know.

Now for the part that you have all been waiting for: how well do Transformers perform in Time Series Forecasting? Does a simple model like DLinear hold a candle to Transformers? Let’s look into a few experiments and their outcomes.

Transformers vs DLinear
The experiment setup was pretty straightforward. The authors compared DLinear against various Transformer-based models across nine widely-used benchmark datasets. They also used a variety of look-back windows and forecast horizons to add rigor to their findings. The results can be found in the table below-

Obviously, these numbers can be hard to understand without any other context. Fortunately, the authors graphed the results of their experiments to help us visualize some of the differences.

I’m going to zoom in on the Exchange Rate dataset, simply because its graph is not as messy as the other two, so the differences in performance are easier to see-

It seems like DLinear just doesn’t break down the way the transformer models do. The authors made the following observation about their experiments-
When the input length is 96 steps and the output horizon is 336 steps, Transformers [28, 27, 29] fail to capture the scale and bias of the future data on Electricity and ETTh2. Moreover, they can hardly predict a proper trend on aperiodic data such as Exchange-Rate. These phenomena further indicate the inadequacy of existing Transformers for the TSF task.
The note about aperiodic data being a challenge for transformers is particularly interesting to me. It’s something that I will be looking into more in the future. If you have any experiences/thoughts you’d like to share, please do leave them in the comments. If you’d like to have a more extended conversation, I always leave my social media links at the end of every article. I love it when my readers/viewers use that to reach out to talk about their inputs, feedback, or even just projects they’re working on. Makes for very interesting discussions and helps me learn a ton.
To finish this article, let’s talk about why Transformers are not effective for Time Series Forecasting.
We introduce an embarrassingly simple linear model DLinear as a DMS forecasting baseline for comparison. Our results show that DLinear outperforms existing Transformer-based solutions on nine widely-used benchmarks in most cases, often by a large margin
-The Conclusion by the authors
Sponsored Segment

Join Noom, the psychology-based program for lasting healthy outcomes
Getting healthier means changing your lifestyle. Noom Weight helps users lose weight in a sustainable way through behavioral change psychology. It’s a no-brainer. You can learn more about Noom here.
Why Transformers fail at Time Series Forecasting
The authors had some very salient observations about Transformers and why they might be ineffective for TSF-based tasks. Their analysis points to the attention mechanism as a possible weakness for time series forecasting tasks-
More importantly, the main working power of the Transformer architecture is from its multi-head self-attention mechanism, which has a remarkable capability of extracting semantic correlations between paired elements in a long sequence (e.g., words in texts or 2D patches in images), and this procedure is permutation-invariant, i.e., regardless of the order. However, for time series analysis, we are mainly interested in modeling the temporal dynamics among a continuous set of points, wherein the order itself often plays the most crucial role.
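To see what “permutation-invariant” means in practice, here is a small NumPy sketch (my own illustration, not from the paper) of bare self-attention with no positional encodings. Shuffling the time steps of the input simply shuffles the outputs in the same way- the attention computation itself extracts nothing from the ordering, which is exactly the information a forecaster cares about most. Positional embeddings reintroduce some order information, but the authors argue the temporal information loss is still unavoidable.

```python
# Sketch (mine): self-attention without positional encodings ignores order.
import numpy as np

def self_attention(x):
    # Single-head, unparameterized attention, purely for illustration.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))   # 6 time steps, 4 features
perm = rng.permutation(6)     # shuffle the time axis

out = self_attention(x)
out_shuffled = self_attention(x[perm])

# Each step's output value is unchanged; only its position moves.
print(np.allclose(out[perm], out_shuffled))  # True
```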
Their observations square very well with my own experience. When I was working on Supply Chain Forecasting, I noticed that ARIMAs and other simpler models were much better at adapting to supply chain shocks, where input values suddenly increase or decrease. Keep in mind that this is just my anecdotal experience from that one project, but it does align with their diagnosis of the situation.
For Machine Learning, a base in Software Engineering, Math, and Computer Science is crucial. It will help you conceptualize, build, and optimize your ML systems. My daily newsletter, Technology Made Simple, covers topics in Algorithm Design, Math, Recent Events in Tech, Software Engineering, and much more to make you a better Machine Learning Engineer. I have a special discount for my readers.

Save the time, energy, and money you would burn by going through all those videos, courses, products, and ‘coaches’ and easily find all your needs met in one place.

I am currently running a 20% discount for a WHOLE YEAR, so make sure to check it out. Using this discount will drop the prices-
800 INR (10 USD) → 533 INR (8 USD) per Month
8000 INR (100 USD) → 6400 INR (80 USD) per year
You can learn more about the newsletter here. If you’d like to talk to me about your project/company/organization, scroll below and use my contact links to reach out to me.
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here.
To help me understand you, fill out this survey (anonymous)
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819
If you’re looking to build a career in tech: https://codinginterviewsmadesimple.substack.com/