Understanding Online Training for ML Engineers

A new approach to training your AI Models

Devansh
13 min read · Mar 19, 2024

Regular readers will recognize the name Logan Thorneloe (Twitter/LinkedIn), a writer we have referred to many times in our articles. He is a machine learning engineer at Google, currently focused on building efficient machine learning systems at scale. He also writes the excellent newsletter Society’s Backend, which helps others learn about machine learning engineering and understand AI. For those interested, he has also created a roadmap to help anyone navigate the free, high-quality machine learning educational resources available online.

Join 150K+ tech leaders and get insights on the most important ideas in AI straight to your inbox through my free newsletter- AI Made Simple

Online machine learning is a method for keeping a machine learning model continually updated in production. Instead of batch training, where a model is given data, trained, validated, and sent to serving, online training allows all steps of that process to happen in real-time (or near real-time). This means as data comes in, a model is trained on it and updates in production so users have access to the updated model immediately.

There are two primary reasons to use an online training platform:

  • A model needs to be frequently updated in production.
  • A model is trained on data that is non-stationary.

If a model fits either or both of these, it can potentially benefit from online training by improving model accuracy and decreasing resource costs. I’ll explain this further below.

Online training is interesting to read about because it’s used much less frequently than batch training and requires more infrastructure and planning to get right, which causes online learning strategies to differ widely between teams. As a result, online learning is often misunderstood, particularly how it works and why a team might use it.

I’m going to walk through the basics of how online training works, how it differs from batch training, and why a team might use it to train their model. I’m not going to get into the specifics of how to implement an online learning platform because i) it differs widely between use cases and ii) it requires expertise to implement properly.

Throughout this article, I’m going to refer to two different papers regarding online training platforms that have produced models you likely use each day. The first is ByteDance’s Monolith, which is (very likely, though not explicitly stated) used by TikTok to recommend videos; the second describes Google’s click-through-rate models, which are part of the system used to serve Google ads.

How Batch Training Works

Let’s start with a general overview of batch training so we can more easily compare it to online training. Batch training begins with a model and a dataset to train on. The model parameters (or weights) are usually randomly initialized to start and are updated as the model sees more data.

To feed data into the model for training, we first split our data into two sets: a training set and a test set (generally an 80/20 split). We’re going to use the training set now and come back to the test set during validation. It’s important for validation that our model be tested on data it has never seen, so we exclude the test set from model training.

Next, we start feeding our data into the model. To do this, we split our training set into multiple batches. Batches let the model train on a small chunk of data at a time; since the full training set is usually too large to fit in memory, this is how the model works through all of it. To create these batches, we randomly sample a number of training samples (our batch size) from the training set and feed them into the model. We do this over the entirety of the training set so the model sees every training sample. One pass over all the data is considered one epoch.

While feeding batches through the model, we do what are called forward and backward passes of the model. On the forward pass, we use our input and parameters to make a prediction (create an output) with the model. That prediction is then used to calculate loss, which is a comparison between our prediction and what we expected. The gradient (or derivative) of this loss is used on the backward pass to change our model parameters so our prediction can be better next time. This forward and backward pass process is done multiple times during an epoch.

At the end of the epoch, we perform validation. This uses the test set to evaluate model performance. It’s important to use the test set here to ensure a model can generalize predictions outside of the data it’s trained on. Evaluation is used to test model performance and ensure training quality throughout the training process.

The loop of forward pass, backward pass, and validation is done repeatedly so the model runs through the training data many times. The goal is for validation accuracy to improve and for loss to decrease as training continues. As loss gets smaller, the model becomes more stable and approaches the point where it can no longer learn anything more from the training set. This is the point where training has finished. As a model gets closer to finishing training, its learning rate is decreased. A lower learning rate means the model makes much smaller changes to its weights during backward passes, ensuring it doesn’t overshoot the optimal parameters when updating them.
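To make this concrete, here is a minimal sketch of the batch training loop described above, written in PyTorch. The dataset, architecture, and hyperparameters are hypothetical placeholders, not anything from the papers discussed here:

```python
import torch
from torch import nn

# Stand-in dataset: 1,000 samples with 16 features and a binary label.
X = torch.randn(1000, 16)
y = (X.sum(dim=1) > 0).float()
split = int(0.8 * len(X))                 # 80/20 train/test split
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.5)  # LR decay

batch_size = 32
for epoch in range(20):
    perm = torch.randperm(split)          # randomly shuffle before batching
    for i in range(0, split, batch_size):
        idx = perm[i:i + batch_size]
        pred = model(X_train[idx]).squeeze(1)   # forward pass: make predictions
        loss = loss_fn(pred, y_train[idx])      # compare prediction vs. label
        opt.zero_grad()
        loss.backward()                         # backward pass: compute gradients
        opt.step()                              # update the parameters
    sched.step()                                # smaller steps as training progresses
    with torch.no_grad():                       # validate on held-out test data
        acc = ((model(X_test).squeeze(1) > 0) == y_test.bool()).float().mean()
    print(f"epoch {epoch}: loss={loss.item():.3f} test_acc={acc.item():.3f}")
```

Each epoch shuffles the training set, walks through it batch by batch with a forward and backward pass, decays the learning rate, and validates against the held-out test set.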

Once a batch training model is finished training (and is validated appropriately), it can be put into production. This means users can query an endpoint to perform inference (make a prediction) on the model. Models in production are replaced when a new model is trained that achieves better performance.

How Online Training Works

Online training works a bit differently. Instead of starting with a full training dataset, online training receives data samples on the fly, meaning there’s a steady stream of data samples to feed into the model. The first difference between batch and online training is that online training requires significantly more data, and that data needs a low-latency path into the model.

Similar to batch training, the model does a forward and backward pass on each sample it receives to adjust its weights. The second difference occurs here: once a model trains on a sample, it generally doesn’t see that sample again. This is called sequential training. Online models train on a data sample once and then move on to the other data samples coming into the system. Since there is more than enough data to train on, there’s no need to batch it and revisit samples to improve model performance.

So if we never revisit data, what’s considered an epoch in online learning? It can be whatever you define it as. Similar to batch training, it’s a group of training samples, but since there isn’t a definitive training set, it isn’t defined by the total number of training samples. Instead, it’s defined by either i) a number of training samples or ii) a period of time spent training on data. This flexible definition of an epoch is our third difference; just as in batch training, though, epochs are still completed in online training.

Unlike batch training, we don’t validate our model at the end of an epoch in online learning. How do we validate an online model? This is our fourth difference and it’s something I find really fascinating. Since our model is constantly training and only ever sees a data sample once, we can validate on a data sample before training on it. Just like batch training, we have to use data we haven’t seen to validate our model. Unlike batch training, we don’t visit data samples multiple times, so there’s no need to separate out data for validation. This process is called progressive validation.
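Here is a minimal sketch of progressive (test-then-train) validation, using scikit-learn’s SGDClassifier and its partial_fit method. The synthetic data stream is a hypothetical stand-in for whatever real-time pipeline feeds your model:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")  # logistic regression trained by SGD
classes = np.array([0, 1])              # full label set, required on the first partial_fit
seen = correct = 0

def stream(n=10_000):
    """Hypothetical stand-in for a real-time data stream."""
    rng = np.random.default_rng(0)
    for _ in range(n):
        x = rng.normal(size=(1, 4))
        yield x, np.array([int(x[0, 0] + x[0, 1] > 0)])

for x, y in stream():
    if seen > 0:
        # Validate BEFORE training: the model has never seen this sample,
        # so no separate held-out test set is needed (progressive validation).
        correct += int(model.predict(x)[0] == y[0])
    # Then train on the same sample exactly once (sequential training).
    model.partial_fit(x, y, classes=classes)
    seen += 1

print(f"progressive validation accuracy: {correct / (seen - 1):.3f}")
```

Because every sample is validated on before it is ever trained on, the running accuracy is an honest measure of generalization without setting any data aside.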

Like batch training, validation is used to assess the performance of a model during the training process. This training loop continues as long as data is coming in and we need this model serving user requests. Data will continue to stream into the system and the model will use that data to train and improve model weights.

Our fifth difference between batch and online training is that we don’t stop training, not until another model is ready to replace the one we’re currently training. This happens when a different model (think architecture change, not weight updates) has shown better performance on the task at hand. In production systems, this usually means better accuracy and/or better resource usage. A better model is determined via experimentation: training one or more models “offline”, comparing them to our online model, and putting one online (in serving) when it performs better.

Notice how the replacement model is trained “offline”. Models are trained offline to catch up to the leading edge of incoming data. We could just start training a replacement model online, but it would take too long to reach the performance needed to replace the current model, which isn’t justifiable from a time or resource standpoint. With many training samples already having gone through the system, we can use those to train the replacement offline. This means it isn’t getting the cutting-edge data, but instead trains on data the online models have previously seen until it catches up to the point where it can train online.

[Figure: The training architecture from the ByteDance paper, showing the distinction between the offline batch-training and online training stages.]

Offline training in online learning systems can be done in multiple ways, but there are two I’m most familiar with. The first is using the data that trained the online models to batch-train the offline models. This offers the flexibility and resource efficiency of batch training until the model starts training online. The other way is to mimic online training offline (following the exact training process described in this section) but using more resources to go through those training steps faster. This allows a model to “catch up” to the cutting edge of data while simulating the training process online models go through. It’s important to note that this approach also requires some sort of batching to parallelize the data and increase training throughput, but it preserves many of the benefits of sequential training (i.e., better performance on non-stationary tasks).

One last difference I’d like to highlight between batch and online training is how the learning rate is applied. In batch training, the learning rate decreases toward the end of training to ensure the optimal parameters are reached. In online training, there isn’t an end to training, and therefore no decrease in the learning rate. Instead, the learning rate is data-dependent and fluctuates. You can read more about this in section 4.2 of the Google click-through-rate paper.
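As a rough illustration, here is a sketch of a per-coordinate, data-dependent learning rate in the spirit of that section: each weight’s step size shrinks as its feature accumulates squared gradients, so rarely-seen features keep learning quickly while common ones stabilize. The constants and synthetic stream are illustrative assumptions, not values from the paper:

```python
import numpy as np

alpha, beta = 0.1, 1.0           # illustrative constants, not from the paper
w = np.zeros(4)                  # model weights
g_sq = np.zeros(4)               # per-coordinate running sum of squared gradients

rng = np.random.default_rng(0)
for _ in range(10_000):          # hypothetical stand-in for the data stream
    x = rng.normal(size=4)
    y = float(x[0] + x[1] > 0)   # synthetic label
    p = 1.0 / (1.0 + np.exp(-w @ x))       # forward pass: predicted probability
    g = (p - y) * x                        # gradient of log loss w.r.t. w
    g_sq += g * g
    eta = alpha / (beta + np.sqrt(g_sq))   # per-coordinate learning rate: shrinks
    w -= eta * g                           # as a feature accumulates gradient
```

Notice there is no schedule tied to “the end of training”: the rate for each coordinate adapts purely to the data that coordinate has seen.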

Benefits of Online Training

  • Real-time updates: Online training makes it very simple and cost-effective to update a model’s weights with very low latency. This is especially beneficial for applications that are non-stationary, where frequent updates are necessary to optimize model performance.
  • Sequential information: Online trained machine learning models receive information sequentially, making them particularly useful for non-stationary applications. Sequential training causes the model to be biased toward more recent training samples when making predictions. This is ideal for tasks that are frequently changing, such as recommendation systems and ads.
  • Resource efficiency: One of the misconceptions I see about online training is that it’s data-efficient. In reality, it requires more data samples than batch training does (as mentioned above) for similar performance, making it data-inefficient. It is resource-efficient, however, because each data sample only needs to be visited once to learn from it, instead of being revisited over multiple epochs. This resource efficiency is especially cost-effective at scale.
  • No overfitting: Batch-trained models run the risk of overfitting if training parameters aren’t tuned properly or the model isn’t trained on a diverse enough dataset. Overfitting means the model forms its predictions around the specific data it’s trained on and can’t generalize to other samples. Having a separate test set helps mitigate this. In online training, data samples are only ever visited once, so this isn’t an issue.
  • Reproducibility: Randomly sampling during batch training makes it particularly difficult to reproduce trained models. While a model may be retrained to achieve similar performance on a task, the stochastic nature of the training process makes it nearly impossible to achieve the same model on different training runs. Sequential training makes this more feasible by enabling multiple models to train on the same data in the same order with the same parameters. This is especially useful for experimentation and other applications where model comparison is critical.
An interesting takeaway found in the Google paper: While many methods can be used to decrease training costs, working smarter with data was by far the best. Machine learning efficiency is really just understanding your data.

Complications of Online Training

  • Added infrastructure: Online training requires significant changes to learning infrastructure. It requires a method of streaming live data to the training service. This means data must be collected, preprocessed, and stored with low latency. Data also needs to be stored in a way that preserves the collection timestamp and makes it easy to load training data for both online and offline training. Other infrastructure additions include online-specific learning metrics to compare model performance with other models and infrastructure to continually update models in training and serving.
  • Bad data: Online, sequentially-trained models are particularly susceptible to bad inputs. Since sequentially trained models only visit data once and the most recent data tends to carry more influence, bad incoming data can impact model performance. This means two things: online training needs quick rollbacks built into training infrastructure (i.e. if a model is trained on poor data it can be reverted quickly) and validation and testing infrastructures need to catch bad inputs whenever possible.
  • Late-arriving information: Live-streaming training data adds further difficulty when some of the information you need for training isn’t available immediately. For example, if the training data coming through shows a user clicking a link, the late-arriving data might tell the model what the user does after clicking the link, which may contain important information for model performance. Online training platforms need a way of dealing with this issue (see the sketch after this list).
  • Redundancy in training: Online training requires a certain level of redundancy in the training process. This is obvious for serving; if a compute cluster goes down, there need to be backups that can process incoming requests. For online training, if a training cluster goes down, others need to pick up the training job quickly to ensure the model stays on the cutting edge of incoming data.
  • More data and more compute: Online training requires a stream of real-time data to continuously train and requires enough compute to continually train on that data. This means online training generally has a higher data and compute requirement than batch training, but is able to use the compute and data resources more efficiently.
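As one way of handling the late-arriving information problem mentioned above, here is a hedged sketch of a join buffer: events are held by a key until their delayed outcome arrives (or a timeout passes), and only then emitted as complete training samples. All names and the timeout policy are illustrative assumptions, not details from either paper:

```python
import time

pending = {}          # event_id -> (features, arrival_time)
TIMEOUT_S = 3600      # give up waiting for the outcome after an hour

def on_event(event_id, features):
    """A click (or impression) arrives; its outcome isn't known yet."""
    pending[event_id] = (features, time.time())

def on_outcome(event_id, label):
    """The delayed outcome arrives; now we can emit a full training sample."""
    if event_id in pending:
        features, _ = pending.pop(event_id)
        return features, label    # ready to feed to the online trainer
    return None                   # outcome for an unknown or expired event

def sweep_expired(negative_label=0):
    """Treat events whose outcome never arrived as negatives (one common choice)."""
    now = time.time()
    expired = [eid for eid, (_, t) in pending.items() if now - t > TIMEOUT_S]
    return [(pending.pop(eid)[0], negative_label) for eid in expired]
```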
Remember: the reality of production ML systems isn’t just about maximizing model performance, but about maximizing efficiency while improving model quality. Resource costs are a top factor in machine learning at scale.

Should Your Team Use Online Training?

Now that you understand how online machine learning works and its benefits, you may be asking yourself: should I use it? The deciding factor should be your application of machine learning. Simply put, determine whether your application benefits from online machine learning; if it does, use it. This is why online training is used for click prediction, as described in the Google paper:

“Click-through data is highly non-stationary: click prediction is fundamentally an online recommendation problem.”

Think about the following considerations:

  1. Does your system benefit from real-time model updates? This is the primary motivator for online machine learning. Some use cases require this and some benefit greatly from it. For example: recommendation systems, stock predictions, weather predictions, and ads all benefit from this.
  2. Do you have the data and compute requirements necessary for an online training platform? Online training requires a steady flow of real-time data to train models. It also requires the compute to constantly use this data. Without either, benefiting from online learning becomes very difficult.
  3. Does your platform benefit from the built-in advantages of online machine learning? Is your training data non-stationary? Is reproducibility important? Both of these are key advantages of online, sequential machine learning and may justify the costs of implementing online learning infrastructure.

The bottom line is to understand whether your machine learning use case benefits greatly from online learning and whether you’re in a position to implement it. If both are true, it is likely worth the cost to build and train on an online system.

If you liked this article and wish to share it, please refer to the following guidelines.

That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.

If you find AI Made Simple useful and would like to support my writing- please consider becoming a premium member of my cult by subscribing below. Subscribing gives you access to a lot more content and enables me to continue writing. This will cost you 400 INR (5 USD) monthly or 4000 INR (50 USD) per year and comes with a 60-day, complete refund policy. Understand the newest developments and develop your understanding of the most important ideas, all for the price of a cup of coffee.

Become a premium member

I regularly share mini-updates on what I read on the microblogging sites X (https://twitter.com/Machine01776819), Threads (https://www.threads.net/@iseethings404), and TikTok (https://www.tiktok.com/@devansh_ai_made_simple)- so follow me there if you’re interested in keeping up with my learnings.

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819
