2023–7–16 AI roundup: Weird step sizes help gradient descent, Better CPU matmuls

Some Machine Learning papers that you should be aware of

Devansh

This article originally appeared over here on Substack. I am reproducing it and sharing it here with the permission of the author. I would highly recommend subscribing to the original newsletter- Davis Summarizes Papers- as it is an exceptional source of insight into AI Research.

This newsletter made possible by MosaicML.

Also a reminder that I’m now retweeting high-quality threads about papers to try to improve the signal-to-noise ratio on ML twitter. If you write a thread explaining a paper, just tag me in it and I’ll share it with ~11k followers.

Provably Faster Gradient Descent via Long Steps

Well this explains a lot. Or at least…it might.

A classic optimization result is the convergence rate of gradient descent with constant step sizes on an L-smooth convex function f over T iterations:
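In textbook form, with step size 1/L, the bound is roughly the following (the paper's exact constants may differ):

f(x_T) - f(x^\ast) \;\le\; \frac{L\,\lVert x_0 - x^\ast\rVert^2}{2T}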

They generalize this result to prove convergence for repeating sequences of step sizes:

What’s surprising is that:

  1. This works even if individual steps within the sequence are too large to guarantee convergence.
  2. This gets you a better convergence guarantee than a fixed step size.

In fact, using solvers to find admissible sequences of step sizes gets them faster convergence as the sequence length increases.

A common pattern in their sequences is alternation between two small-ish step sizes, with an occasional huge step that’s roughly proportional to the sequence length. I don’t know if this pattern is fundamental or an artifact of the proof technique though.
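To get a feel for it, here's a toy sketch (mine, with made-up step sizes rather than the paper's certified sequences) of full-batch gradient descent on a convex quadratic, using a repeating schedule whose last entry is a deliberately oversized step:

import numpy as np

# Toy L-smooth convex objective: f(x) = 0.5 * x^T A x, with A PSD.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 20))
A = M @ M.T / 20.0
L = np.linalg.eigvalsh(A).max()              # smoothness constant

def grad(x):
    return A @ x

def run(x0, step_pattern, iters=300):
    x = x0.copy()
    for t in range(iters):
        h = step_pattern[t % len(step_pattern)]  # repeating sequence
        x = x - (h / L) * grad(x)                # step size h/L
    return 0.5 * x @ A @ x                       # final objective value

x0 = rng.standard_normal(20)
constant = run(x0, [1.0])                        # classic 1/L steps
# 3.9/L on its own would diverge (> 2/L), but it's admissible inside the pattern.
long_steps = run(x0, [1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 3.9])

print(f"constant 1/L steps:  {constant:.3e}")
print(f"periodic long steps: {long_steps:.3e}")

With these particular made-up numbers the long-step schedule typically lands at a lower objective than plain 1/L steps, which matches the flavor of the result; the actual certified sequences and rates are in the paper.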

This discovery is consistent with empirical observations in deep learning wherein having some degree of stochasticity is important. E.g., I once tried stratifying batches in image classification to reduce variance, but consistently got worse results. I’d always attributed this to variability helping you escape local minima, but:

  1. It’s not clear how much escaping local minima is a thing, and
  2. This paper instead suggests that variability helps you descend even within a given basin of attraction.

Of course, the results here differ from typical deep learning practice in that it’s a) full-batch gradient descent with b) no momentum. But this is still a weird and cool enough result that I’ll keep it in mind whenever looking at deep learning optimization from now on.

QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models

They take advantage of the weight matrices being fixed to perform various optimizations for CPU inference. In particular, they store weight tiles in Z-curve order and autotune some block/tile sizes.

They also support quantizing with different group sizes, storing the scales and (quantized) minimum values as separate arrays.
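For intuition about that layout (this is my own minimal sketch, not QIGen's kernels), here's group-wise 4-bit quantization where each group's scale and minimum live in separate arrays; the paper stores quantized minima, while I keep them in float for simplicity:

import numpy as np

def quantize_groups(w, group_size=64, bits=4):
    """Quantize a 1D weight vector in groups; per-group scales and mins
    are kept as separate arrays."""
    levels = 2 ** bits - 1
    w = w.reshape(-1, group_size)                  # one row per group
    mins = w.min(axis=1, keepdims=True)
    scales = (w.max(axis=1, keepdims=True) - mins) / levels
    q = np.round((w - mins) / scales).astype(np.uint8)
    return q, scales.squeeze(1), mins.squeeze(1)

def dequantize_groups(q, scales, mins):
    return q * scales[:, None] + mins[:, None]

w = np.random.randn(4096).astype(np.float32)
q, scales, mins = quantize_groups(w, group_size=64)
w_hat = dequantize_groups(q, scales, mins).reshape(-1)
print("max abs error:", np.abs(w - w_hat).max())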

This approach lets them run inference on LLaMA models up to 2.6x faster than llama.cpp.

Using smaller quantization groups improves perplexity but worsens throughput, as you’d expect. Eyeballing these plots, they also seem to have reproduced the result that 4b offers a better space vs perplexity tradeoff than 3b (at least with simple post-training quantization).

I haven’t dug into the price-performance of CPU vs GPU inference recently[1], but this certainly seems useful for the CPU case. I’m still pretty certain that we’ll eventually be using vector quantization rather than scalar quantization since it compresses so much harder, but ideas like these probably work in either case.

Rosko: Row Skipping Outer Products for Sparse Matrix Multiplication Kernels

They designed faster sparse times dense matmul kernels for CPUs.

You might recall that a matrix product AB can be computed via inner products of A’s rows and B’s columns, or outer products of A’s columns and B’s rows.

The way you actually do CPU matmuls using the latter strategy is by loading scalars (not vectors) from one matrix and just broadcasting them to match vectors from the second matrix. So they just traverse the columns of the sparse matrix and only do these scalar broadcasts for the nonzeros.
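Here's a NumPy-level illustration of that row-skipping outer-product idea. It's purely schematic; a real kernel like Rosko's does this with packed tiles, SIMD broadcasts, and careful cache blocking:

import numpy as np

def sparse_times_dense(A, B):
    """C = A @ B via outer products, skipping zero entries of sparse A."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=B.dtype)
    for k in range(K):                    # walk columns of A / rows of B
        col = A[:, k]
        nz = np.nonzero(col)[0]           # row-skipping: only the nonzeros
        for i in nz:
            C[i, :] += col[i] * B[k, :]   # scalar broadcast times a row vector
    return C

A = np.random.randn(64, 128) * (np.random.rand(64, 128) < 0.1)   # ~90% sparse
B = np.random.randn(128, 32)
assert np.allclose(sparse_times_dense(A, B), A @ B)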

This is way harder than it sounds because you want to use your cache space, cache bandwidth, DRAM bandwidth, and vector registers well. It’s pretty well-understood how to do this when your tiles are dense, but having unstructured sparsity makes the problem tricky.

Their method seems to work better than MKL, which is really saying something.

I’m not big on unstructured sparsity in general, but I could definitely see kernels like this being useful for handling outliers in LLMs (cf. SpQR, SqueezeLLM).

Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

Why only use LoRA for finetuning? Why not use it during pretraining too?

Well…one reason is that you’re throwing away a lot of model capacity if you do it naively, since you’d guarantee that your weights move only within a tiny subspace.

But you can avoid this by applying LoRA iteratively; in each iteration you a) fuse the low-rank matrices into the frozen weights and b) create new low-rank matrices. With rank R subspaces and N LoRA restarts, you could in principle move in a subspace of rank R*N.

To get this to work, you have to re-warmup the learning rate and reset the optimizer state at the start of each iteration. Otherwise you diverge.
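Here's a minimal PyTorch sketch of that merge-and-restart loop on a toy regression problem. It's my own simplification, and the ranks, learning rates, and warmup lengths are placeholders rather than the paper's settings:

import torch

d, rank, steps_per_restart, n_restarts = 64, 4, 200, 3
X = torch.randn(1024, d)
Y = X @ torch.randn(d, d) * 0.1            # toy regression target

W = torch.zeros(d, d)                      # "frozen" base weights
for restart in range(n_restarts):
    # (b) fresh low-rank factors each restart
    A = (0.01 * torch.randn(d, rank)).requires_grad_()   # down-projection
    B = torch.zeros(rank, d, requires_grad=True)         # up-projection, starts at zero
    opt = torch.optim.Adam([A, B], lr=1e-2)               # optimizer state reset
    sched = torch.optim.lr_scheduler.LinearLR(            # re-warm the learning rate
        opt, start_factor=0.01, total_iters=20)
    for step in range(steps_per_restart):
        pred = X @ (W + A @ B)
        loss = ((pred - Y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
    # (a) fuse the low-rank update into the frozen weights
    with torch.no_grad():
        W = W + A @ B
    print(f"restart {restart}: loss {loss.item():.4f}")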

In practice, this method moves through a higher-volume subspace than just doing LoRA once, but not as high-volume as full finetuning.

In terms of perplexity, repeated LoRA is better than just LoRA-ing once, but not quite as good as full finetuning.

Test-Time Training on Video Streams

You might not have labels at test time, but you can still do unsupervised training to adapt to distribution shifts in your inputs.

It’s not that surprising that test-time learning can beat a frozen baseline model if you tune it right. What is surprising is that doing online SSL beats doing offline SSL on the whole test video. Like, usually you would expect cheating to be better than a realizable algorithm.

Here’s what they did in more detail. Consistent with previous work, they have a Y-shaped model with a shared encoder that feeds into a task head and a reconstruction head. For each test-time frame, they use a masked autoencoding objective to run one update step on the encoder and reconstruction head to reduce the loss on a sliding window of recent frames. After the update, they make a prediction using the encoder + task head. At the end of each video, they revert all the test-time weight changes.
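Here's a schematic PyTorch sketch of that loop with a toy encoder and fake "frames"; the architecture, masking scheme, and hyperparameters are all stand-ins, not the paper's:

import copy
import torch
import torch.nn as nn

# Tiny stand-in for the Y-shaped model: shared encoder + two heads.
enc = nn.Sequential(nn.Linear(256, 128), nn.ReLU())
task_head = nn.Linear(128, 10)          # e.g., per-frame classification
recon_head = nn.Linear(128, 256)        # reconstruction for masked autoencoding

saved = copy.deepcopy(enc.state_dict()), copy.deepcopy(recon_head.state_dict())
opt = torch.optim.SGD(list(enc.parameters()) + list(recon_head.parameters()), lr=1e-3)

window, window_size = [], 16
video = torch.randn(100, 256)           # fake "frames" as feature vectors
for frame in video:
    window = (window + [frame])[-window_size:]          # sliding window of recent frames
    batch = torch.stack(window)
    mask = (torch.rand_like(batch) > 0.5).float()        # crude random masking
    recon = recon_head(enc(batch * mask))
    loss = ((recon - batch) ** 2).mean()                  # one SSL update per frame
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        pred = task_head(enc(frame.unsqueeze(0)))         # predict after the update

# End of video: revert all test-time weight changes.
enc.load_state_dict(saved[0]); recon_head.load_state_dict(saved[1])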

Their approach seems to work really well, even compared to other test-time training methods.

They also introduce the COCO Videos dataset, which consists of unusually long videos and unusually many classes compared to most alternatives.

Interestingly, using a moderate sliding window size of 16 to 64 frames in their optimization works better than using the whole video-so-far. This is evidence for their claim that exploiting temporal locality/smoothness in the videos is beneficial, and kind of makes sense from a bias vs variance perspective.

This result also makes me wonder if we could do even better by intelligently selecting the sliding window boundaries. E.g., maybe you reset the window if and only if there’s a cut or scene change.

This is pretty interesting, and makes me more bullish about continual learning being broadly useful. But boy is it gonna mess with everyone’s inference infrastructure. Maybe we should have all started listening to Chip Huyen sooner.

Instruction Mining: High-Quality Instruction Data Selection for Large Language Models

They try to identify high-quality subsets of instruction tuning datasets using linear regression. Basically, they:

  1. Define a bunch of features they can extract from each sample,
  2. Learn weights for these features to predict the eval loss (observed by finetuning using random subsets of samples), and then
  3. Take samples with features that correlate with lower loss.

Here are the features they extracted,

the datasets they pulled samples from,

and the learned coefficients for the features. Text length, reward model predictions, and SentenceBERT KNN distances are pretty predictive.
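To make the recipe concrete, here's a sketch with fake data; the feature count, subset sizes, and losses are placeholders, and I'm aggregating subset features with a simple mean, which may not match the paper exactly:

import numpy as np

rng = np.random.default_rng(0)
n_subsets, n_samples, n_features = 40, 10000, 3   # e.g., length, reward score, KNN distance

sample_feats = rng.standard_normal((n_samples, n_features))
subset_ids = [rng.choice(n_samples, size=500, replace=False) for _ in range(n_subsets)]
subset_feats = np.stack([sample_feats[ids].mean(axis=0) for ids in subset_ids])
eval_losses = rng.standard_normal(n_subsets)      # stand-in for observed finetuning losses

# Learn feature weights that predict eval loss (least squares with intercept).
X = np.hstack([subset_feats, np.ones((n_subsets, 1))])
coef, *_ = np.linalg.lstsq(X, eval_losses, rcond=None)

# Score every sample; lower predicted loss = "higher quality" sample.
scores = sample_feats @ coef[:-1]
selected = np.argsort(scores)[: n_samples // 10]  # keep the best 10%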

Finetuning LLaMA-7B using samples that their linear model suggests are good seems to do better than using random samples, as evaluated by GPT-3.5 and GPT-4.

Another example of machine learning improving machine learning. I kind of suspect that all the random design choices and heuristics we make up to lift accuracy will become data-driven at some point — at least given enough observations and good enough tooling.

Substance or Style: What Does Your Image Embedding Know?

Different methods of constructing image embeddings yield different information content and invariances.

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models

They failed to replicate a bunch of claimed pretraining speedups for BERT and T5. Sometimes methods that make each training step slower look better when you only report step count, but not when you look at wall time. You also have to make sure you compare results at the end of training when the learning rate is fully decayed, as opposed to comparing partially-trained models to fully-trained ones.

I just wanna 💯 this. It’s vital to hold time + hardware constant and get through your entire learning rate schedule.

Plus, the difficulty of replicating claimed wins is something we’ve learned the hard way at MosaicML. For every speedup that makes it into a winning MLPerf submission, there’s at least one more that we couldn’t replicate — and that’s after careful triage and examination of the literature to find what’s exceptionally promising.[2]

There’s also the complication that certain methods work only in certain circumstances; e.g., regularization methods often suck until you start training for a long time. And figuring out when different methods compose is yet another problem…

^ Figure from our training efficiency survey.

In-context Autoencoder for Context Compression in a Large Language Model

They propose to train an auxiliary LLM to compress your main LLM’s context into a shorter sequence of embeddings.

The auxiliary LLM is just a frozen model finetuned with LoRA. It’s trained to output embeddings that the target LLM can use to reconstruct the original context.

The target LLM is also pretrained with a regular old text completion objective. You feed it a special “[AE]” token to tell it whether to autoencode or do next-token prediction.

They find that you can compress your context by roughly a factor of 2 with little or no loss of quality, but it degrades hard past a certain point.

Not sure this is a speed vs accuracy win (except maybe with quadratic attention at long sequence lengths?), but it’s definitely interesting that 2x compression is possible. I wouldn’t be surprised if some sort of trained compression ends up replacing tokenizers.
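For a rough sense of why the speed win mostly shows up with quadratic attention at long contexts, here's a back-of-envelope sketch; the 30% attention share of total FLOPs is a number I made up for illustration, and it ignores the cost of running the compressor LLM itself:

# Toy cost model: attention scales ~quadratically with context length n,
# everything else ~linearly. Halving the context (2x compression) then
# cuts attention cost ~4x but the rest only ~2x.
def relative_cost(n, attn_frac=0.3):      # attn_frac: made-up FLOP share at n = 1
    return attn_frac * n ** 2 + (1 - attn_frac) * n

for n in (1.0, 0.5):                      # context length relative to uncompressed
    print(f"context {n:.1f}x -> cost {relative_cost(n) / relative_cost(1.0):.2f}x")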

Self-consistency for open-ended generations

One way of getting better LLM outputs is to sample a number of generations and take whichever output occurs most often. This works great when there are only a few possible outputs, like “true” or “false.” But what about when the output space can be anything?

They propose to score generations based on how similar they are to other generations, which is a relaxation of counting exact matches.

They propose a few different similarity measures based on n-gram statistics, token probabilities, and length.

A length-normalized unigram scoring function that takes into account probabilities (“Consensus-WUCS”) often works the best for both code and text generation. The non-length-normalized version (“WUCS”) is a close second, at least for code.
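Here's a toy sketch of the general idea using a crude unigram-overlap similarity; it's not the paper's WUCS or Consensus-WUCS formulas, just the "pick the generation most similar to the others" skeleton:

from collections import Counter

def unigram_sim(a, b):
    """Crude overlap similarity between two token lists (stand-in for the
    paper's n-gram / probability-weighted measures)."""
    ca, cb = Counter(a), Counter(b)
    return sum((ca & cb).values()) / max(len(a), len(b))

def pick_by_consensus(generations):
    """Score each generation by its average similarity to all the others
    and return the highest-scoring one."""
    toks = [g.split() for g in generations]
    scores = [
        sum(unigram_sim(toks[i], toks[j]) for j in range(len(toks)) if j != i)
        / (len(toks) - 1)
        for i in range(len(toks))
    ]
    return generations[max(range(len(scores)), key=scores.__getitem__)]

samples = ["def add(a, b): return a + b",
           "def add(x, y): return x + y",
           "def add(a, b): return a - b"]
print(pick_by_consensus(samples))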

Seems like it could be a practical + general win for improving output quality while holding the model constant — though you do have to pay the extra inference cost of performing multiple generations.

Work on ML? Like paper summaries? Want to bolster my self-esteem? Join 11,000+ ML researchers and practitioners by adding your email address. 👇

https://dblalock.substack.com/

[1] Mostly since AMX instruction reciprocal throughputs aren’t in the Intel intrinsics guide or Agner Fog’s instruction tables, and I can’t be bothered to rent a Sapphire Rapids chip.

[2] Which, incidentally, is the origin story of this newsletter. I was just going through the literature for our internal purposes and everyone was like “you should totally make this into a newsletter!”

Join 90K+ tech leaders and get insights on the most important ideas in AI straight to your inbox through my free newsletter- AI Made Simple

I’ll catch y’all with more of these next week. In the meantime, if you’d like to find me, here are my social links-

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819
