Stanford University Explains Why CoT Prompts Work in Prompt Engineering

The structure of variables in your dataset can be optimized for better language model performance

Devansh
10 min read · May 31, 2023

Join 32K+ people and get insights on the most important ideas in AI straight to your inbox through my free newsletter- AI Made Simple

As research into Large Language Models becomes more and more mainstream, we have seen a lot of work on how they can be used more effectively. One of the techniques that has unlocked a higher level of performance from LLMs is chain-of-thought prompting. Instead of asking an LLM for an answer directly, we prompt the model to generate a series of intermediate steps. This leads to better performance on certain kinds of tasks. According to the paper, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” -

Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.
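To make the contrast concrete, here is a minimal sketch of the two prompting styles. The arithmetic problems and the worked-out reasoning are illustrative stand-ins rather than verbatim examples from the paper, and no particular model or API is assumed.

```python
# Illustrative only: the questions and the worked steps are made up for this sketch.

direct_prompt = (
    "Q: A juggler has 16 balls. Half are golf balls, and half of the golf balls "
    "are blue. How many blue golf balls are there?\n"
    "A:"
)

chain_of_thought_prompt = (
    "Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A: Let's think step by step. They started with 23 apples, used 20, "
    "leaving 23 - 20 = 3. They bought 6 more, so 3 + 6 = 9. The answer is 9.\n\n"
    "Q: A juggler has 16 balls. Half are golf balls, and half of the golf balls "
    "are blue. How many blue golf balls are there?\n"
    "A:"
)

# The direct prompt asks for the answer immediately; the chain-of-thought prompt
# seeds the model with a worked example so it generates intermediate steps first.
print(direct_prompt)
print(chain_of_thought_prompt)
```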

This raises a few interesting questions about why the technique works so well and how we can leverage it effectively. The paper “Why think step by step? Reasoning emerges from the locality of experience” offers some very interesting insights about chain-of-thought prompting and language models. The authors explore why chain-of-thought prompting works (and when it will help), and their findings have interesting implications for designing better datasets for language models. In this article, we will break down the findings of the Why Think Step by Step paper in more detail to explore these implications. If you’re interested in working with autoregressive Large Language Models, this is not an article you want to miss.

I hope y’all appreciate the amount of work I put into finding the right meme templates.

The basic hypothesis that the authors put forth is relatively straightforward-

We posit that chain-of-thought reasoning becomes useful exactly when the training data is structured locally, in the sense that observations tend to occur in overlapping neighborhoods of concepts

The Setup

To test their hypothesis, the authors use Bayesian networks (a super underutilised tool imo). The learner (an AI agent) is trained on samples from a Bayes net and needs to accurately estimate conditional probabilities between pairs of its variables. The twist is that the learner may not see all variables together, only locally structured observations. By giving it access to locally structured observations, we can see if the agent can hippity-hop through the connected pieces to get to the final result.
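To ground the setup, here is a minimal sketch of the kind of quantity the learner has to match: a conditional probability between two variables that may never appear together in a training sample. The chain structure and the conditional probability tables are made-up toy choices, not the randomly sampled Bayes nets from the paper.

```python
import itertools

# Toy chain-structured Bayes net over binary variables Y0 -> Y1 -> Y2 -> Y3.
# The CPTs below are invented for illustration.
p_y0 = {1: 0.6, 0: 0.4}
p_child_given_parent = {1: {1: 0.8, 0: 0.2},   # P(child | parent = 1)
                        0: {1: 0.3, 0: 0.7}}   # P(child | parent = 0)

def joint(y):
    """Joint probability of a full assignment y = (y0, y1, y2, y3)."""
    p = p_y0[y[0]]
    for parent, child in zip(y, y[1:]):
        p *= p_child_given_parent[parent][child]
    return p

# The true conditional the learner must match: P(Y3 = 1 | Y0 = 1),
# even if Y0 and Y3 never appear together in the training samples.
num = sum(joint(y) for y in itertools.product([0, 1], repeat=4) if y[0] == 1 and y[3] == 1)
den = sum(joint(y) for y in itertools.product([0, 1], repeat=4) if y[0] == 1)
print(f"P(Y3=1 | Y0=1) = {num / den:.3f}")
```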

Figure 1: Overview of our training and estimation setup. A: visualization of a Bayes net. The pink variable is an example observed variable and the yellow variable is an example target variable. Grey variables are examples of useful intermediate variables for reasoning. Lines show examples of local neighborhoods from which training samples are drawn. B: format of the training samples. C: illustration of direct prediction and free generation estimators as prompts. We prompt the model to either immediately predict the target variable (direct prediction), or do so after generating intermediate variables and their values (free generation). We then compute mean squared errors between estimated and true conditional probabilities. D: mean squared error by number of training tokens for each train condition and estimator

For those of you who have difficulty sleeping until you see the formal definitions, you can catch them below-

There are three kinds of estimators used by the authors (a toy sketch follows this list)-

  1. Direct prediction: Simply use the model to directly estimate the probability of the target variable given the value of the observed variable. This serves as a baseline.
  2. Scaffolded generation: The scaffolded generation estimator represents ideal reasoning if we knew the best set of steps to work through. A scaffold is an ordered set S consisting of variables that were each observed with another scaffold variable and collectively d-separate the observed variable from the target variable. In the case of a chain, the scaffold consists of all variables between Yi and Yj. Variables are ordered by their distance from the observed variable in the Bayes net. We estimate each variable given the observed variable and previously-generated scaffold variables using q before estimating the target probability. We approximately marginalize over the scaffold variables’ values using M Monte Carlo samples from the conditional distributions.
  3. Free generation: This is like scaffolded generation, but free generation uses the model to also choose which variables to instantiate rather than just estimating their values. The authors sample variable indices and values from q until it generates the index of the target variable. The probability of the target variable is then averaged over M such samples. This estimator tests whether trained models spontaneously generate useful intermediate variables.
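Here is a toy sketch of those estimators on the same kind of chain, with the assumptions flagged loudly: `q_next` stands in for the trained model q's one-step conditional estimates, the scaffold is hard-coded to the chain's intermediate variables, and direct prediction is mocked as returning the marginal of the target (a stand-in for an unreliable direct guess, not the paper's actual definition, which simply queries the model once).

```python
import random

# Toy sketch of the estimators on a chain Y0 -> Y1 -> Y2 -> Y3.
# q_next stands in for the trained model q; in the paper q is an autoregressive
# transformer, not the true conditionals used here.

P_NEXT = {1: 0.8, 0: 0.3}       # assumed P(child = 1 | parent)
P_Y0 = 0.6                      # assumed marginal P(Y0 = 1)
SCAFFOLD_LENGTH = 2             # Y1 and Y2 d-separate Y0 from Y3 in this chain

def q_next(parent_val: int) -> float:
    """Model's estimate of P(child = 1 | parent = parent_val)."""
    return P_NEXT[parent_val]

def direct_prediction() -> float:
    """One-shot estimate of P(Y3 = 1 | Y0 = 1).
    Mocked here as the marginal P(Y3 = 1), standing in for an unreliable direct guess."""
    p = P_Y0
    for _ in range(SCAFFOLD_LENGTH + 1):        # marginalize forward to Y3
        p = p * P_NEXT[1] + (1 - p) * P_NEXT[0]
    return p

def scaffolded_generation(m: int = 20_000) -> float:
    """Sample the scaffold Y1, Y2 step by step from q, then estimate Y3; average over m samples."""
    total = 0.0
    for _ in range(m):
        val = 1                                 # observed variable: Y0 = 1
        for _ in range(SCAFFOLD_LENGTH):        # walk the scaffold in order
            val = 1 if random.random() < q_next(val) else 0
        total += q_next(val)                    # q's estimate of P(Y3 = 1 | Y2 = val)
    return total / m

def free_generation(m: int = 20_000) -> float:
    """Like scaffolded generation, but the model also chooses which variable to generate next.
    In a plain chain the only sensible path is the scaffold itself, so we reuse the same walk."""
    return scaffolded_generation(m)

print("direct prediction (mocked as marginal):", round(direct_prediction(), 3))      # ~0.60
print("scaffolded generation:", round(scaffolded_generation(), 3))                   # ~0.65
print("free generation:", round(free_generation(), 3))                               # ~0.65
```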

The training data is generated using the pseudocode given in the paper.

Once the Bayes nets have been generated and filtered according to certain criteria, we can sample the observed variables. The samples are constructed according to three important criteria (a rough sketch of the procedure follows this list)-

  1. Locality- Observed samples contain only variables from a local neighborhood, consisting of a central variable along with all variables of distance at most k from it. To sample from the observation distribution, we sample the central variable uniformly at random and then sample k from some distribution that controls the local neighborhood size.
  2. Variable dropout- Even within a local subset of the world, we may not see everything at once. Certain variables may be missing or go unnoticed. We formalize this intuition with variable dropout. With some probability (0.2 in our experiments), variables are dropped from a local neighborhood and the learner does not see them. I really like the use of variable dropout because it may also help a model generalize to more unseen pairs. Multiple research papers, including this one that we broke down, have shown that the integration of dropout in models can be a game-changer for performance.
  3. Held-out pairs- Finally, target pairs of variables are held out across all training data. Performance at matching conditional probabilities for these pairs is our main performance metric. If a local neighborhood, after variable dropout, would include a pair of variables we decided to hold out, we randomly remove one of the two variables in the pair from the sample.
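To make those three ideas concrete, here is a rough sketch of the sampling procedure for a toy chain of ten variables. The neighborhood-size distribution, the per-variable 0.2 dropout rate, and the single held-out pair are illustrative choices; the paper works with general Bayes nets and its own settings.

```python
import random

# Rough sketch of the training-sample generator for a chain Bayes net over Y0..Y9.
N_VARS = 10
DROPOUT = 0.2                         # probability of dropping each variable
HELD_OUT = {(0, 9)}                   # pairs never seen together in training (illustrative)

def sample_local_neighborhood() -> list[int]:
    """Pick a central variable, then include all variables within distance k of it."""
    center = random.randrange(N_VARS)
    k = random.choices([1, 2, 3], weights=[0.5, 0.3, 0.2])[0]   # neighborhood size
    return [v for v in range(N_VARS) if abs(v - center) <= k]

def apply_variable_dropout(neighborhood: list[int]) -> list[int]:
    """Each variable in the neighborhood is independently dropped with probability DROPOUT."""
    kept = [v for v in neighborhood if random.random() >= DROPOUT]
    return kept or neighborhood[:1]   # keep at least one variable

def enforce_held_out_pairs(variables: list[int]) -> list[int]:
    """If both members of a held-out pair survive, randomly remove one of them."""
    for a, b in HELD_OUT:
        if a in variables and b in variables:
            variables.remove(random.choice([a, b]))
    return variables

def sample_training_variables() -> list[int]:
    return enforce_held_out_pairs(apply_variable_dropout(sample_local_neighborhood()))

print([sample_training_variables() for _ in range(3)])
```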

This is a fairly comprehensive way to account for the limitations of perception in learning. The authors combine this with control conditions-

We also create two control conditions to demonstrate the importance of a local observation distribution. As one control, we consider training data made up of local neighborhoods from the wrong Bayes net. This maintains the co-occurrence distribution structure, but the co-occurrences do not reflect the structure of which variables influence each other. As another control, we use a fully-observed condition where each sample contains almost all of the variables in the Bayes net. One of the two variables in each held-out pair is randomly dropped, but all other variables are included. These controls enable us to test whether local structure in the training data drives the value of reasoning.

- The researchers were very thorough with this one. One of the benefits of reading high-level research is the exposure to well-designed experiments.

and even a test to see how irrelevant variables influence the results-

We also introduce negative scaffolded generation as a control estimator that generates irrelevant intermediate variables. For each pair of variables, we select a random set of variables equal in size to the scaffold, but which does not include any of the scaffold variables. We prompt the language model to generate values for the negative scaffolds in the same way as in scaffolded generation.

These were the major components that stood out to me. Let’s move on to evaluating some of the results of their experiment-

The Results

The researchers had some interesting results that are worth paying attention to-

Firstly, we see that step-by-step prompting works when the observation distribution has the correct locality structure. When the training data is structured locally with respect to strong dependencies, both scaffolded and free generation perform significantly better than direct prediction — the reasoning gap. Scaffolded and free generation also perform significantly better than negative scaffolded generation, indicating that relevant intermediate variables help in predicting the target variable, but irrelevant intermediate variables do not.

Take a look at the image below.

It has some interesting implications-

  1. More intermediate variables don’t seem to correlate with accuracy. This is somewhat counter-intuitive to me, because I would have assumed that longer traces would lead to worse results.
  2. The paths that go most wrong are the ones generated under the wrong local structure. This implies that training on local clusters of variables is valuable because it helps free generation produce intermediate variables that are relevant to the relationship between the observed and target variables.
  3. Local training produced fewer intermediate variables than fully observed training (another surprise to me). This, combined with the performance results, implies that locally structured training data might just be a more efficient training approach than fully observed training.

We will now explore that last point. Take a look at the following analysis by the researchers-

This has great potential for training LLMs more efficiently. When I worked on converting English statements (written by business users) into SQL queries that had to be executed (potentially joining multiple tables), one thing I quickly learned was that the AI could only do so much. I was able to build a somewhat working prototype by instead using a relatively basic AI (compared to the monstrosities we see these days) and focusing all of my efforts on restructuring the datasets in ways that made it easier for the AI to interact with them. This seems to be a similar principle.

Figure 4: Learning curves comparing mean squared error on held-out pairs, estimated using free and direct prediction for training data consisting of either geometrically-sized local neighborhoods or the full set of variables. Unlike the version reported in the main text, no pairs of variables are held out in this fully-observed condition. Even though the model is trained directly on the held-out pairs in the fully-observed condition, there is a substantial data efficiency advantage to using locally-structured training data and free generation at inference time.

The authors also discovered something very interesting about when Step by Step Prompting does not work- the worse a training condition does at matching the true conditional probability, the better it matches the marginal. The language models trained on data with the wrong locality structure generated estimates that were particularly close to the marginal probabilities. When the variables that co-occur with each other frequently are not local in the Bayes net, they often have very little influence on each other. This means that the joint distribution over co-occurring variables is usually very close to the product of the marginal probabilities, i.e. P(X1, X2, X3) ≈ P(X1)P(X2)P(X3) for non-local X1, X2, X3. Without the ability to estimate conditional probabilities accurately, there are no reliable ‘steps’ for step-by-step reasoning to use.
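As a quick numeric check of that point, here is a sketch using the same illustrative chain conditionals as the earlier toy examples: as the distance between two variables in the chain grows, their joint probability drifts toward the product of their marginals, leaving less and less conditional structure for step-by-step reasoning to exploit.

```python
# Same made-up chain conditionals as before: P(child = 1 | parent) = 0.8 or 0.3.
P_NEXT = {1: 0.8, 0: 0.3}
P_FIRST = 0.6   # marginal P(Y0 = 1)

def p_one_given_first(distance: int, first_val: int) -> float:
    """P(Y_distance = 1 | Y0 = first_val), by forward marginalization along the chain."""
    p = 1.0 if first_val == 1 else 0.0
    for _ in range(distance):
        p = p * P_NEXT[1] + (1 - p) * P_NEXT[0]
    return p

for d in (1, 3, 6, 10):
    p_target_given_obs = p_one_given_first(d, 1)
    p_target_marginal = P_FIRST * p_one_given_first(d, 1) + (1 - P_FIRST) * p_one_given_first(d, 0)
    joint = P_FIRST * p_target_given_obs          # P(Y0 = 1, Yd = 1)
    product = P_FIRST * p_target_marginal         # P(Y0 = 1) * P(Yd = 1)
    print(f"distance {d:2d}: joint = {joint:.4f}   product of marginals = {product:.4f}")
```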

These results combine well to support the main hypothesis of the authors-

reasoning is effective when training data consists of local clusters of variables that influence each other strongly. These training conditions enable the chaining of accurate local inferences in order to estimate relationships between variables that were not seen together in training.

As an interesting aside, the authors noted that learning from local structures resembled human learning. This was fairly interesting because it reminded me of the chess master experiment. In that experiment, chess masters and noobs were asked to look at the configuration of pieces on a chess board and recreate that board from memory on a fresh one. The masters were able to recreate the board using way fewer glances than the noobs. However, what was interesting is that when the experiment was repeated with the pieces placed at random (creating configurations that would never exist in a chess match), there was no difference in performance between noobs and masters.

This experiment was used to show that the superior performance in the first task was not due to an inherent superiority in chess masters’ mental abilities, but rather to greater familiarity with chess boards and configurations, which leads to superior pattern matching. Pattern matching is the key to expert-level performance, and local structuring might be what enables this pattern matching in LLMs. We covered this experiment in my article, How to Learn and Master Skills, on my other publication Tech Made Simple here.

That is it for this piece. I appreciate your time. As always, if you’re interested in reaching out to me or checking out my other work, links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.

Save the time, energy, and money you would burn by going through all those videos, courses, products, and ‘coaches’ and easily find all your needs met in one place at ‘Tech Made Simple’! Stay ahead of the curve in AI, software engineering, and the tech industry with expert insights, tips, and resources. 20% off for new subscribers by clicking this link. Subscribe now and simplify your tech journey!

Using this discount will drop the prices-

800 INR (10 USD) → 640 INR (8 USD) per Month

8000 INR (100 USD) → 6400 INR (80 USD) per year

Get 20% off for 1 year

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here.

To help me understand you, fill out this survey (anonymous)

Small Snippets about Tech, AI and Machine Learning over here

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819
