LLM evals are a way to measure the performance of a language model, typically by scoring its responses to a set of prompts. What is often overlooked is that, since LLMs are stochastic, evals are a statistical process, and as such they have statistical properties worth understanding. This blog post explores the statistics of LLM evals, and how to use them to get a better understanding of the performance of a model.
Note: This post is a work in progress while I read through some reference materials on evals; I will update it as I go along.
The statistical framing of LLM evals
According to the paper Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations, we can’t just take the mean of the eval scores to get the performance of a model. We need to take the statistical properties of the evals into account:
Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations.
My initial thinking was that the idea is simple: since LLMs are stochastic in nature, we can think of evals as experiments. A question-and-answer pair is a realization of a random variable. Hence we need to sample from the population of possible answers to a given question to estimate the performance of a model on that question.
My intuition is that this is essentially the same problem as A/B testing and experimentation, but for LLMs.
Cameron Wolfe, in a substack post Statistics of LLM Evals, references the following framing:
In theory, when evaluating an LLM, there exists a super-population of questions (illustrated below) that exhaustively covers all the ways in which the LLM can be evaluated. Practically speaking, any evaluation dataset represents only a finite subset of questions from this super-population (…).
This frames the problem a little bit differently. Here, his framing is not that the LLM response is a random variable, but that there is a “super-population” of questions that covers all ways you can benchmark the LLM. However, according to this, a specific benchmark dataset is only a finite subset of questions from this super-population.
This framing, as far as I can tell, allows us to consider the problem of evaluating the skill of a model at a given task, instead of just attempting to maximize the benchmark score. The difference is subtle, but here it’s the benchmark that is stochastic, and not (just) the model itself.
But, if this is the case, then each evaluation question in the dataset is a realization of the random variable that is the skill of the model at a given task. So we can “look through the looking glass”, to quote the paper, and get a better understanding of both the score and the uncertainty of the model’s skill.
Key recommendations from the paper
The paper provides the following key recommendations:
- When questions are i.i.d., LLM evals should report standard errors computed via the Central Limit Theorem (CLT).
- If the questions are not i.i.d., meaning they are drawn from clusters or groups, the naive CLT standard error is not applicable; one should use a clustered standard error instead.
- To reduce the variance, we can resample the outputs from the model multiple times. This, from my viewpoint, is a way to average out the stochasticity of the model’s responses.
- When comparing two models, one should use paired difference tests. And, one might add, if more than two models are being compared, one should use ANOVA or similar approaches.
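As a rough sketch of the paired approach (using NumPy; the scores and effect size here are synthetic, purely for illustration), we score both models on the same questions and test the per-question differences:

```python
import numpy as np

# Hypothetical per-question binary scores for two models on the SAME questions.
rng = np.random.default_rng(0)
scores_a = rng.integers(0, 2, size=500).astype(float)
# Model B answers everything A got right, plus ~10% of the rest (synthetic).
scores_b = np.clip(scores_a + rng.choice([0, 1], size=500, p=[0.9, 0.1]), 0, 1)

diffs = scores_a - scores_b              # paired differences, one per question
mean_diff = diffs.mean()
se_diff = diffs.std(ddof=1) / np.sqrt(len(diffs))
z = mean_diff / se_diff                  # z-statistic for H0: no difference

print(f"mean difference: {mean_diff:.4f}, SE: {se_diff:.4f}, z: {z:.2f}")
```

Pairing removes the question-to-question variance shared by both models, which is why it is more powerful than comparing the two means independently.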
Details on the dataset used in the paper
There are some details regarding the dataset used in the paper that we should keep in mind. First, the eval dataset has $n$ questions, and each question $i$ receives an evaluation score $s_i$. This score can either be a binary correctness score or an LLM-as-a-Judge score.
A score can be thought of as:

$$s_i = \mu + \epsilon_i$$

where $\mu$ is the expected score, and $\epsilon_i$ is the error term. The error term is assumed to be normally distributed with mean 0 and variance $\sigma^2$.
Simplest case: i.i.d. questions
The simplest case is when the questions are i.i.d., meaning they are independent and identically distributed. We want to know $\mu = \mathbb{E}[s_i]$, i.e., the expected score over the entire population of questions. Since we only have a sample of $n$ questions, we can estimate the expected score as the sample mean $\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s_i$. We know from statistics that the sample mean approximates the expected value as the sample size increases. So we can say that $\bar{s} \approx \mu$.
The standard error of the sample mean is given by:

$$\text{SE}_{\bar{s}} = \frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is the standard deviation of the scores.
If we assume that the scores are binary, hence following a Bernoulli distribution, i.e., $s_i \sim \text{Bernoulli}(p)$, we have that $\text{SE}_{\bar{s}} = \sqrt{\frac{\bar{s}(1 - \bar{s})}{n}}$, where $\bar{s}$ is the sample mean of the scores.
The 95% confidence interval for the expected score is given by:

$$\bar{s} \pm 1.96 \cdot \text{SE}_{\bar{s}}$$

where 1.96 is the z-score for a 95% confidence interval.
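A minimal sketch of the CLT-based interval, assuming binary scores (the dataset here is synthetic):

```python
import numpy as np

# Hypothetical binary scores from an eval run (1 = correct, 0 = incorrect).
rng = np.random.default_rng(42)
scores = rng.integers(0, 2, size=1000).astype(float)

n = len(scores)
mean = scores.mean()                       # sample mean = estimate of mu
se = np.sqrt(mean * (1 - mean) / n)        # Bernoulli standard error
ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se

print(f"score: {mean:.3f} ± {1.96 * se:.3f} (95% CI: [{ci_low:.3f}, {ci_high:.3f}])")
```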
As with all experimental designs, the question boils down to: how large does $n$ need to be to reach a desired level of precision, and get a “good” estimate of the expected score?
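As a back-of-the-envelope answer, we can invert the CI half-width formula to get $n = (z/m)^2 \, p(1-p)$ for a desired half-width $m$. A small sketch, using the worst case $p = 0.5$ (which maximizes the Bernoulli variance, so the answer is conservative):

```python
import math

# Worst-case sample size for a desired 95% CI half-width, assuming binary
# scores. p = 0.5 maximizes p * (1 - p), so this is a conservative estimate.
def required_n(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    return math.ceil((z / margin) ** 2 * p * (1 - p))

print(required_n(0.03))  # → 1068 questions for a ±3-point interval
```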
What about bootstrapping, which is common in ML model evaluation? The author of the paper suggests that while it is valid and common in LLM evals, it is not necessary when the CLT applies. For reference, bootstrapping is the process where we:
- Sample with replacement from the dataset
- Calculate the sample mean
- Repeat steps 1 and 2 $B$ times
- The bootstrap standard error is the standard deviation of the sample means.
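The steps above can be sketched as follows (NumPy, synthetic binary scores; $B$ is the number of resamples):

```python
import numpy as np

# Bootstrap standard error sketch: resample with replacement B times and
# take the standard deviation of the resampled means.
rng = np.random.default_rng(7)
scores = rng.integers(0, 2, size=1000).astype(float)

B = 10_000
boot_means = np.array([
    rng.choice(scores, size=len(scores), replace=True).mean()  # steps 1 and 2
    for _ in range(B)
])
boot_se = boot_means.std(ddof=1)  # bootstrap SE = std of the sample means

# For i.i.d. data this should closely match the CLT-based standard error.
clt_se = scores.std(ddof=1) / np.sqrt(len(scores))
print(f"bootstrap SE: {boot_se:.4f}, CLT SE: {clt_se:.4f}")
```

The near-agreement of the two numbers is the author’s point: when the CLT applies, the bootstrap buys you nothing extra.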
Clustered questions
If the questions are not independent, meaning they are drawn from the same cluster or group, we need to use a clustered standard error. The reason is that the plain CLT standard error no longer applies when the questions are not i.i.d.: it underestimates the variance of the sample mean, which results in a confidence interval that is tighter than it should be.
What constitutes a cluster of non-independent questions? One example is the same prompt in different languages, or several questions referencing the same document or source. Something like this:

Source passage: a short biography of Nikola Tesla.

- Question A: “In what year was Tesla born?”
- Question B: “Who was his main rival in the ‘War of Currents’?”
- Question C: “Which laboratory did he establish in 1899?”
The clustered standard error is given by:

$$\text{SE}_{\text{clustered}} = \frac{1}{n}\sqrt{\sum_{c=1}^{C}\left(\sum_{i \in c}\left(s_i - \bar{s}\right)\right)^2}$$

where $C$ is the number of clusters and the inner sum runs over the questions in cluster $c$.
Note that we are now assuming the clusters themselves are independent; it’s only the questions within a cluster that are correlated.
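A sketch of this estimator in NumPy, as I read it from the paper: sum the residuals within each cluster first, then square across clusters. The dataset here is synthetic, with a shared per-cluster effect added to induce intra-cluster correlation:

```python
import numpy as np

# Clustered standard error: sum residuals within each cluster, then take the
# square root of the sum of squared cluster sums, divided by n.
def clustered_se(scores: np.ndarray, cluster_ids: np.ndarray) -> float:
    n = len(scores)
    resid = scores - scores.mean()
    cluster_sums = np.array([resid[cluster_ids == c].sum()
                             for c in np.unique(cluster_ids)])
    return float(np.sqrt((cluster_sums ** 2).sum()) / n)

rng = np.random.default_rng(3)
clusters = np.repeat(np.arange(100), 5)            # 100 clusters of 5 questions
effects = np.repeat(rng.normal(0, 0.2, 100), 5)    # shared per-cluster effect
scores = (rng.random(500) < np.clip(0.6 + effects, 0, 1)).astype(float)

naive_se = scores.std(ddof=1) / np.sqrt(len(scores))
print(f"naive SE: {naive_se:.4f}, clustered SE: {clustered_se(scores, clusters):.4f}")
```

With positively correlated clusters, the clustered SE should typically come out larger than the naive one, which is exactly the correction we want.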