6 min read
AI Hinges on Evals

As companies everywhere rush to build generative AI applications, the underlying LLM-based architecture brings new challenges to quality assurance teams. The recent crop of new LLMs - talk about a potentially evergreen statement! - is sure to cause another wave of adoption and upgrade across the industry. So now is a great time for leaders to become acquainted with those challenges so they can support their teams.

Six months ago, I approached this topic by talking about the importance of evaluations in the process of developing, maintaining, and improving generative AI products, tools, and applications. It’s still a major area of focus. Here’s how my thinking has evolved since.

BLUF

Bottom line up front, in three parts:

  1. Evals and the concept of generative AI product quality assessment are even more important than I thought, because mastering evals opens the door to continuously improving apps.
  2. The field is too green to outsource the responsibility to the emerging industry of providers; doing so now could be limiting.
  3. And rolling your own still requires significant elbow grease.

CONTEXT

(No, not that context. The old meaning.)

Old World

In classical software applications, a fixed UI constrains user input, which is then fed to an algorithm that produces a deterministic output. To test the application's quality, matched input-output pairs usually suffice because the assessment is direct, consistent, and objective.
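
For instance, a classical unit test might look like the sketch below; the function and values are made up for illustration:

```python
# Classical testing: fixed input, deterministic output, exact assertion.
def apply_discount(price: float, percent: float) -> float:
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    # One matched input-output pair verifies the behavior once and for all.
    assert apply_discount(100.0, 15.0) == 85.0
```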

New World

With generative AI apps, which I define as having one or more LLM calls, a chat/voice-based UI frees the user to input anything they want to an LLM, itself steered by an unconstrained prompt to generate a non-deterministic output. The quality assessment has objective and subjective dimensions, thus warranting new testing techniques, which we’ll look at in a minute.
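
To make the shift concrete, here's a minimal sketch of what a check can look like once exact-match assertions no longer apply; `call_llm` and `judge_llm` are hypothetical stand-ins for whatever client your stack uses, not any particular SDK:

```python
# Non-deterministic output: exact-match assertions no longer work.
# 'call_llm' and 'judge_llm' are hypothetical stand-ins, not a specific SDK.

def evaluate_reply(call_llm, judge_llm, user_message: str) -> dict:
    reply = call_llm(user_message)  # different wording on every run

    return {
        # Objective dimensions: cheap, deterministic checks.
        "within_length_budget": len(reply) <= 800,
        "mentions_refund_policy": "refund" in reply.lower(),
        # Subjective dimension: ask a second model to grade against a rubric.
        "empathy_score": judge_llm(
            f"Rate the empathy of this support reply from 1 to 5:\n{reply}"
        ),
    }
```

But first, an illustrative example.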

EXAMPLE

Here’s one of JustAnswer’s generative AI apps. The customer funnel contains an e-commerce bot that performs intake and routes the customer to the right expert. To do its job, that bot makes a dozen LLM calls per customer volley during a multi-turn interaction, each call with its own prompt, some chained, spread across three different models. A change to any of the twelve prompts or any of the three models will affect our ability to match customers with experts. That’s a lot of moving parts, but also a lot of opportunities for improvement.
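
To make those moving parts concrete, here's a heavily simplified, hypothetical sketch of that kind of intake-and-routing pipeline; the step names, prompts, and model labels are illustrative, not JustAnswer's actual implementation:

```python
# Hypothetical intake-and-routing pipeline: several prompts, several models,
# some steps chained, all exercised on every customer volley.

def route_customer(call_model, customer_message: str, history: list[str]) -> dict:
    # Step 1: classify the problem domain with a fast, cheap model.
    category = call_model("model-small", f"Classify this request: {customer_message}")

    # Step 2: extract intake details, chained on the classification above.
    details = call_model(
        "model-medium", f"Extract the key facts for a {category} case:\n{customer_message}"
    )

    # Step 3: pick the expert queue with a stronger model, using the full context.
    expert_queue = call_model(
        "model-large", f"History: {history}\nFacts: {details}\nChoose an expert queue."
    )

    # A change to any prompt or model above can shift the final routing.
    return {"category": category, "details": details, "expert_queue": expert_queue}
```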

And therein lies the rub. We’ve wanted to reevaluate the performance of our generative AI applications more frequently than we have. In practice, we’ve relied too long on slow, production-based A/B tests.

Why do we see a more frequent need to change things and retest?

POP QUIZ

Several factors can trigger the need for a quality assessment:

  • Genesis: Determining if a new generative AI app is better than what was there pre-genAI, or even an improvement over not having the capability at all.
  • Climate change: Evolving operating conditions, such as consumer preferences, may warrant adjusting the prompts to maintain performance.
  • Model swap: Different models, even from the same company and the same family, have exhibited very different behaviors. Want to take advantage of falling token prices? Retest.
  • Optimization: Tweaks to the prompts (system, user, examples, …) can eke out significant performance improvements, but they require validation.

Now, let’s dive into the nature of a test.

RUBRIC

The key element is a good rubric: a definition of what counts as good, or good enough, for a test to be considered successful. This definition depends entirely on the business domain, so it’s difficult to outsource that part of the quality testing job. Just as delegating work to human contractors does not absolve us from being crisp about the desired outcomes, delegating to a vendor still requires defining what good (or better) looks like. And as we know from our regular attempts at assessing human performance, that can be tricky and multi-faceted.

As previously stated, some of the criteria can be objective while others can be subjective. Here are some potential evals. In reality, you are only bound by your imagination. Sometimes, I like to picture them arranged in a Maslow-like hierarchy of needs:

  • Top of the pyramid: style
  • empathy, creativity
  • attribution, completeness, recall, bias
  • consistency, relevance, coherence, adherence
  • Base of the pyramid: latency, responsiveness, cost, readability, harmfulness, truthfulness
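
One lightweight way to make a rubric like that concrete is to write it down as data, with a weight and a scoring method per criterion. Here's a sketch; the criteria, weights, and thresholds are illustrative, not a recommendation:

```python
# A rubric captured as data: each criterion gets a weight and a scoring method.
# Criteria, weights, and thresholds are illustrative; yours depend on the domain.
RUBRIC = [
    {"criterion": "latency",      "weight": 0.20, "how": "objective: p95 under 3 seconds"},
    {"criterion": "truthfulness", "weight": 0.25, "how": "objective: claims match source docs"},
    {"criterion": "relevance",    "weight": 0.20, "how": "subjective: LLM judge, 1-5 scale"},
    {"criterion": "completeness", "weight": 0.20, "how": "subjective: LLM judge, 1-5 scale"},
    {"criterion": "style",        "weight": 0.15, "how": "subjective: periodic human spot check"},
]

# Weights should add up to 1 so the composite score stays comparable across runs.
assert abs(sum(c["weight"] for c in RUBRIC) - 1.0) < 1e-9
```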

GRADING

From here, one creates one or more tests per criterion and runs the suite, usually aided by a test runner framework to yield a composite score. That score can be compared to previous runs of the quality assessment to decide whether the change in the prompt and/or the model made things better or worse.
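
In rough pseudocode terms, that loop could look like the sketch below; `run_check` is a stand-in for whatever per-criterion test you wire up, returning a normalized score between 0 and 1:

```python
# Grade a candidate configuration (prompt + model) against the rubric,
# then compare the composite score with the previous baseline.

def composite_score(rubric, run_check) -> float:
    # run_check(criterion) returns a normalized score in [0, 1] for that criterion.
    return sum(c["weight"] * run_check(c["criterion"]) for c in rubric)

def is_improvement(rubric, run_check, baseline: float, min_gain: float = 0.01) -> bool:
    # Require a minimal gain so noise in subjective graders doesn't trigger a rollout.
    return composite_score(rubric, run_check) >= baseline + min_gain
```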

Thankfully, there is a lot of help available to wrangle the near-infinite number of dimensions one might want to test. The ecosystem of tools and open-source projects is expanding by the day. As of this writing, searching for any of the following points to the entrance of the rabbit hole: Promptfoo, DeepEval, RAGAS, OpenAI Evals.

Because the state of the art (SOTA) changes fast, it still feels reasonable to remain flexible by blending exploitation and exploration. My teams are currently evaluating a number of techniques and tools to see what works best for us. So, I’m not quite ready to endorse anything specific yet. I’m sure I’ll revisit this topic once we’ve accelerated the testing of our dozens of genAI apps.

Until then, here’s one more important reason to become great at evals.

EXTRA CREDIT

What’s particularly tantalizing about having good evals is that they open the door to automatically adapting and improving systems. In theory, and increasingly in practice, evals can run continuously against a live system while an optimization process tweaks prompts and models in search of improvement. Think of a permanent multi-armed bandit system. The popular DSPy open-source project is pointed in that direction. And that’s next on my list of things to explore.
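
For intuition, here's a toy, hedged sketch of what such a loop could look like: an epsilon-greedy bandit over prompt variants, rewarded by an eval score. It's not DSPy's API, just the general shape, and `score_variant` is a hypothetical stand-in for running an eval suite on live traffic:

```python
import random

# Toy epsilon-greedy bandit over prompt variants, rewarded by an eval score.
# 'score_variant' is a stand-in for running an eval suite on live traffic.

def optimize_prompt(variants: list[str], score_variant, rounds: int = 1000,
                    epsilon: float = 0.1) -> str:
    totals = {v: 0.0 for v in variants}
    counts = {v: 0 for v in variants}

    def average(v: str) -> float:
        return totals[v] / counts[v] if counts[v] else 0.0

    for _ in range(rounds):
        if random.random() < epsilon or not any(counts.values()):
            choice = random.choice(variants)     # explore
        else:
            choice = max(variants, key=average)  # exploit the best average so far
        totals[choice] += score_variant(choice)  # eval score, ideally in [0, 1]
        counts[choice] += 1

    # Ship the variant with the best average eval score.
    return max(variants, key=average)
```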