How to Use Promptfoo for LLM Testing

Stephen Collins · Feb 14, 2024 · 7 min read

What you will learn

  • Why is the principle 'untested software is broken software' emphasized in LLM development?
  • This principle highlights the necessity of rigorous testing in software development, especially for large language models (LLMs). Systematic evaluation of LLM outputs prevents ineffective solutions and enhances application robustness by ensuring that outputs meet quality standards.
  • What is Promptfoo and how does it improve LLM development?
  • Promptfoo is a CLI and library designed to support test-driven development for LLM applications. It allows developers to systematically test various aspects of LLM outputs, like JSON responses and cost effectiveness, making the evaluation process more efficient and structured compared to traditional trial-and-error methods.
  • What types of assertions can be used in Promptfoo to evaluate LLM outputs?
  • Promptfoo allows various assertions such as cost assertions to manage resource efficiency, contains-JSON assertions to ensure valid output formats, answer-relevance assertions to verify thematic consistency, and LLM-rubric assertions for qualitative evaluation based on creativity or detail.
  • How does Promptfoo facilitate comparison between different LLM outputs?
  • Promptfoo enables side-by-side comparisons of outputs from multiple LLM providers, helping developers identify quality variances and regressions quickly, which is crucial for determining the best-performing model for a given application.
  • What advantages does using Promptfoo provide for LLM application development?
  • Promptfoo offers numerous advantages including being battle-tested for scalability, simplicity in defining evaluations, language agnosticism for integration in various coding environments, and collaboration features that improve teamwork, all while ensuring privacy as it runs locally and is open-source.

“Untested software is broken software.”

As developers writing code for production environments, we take this principle seriously, and it holds especially true when working with large language models (LLMs). To build robust applications, you need the ability to systematically evaluate LLM outputs. Relying on trial and error is not only inefficient, it frequently leads to less-than-ideal results.

Enter Promptfoo, a CLI and library designed to bring test-driven development to LLM applications. In this tutorial, I'll walk through a sample project focused on inventive storytelling and use it to showcase Promptfoo capabilities such as validating JSON responses, tracking model cost, and checking adherence to instructions.

You can access all the code in the companion GitHub repository for this blog post.

What is Promptfoo?

Promptfoo is a comprehensive tool that facilitates the evaluation of LLM output quality in a systematic and efficient manner. It allows developers to test prompts, models, and Retrieval-Augmented Generation (RAG) setups against predefined test cases, thereby identifying the best-performing combinations for specific applications. With Promptfoo, developers can:

  • Perform side-by-side comparisons of LLM outputs to detect quality variances and regressions.
  • Utilize caching and concurrent testing to expedite evaluations.
  • Automatically score outputs based on predefined expectations.
  • Integrate Promptfoo into existing workflows either as a CLI or a library.
  • Work with a wide range of LLM APIs, including OpenAI, Anthropic, Azure, Google, HuggingFace, open-source models like Llama, and even custom API providers.

The philosophy behind Promptfoo is simple: embrace test-driven development for LLM applications to move beyond the inefficiencies of trial-and-error. This approach not only saves time but also ensures that your applications meet the desired quality standards before deployment.
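If you want to kick the tires before diving into the demo project, a minimal quickstart looks roughly like the following. Treat it as a hedged sketch of the CLI workflow; command names and flags can shift between Promptfoo versions, so check the docs for the version you install.

# Scaffold a starter promptfooconfig.yaml in the current directory
npx promptfoo@latest init

# Provide API keys for whichever providers your config lists
export OPENAI_API_KEY=sk-...

# Run the evaluation defined in promptfooconfig.yaml
npx promptfoo@latest eval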

Demo Project: Creative Storytelling with Promptfoo

To illustrate the capabilities of Promptfoo, let’s go over our demo project centered on creative storytelling. This project uses a configuration file (promptfooconfig.yaml) that defines the evaluation setup for generating diary entries set in various contexts, such as a mysterious island, a futuristic city, and an ancient Egyptian civilization.

Project Setup

Writing the Prompt

The core of our evaluation is the prompt defined in prompt1.txt, which instructs the LLM to generate a diary entry from someone living in a specified context (e.g., a mysterious island). The output must be a JSON object containing metadata (person’s name, location, date) and the diary entry itself. Here’s the entire prompt1.txt for our project:

Write a diary entry from someone living in {{topic}}.
Return a JSON object with metadata and the diary entry.
The metadata should include the person's name, location, and the date.
The date should be the current date.
The diary entry key should be named "diary_entry" and its value should be a raw string.

An example of the expected output is:

{
  "metadata": {
    "name": "John Doe",
    "location": "New York",
    "date": "2020-01-01"
  },
  "diary_entry": "Today was a good day."
}

This is a fairly simple prompt asking the LLM for JSON output. Promptfoo uses Nunjucks templates (the {{topic}} placeholder in prompt1.txt) to inject variables defined in promptfooconfig.yaml.
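To make the templating concrete: with topic set to "a mysterious island", the first line of the rendered prompt that Promptfoo sends to each provider becomes:

Write a diary entry from someone living in a mysterious island.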

More information can be found in Promptfoo's Input and output files docs.

The promptfooconfig.yaml

The promptfooconfig.yaml file outlines the structure of our evaluation. It includes a description of the project, specifies the prompts, lists the LLM providers (with their configurations), and defines the tests, each with assertions that evaluate output quality based on cost, content relevance, and a required JSON structure. The example promptfooconfig.yaml isn't long, so here is the whole file:

description: "Creative Storytelling"
prompts: [prompt1.txt]
providers:
  - id: "mistral:mistral-medium"
    config:
      temperature: 0
      max_tokens: 1000
      safe_prompt: true
  - id: "openai:gpt-3.5-turbo-0613"
    config:
      temperature: 0
      max_tokens: 1000
  - id: "openai:gpt-4-0125-preview"
    config:
      temperature: 0
      max_tokens: 1000
tests:
  - vars:
      topic: "a mysterious island"
    assert:
      - type: cost
        threshold: 0.002
      - type: "contains-json"
        value:
          {
            "required": ["metadata", "diary_entry"],
            "type": "object",
            "properties":
              {
                "metadata":
                  {
                    "type": "object",
                    "required": ["name", "location", "date"],
                    "properties":
                      {
                        "name": { "type": "string" },
                        "location": { "type": "string" },
                        "date": { "type": "string", "format": "date" },
                      },
                  },
                "diary_entry": { "type": "string" },
              },
          }
  - vars:
      topic: "a futuristic city"
    assert:
      - type: answer-relevance
        value: "Ensure that the output contains content about a futuristic city"
      - type: "llm-rubric"
        value: "ensure that the output showcases innovation and detailed world-building"
  - vars:
      topic: "an ancient Egyptian civilization"
    assert:
      - type: "model-graded-closedqa"
        value: "References Egypt in some way"

The Assertions Explained

Promptfoo offers a versatile suite of assertions to evaluate LLM outputs against predefined conditions or expectations, ensuring the outputs meet specific quality standards. These assertions are categorized into deterministic eval metrics and model-assisted eval metrics. Here’s a deep dive into each assertion used in the preceding example promptfooconfig.yaml for our creative storytelling project.

Cost Assertion

The cost assertion verifies that the inference cost of generating an output stays below a predefined threshold, which is crucial for managing spend as an LLM application scales. In our example, it ensures that generating a diary entry for “a mysterious island” remains cost-effective, with the threshold set at 0.002 (USD).

Contains-JSON Assertion

This assertion (contains-json) checks whether the output contains valid JSON that matches a specific schema. It’s particularly useful for structured data outputs, ensuring they adhere to the expected format. In the creative storytelling example, this assertion validates the JSON structure of the diary entry, including required fields like metadata (with subfields name, location, and date) and diary_entry.

Answer-Relevance Assertion

The answer-relevance assertion evaluates whether the LLM output is relevant to the original query or topic. This ensures that the model’s responses are on-topic and meet the user’s intent. For the futuristic city prompt, this assertion confirms that the content indeed revolves around a futuristic city, aligning with the user’s request for thematic accuracy.

LLM-Rubric Assertion

An llm-rubric assertion uses a Language Model to grade the output against a specific rubric. This method is effective for qualitative assessments of outputs, such as creativity, detail, or adherence to a theme. For our futuristic city scenario, this assertion evaluates whether the output demonstrates innovation and detailed world-building, as expected for a narrative set in a futuristic environment.

Model-Graded-ClosedQA Assertion

This model-graded-closedqa assertion uses Closed QA grading (based on OpenAI Evals) to check that the output meets specific criteria, which is useful for factual correctness and thematic relevance. In the case of “an ancient Egyptian civilization,” it verifies that the output references Egypt in some manner, ensuring historical or thematic accuracy.
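One practical note on the model-assisted assertions (answer-relevance, llm-rubric, and model-graded-closedqa): they rely on an LLM to do the grading. The sketch below shows how that grader can be overridden for every test in the config; it assumes the defaultTest and options.provider keys behave as described in Promptfoo's model-graded metrics docs, so verify the exact key names against the version you're running.

# Hedged sketch: use a specific model as the grader for all
# model-assisted assertions in this config
defaultTest:
  options:
    provider: "openai:gpt-4-0125-preview"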

Running the Evaluation

With Promptfoo, executing this evaluation is straightforward: developers run the tests from the command line, and Promptfoo compares outputs from the different LLM providers against the specified criteria. This helps identify which LLM performs best for creative storytelling within the defined parameters. I've provided a simple test script (leveraging npx) in the project's package.json, which you can run from the root of the repository:

npm run test
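If you'd rather skip the npm script, the same evaluation can typically be run straight from the Promptfoo CLI. This is a hedged equivalent, not taken from the companion repository, and it assumes promptfooconfig.yaml sits at the repository root:

# Run the evaluation described in promptfooconfig.yaml;
# -c points Promptfoo at a specific config file
npx promptfoo@latest eval -c promptfooconfig.yaml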

Analyzing the Results

Promptfoo produces matrix views in the terminal that make it easy to scan outputs across multiple prompts and inputs, as well as a web UI for more in-depth exploration of the results. These features are invaluable for spotting trends, understanding model strengths and weaknesses, and making informed decisions about which LLM to use for your specific application.
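After an eval run finishes, the web UI is one command away (again via npx; exact flags may vary by version):

# Open the local results viewer for recent eval runs
npx promptfoo@latest view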

For more information on viewing Promptfoo's test results, check out Promptfoo's Usage docs.

Why Choose Promptfoo?

Promptfoo stands out for several reasons:

  • Battle-tested: Designed for LLM applications serving millions of users, Promptfoo is both robust and adaptable.
  • Simple and Declarative: Define evaluations without extensive coding or the use of cumbersome notebooks.
  • Language Agnostic: Work in Python, JavaScript, or your preferred language.
  • Collaboration-Friendly: Share evaluations and collaborate with teammates effortlessly.
  • Open-Source and Private: Promptfoo is fully open-source and runs locally, ensuring your evaluations remain private.

Conclusion

Promptfoo may very well become the Jest of LLM application testing.

By integrating Promptfoo into your development workflow (and CI/CD process), you can significantly enhance the efficiency, quality, and reliability of your LLM applications.
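As a sketch of what that CI integration might look like, here is an assumed GitHub Actions workflow (not part of the companion repository) that runs the evaluation on every pull request; swap in secrets for whichever providers your config uses:

# .github/workflows/llm-evals.yml (hypothetical, for illustration)
name: llm-evals
on: [pull_request]
jobs:
  promptfoo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Runs the project's npm test script, which wraps the Promptfoo eval
      - run: npm run test
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}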

Whether you’re developing creative storytelling applications or any other LLM-powered project, Promptfoo offers the features and flexibility needed to add confidence to your LLM integrations through a robust set of testing utilities.