Five Tools to Help You Leverage Prompt Versioning in Your LLM Workflow

Published on
May 3, 2024

In our experience, fine-tuning inputs to coax the right answers out of language models requires some level of prompt versioning; otherwise, it becomes extremely difficult to keep track of changes past a couple of versions.

So the next question becomes: What kind of versioning system or tool is best?

You’ll want a versioning system that matches the complexity of your prompts. For simple prompts (pure text, or f-strings containing a placeholder or two), you could consider existing structures and tools such as:

  • Lists and nested dictionaries for defining a base prompt, inserting variables as needed from another dictionary, with some rudimentary versioning thrown in.
  • Git, or external configurations like environment variables.

These methods could work for simpler prompts, but they’re not terribly efficient and may break down at scale because of growing overhead, or because they’re not well integrated with whatever prompt engineering tool or workflow you’re using. 
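To make the dictionary approach concrete, here is a minimal sketch of rudimentary versioning with nested dictionaries. All names (`PROMPTS`, `ACTIVE`, `render_prompt`) and the version labels are hypothetical and not part of any library.

```python
from typing import Optional

# Hypothetical store: prompt name -> version label -> template.
PROMPTS = {
    "recommend_book": {
        "v1": "Recommend a {genre} book.",
        "v2": "Recommend a {genre} book for a reader who enjoyed {title}.",
    },
}

# Which version each prompt currently uses; this is updated by hand.
ACTIVE = {"recommend_book": "v2"}


def render_prompt(name: str, variables: dict, version: Optional[str] = None) -> str:
    """Fill a stored template with variables, defaulting to the active version."""
    chosen = version or ACTIVE[name]
    return PROMPTS[name][chosen].format(**variables)


print(render_prompt("recommend_book", {"genre": "fantasy", "title": "The Hobbit"}))
print(render_prompt("recommend_book", {"genre": "fantasy"}, version="v1"))
```

Note what the sketch leaves out: nothing records who changed a template, when, or why, and the model parameters live somewhere else entirely. That is the overhead that tends to grow as prompts multiply.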

We’d even argue that, due to their highly iterative and experimental nature, prompts deserve special management. They require a versioning system that’s specifically designed to handle the dynamic nature of prompt development, where each variation needs to be tracked, compared, and reverted in order to quickly refine prompts and deploy them in a controlled, traceable manner.

In fact, our experience suggests that any prompt management system should, at a minimum:

  • Organize a central workspace (whether that workspace is local or located on a remote hub) for managing prompts.
  • Offer a structured and intuitive versioning system, along with a way to document changes that were made in each version.
  • Enable easy navigation between different versions, and the ability to retrieve any specific version.
  • Provide an audit trail of who made changes, when these changes were made, and why.
  • Be well integrated with your prompt development tool, which in turn either integrates with, or supports LLM pipeline functionalities.
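As a rough illustration of these requirements (not how any of the tools below implement them), the following sketch keeps prompts in one in-memory workspace, numbers each version, records an audit trail, and supports retrieval and rollback. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    number: int
    template: str
    author: str           # who made the change
    message: str          # why the change was made
    created_at: datetime  # when the change was made


@dataclass
class PromptRegistry:
    """Hypothetical in-memory workspace; a real one would persist to disk or a hub."""

    history: dict = field(default_factory=dict)

    def commit(self, name: str, template: str, author: str, message: str) -> PromptVersion:
        """Store a new version of a prompt and return it."""
        versions = self.history.setdefault(name, [])
        version = PromptVersion(
            number=len(versions) + 1,
            template=template,
            author=author,
            message=message,
            created_at=datetime.now(timezone.utc),
        )
        versions.append(version)
        return version

    def get(self, name: str, number: Optional[int] = None) -> PromptVersion:
        """Return a specific version, or the latest one when no number is given."""
        versions = self.history[name]
        return versions[-1] if number is None else versions[number - 1]

    def rollback(self, name: str, number: int, author: str) -> PromptVersion:
        """Restore an old template by committing it again, so history is never rewritten."""
        old = self.get(name, number)
        return self.commit(name, old.template, author, f"Roll back to version {number}")

    def log(self, name: str) -> list:
        """Return the audit trail as one formatted line per version."""
        return [
            f"{v.number:04d} {v.created_at:%Y-%m-%d} {v.author}: {v.message}"
            for v in self.history[name]
        ]


registry = PromptRegistry()
registry.commit("greeting", "Hello, {name}.", "alice", "Initial version")
registry.commit("greeting", "Hi {name}, how can I help?", "bob", "Friendlier tone")
registry.rollback("greeting", 1, "alice")

print(registry.get("greeting").template)
for line in registry.log("greeting"):
    print(line)
```

Rolling back by committing the old template again is a design choice in this sketch: the audit trail stays append-only, so you can always see that a rollback happened and who triggered it.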

Below, we list five prompting tools that implement prompt management, starting with an in-depth look at how we designed Mirascope, our own toolkit for developing LLM applications, to meet the requirements above.

  1. Mirascope—software engineering-inspired prompt management.
  2. LangSmith—prompt versioning from a central hub.
  3. Agenta—wide compatibility with LLM app frameworks.
  4. Pezzo—visual prompt management and testing.
  5. PromptHub—for collaboration between tech and non-tech roles.

1. Mirascope—Prompt Management Inspired by Software Engineering

Mirascope homepage: An intuitive approach to building with LLMs

Mirascope is a lightweight Python toolkit for building production-grade LLM applications. We built Mirascope as a modular library with scalability in mind. For example, we provide a base prompt class that you can extend as needed, rather than a host of prompt templates for every conceivable use case. Where feasible, our library recommends native Python for many tasks rather than offering complex abstractions that require a learning curve.

Our prompt management system works along similar lines. Among other things, we:

  • Provide a local working directory for your LLM prompts
  • Follow version control best practices
  • Group everything that affects the quality of an LLM call, including the prompt, together as one cohesive unit (colocation)
  • Assure the integrity of inputs to your prompts

These principles are described in detail in the list of Mirascope features below:

Track Changes in Prompts to Efficiently Manage Them

In the following sections, we describe how we make it as easy as possible to track changes to your prompt code by colocating prompts with LLM calls, and by managing prompts with the Mirascope CLI.

Colocating Prompts with LLM Calls

In previous experiences working with prompts, we found that it’s easy to lose oversight if you scatter everything that affects the quality of the LLM call (like the model parameters or other relevant prompt code) throughout the codebase.

We find this to be a common issue with frameworks and libraries that don’t enforce colocation: it becomes harder to trace the effects of changes made to one part of your code, and the scattering introduces unnecessary complexity.

Mirascope therefore makes the LLM call the central primitive around which everything, including the prompt, is versioned. It’s hard to overstate the advantages to the developer experience in doing this, especially when your codebase is making over 100 LLM calls within an enterprise-level LLM application.

An example of colocation is shown below; `call_params` encapsulates parameters needed for making an OpenAI API call within the context of the `MusicProducer` class. 

import os

from mirascope import tags
from mirascope.openai import OpenAICall, OpenAICallParams

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

@tags(["version:0003"])
class MusicProducer(OpenAICall):
    prompt_template = """
    SYSTEM:
    You are an acclaimed music producer.

    USER:
    I'm working on a new song. What do you think?
    {song_lyrics}
    """

    song_lyrics: str

    call_params = OpenAICallParams(model="gpt-4", temperature=0.4)


song_lyrics = "..."
music_producer = MusicProducer(song_lyrics=song_lyrics)

print(music_producer.messages())
# > [{'role': 'system', 'content': 'You are an acclaimed music producer.'}, {'role': 'user', 'content': "I'm working on a new song. What do you think?\n..."}]

feedback = music_producer.call()
print(feedback.content)
# > I think the chorus is catchy, but...

print(music_producer.dump() | feedback.dump())
# {
#     "tags": ["version:0003"],
#     "template": "SYSTEM:\nYou are an acclaimed music producer.\n\nUSER:\nI'm working on a new song. What do you think?\n{song_lyrics}",
#     "inputs": {"song_lyrics": "..."},
#     "start_time": 1710452778501.079,
#     "end_time": 1710452779736.8418,
#     "output": {
#         "id": "chatcmpl-92nBykcXyTpxwAbTEM5BOKp99fVmv",
#         "choices": [
#             {
#                 "finish_reason": "stop",
#                 "index": 0,
#                 "logprobs": None,
#                 "message": {
#                     "content": "I think the chorus is catchy, but...",
#                     "role": "assistant",
#                     "function_call": None,
#                     "tool_calls": None,
#                 },
#             }
#         ],
#         "created": 1710452778,
#         "model": "gpt-4-0613",
#         "object": "chat.completion",
#         "system_fingerprint": None,
#         "usage": {"completion_tokens": 25, "prompt_tokens": 33, "total_tokens": 58},
#     },
# }


You can also see the version number (as provided by our CLI) in the `@tags` decorator near the top of the code sample above. The entire prompt, including the LLM call, gets versioned together as a single unit.

Managing Prompts with a CLI

Mirascope comes with a prompt management CLI right out of the box, allowing you to set up a local prompt development environment with change tracking. Inspired by Alembic, the Mirascope CLI helps you iterate faster on prompts and their calls, providing a structured way to refine and test prompts easily while maintaining a clean and organized environment.

|
|-- mirascope.ini
|-- mirascope
|   |-- prompt_template.j2
|   |-- versions/
|   |   |-- <directory_name>/
|   |   |   |-- version.txt
|   |   |   |-- <revision_id>_<directory_name>.py
|-- prompts/


This local prompt environment contains subdirectories and files for:

  • Configuring your project
  • Tracking changes to prompts
  • Saving prompts as separate versions
  • Utilizing the Jinja2 templating system

Mirascope also provides a set of CLI commands for:

  • Initializing your project and setting up the working directory
  • Committing prompts to the working directory for versioned archival
  • Removing prompts from the working directory
  • Iterating on prompts and their versions
  • Rolling back a prompt to a previous version
  • Checking the status of a prompt

As already mentioned, whenever you add a prompt file to the working directory, Mirascope automatically adds the version number to both the filename and the file itself (via the `@tags` decorator). We use sequential numbered versioning, e.g., 0001 -> 0002 -> 0003, etc.
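The sequential scheme can be sketched as follows. This is not the Mirascope CLI's implementation: the function name `next_revision_id` is hypothetical, and the filename pattern is assumed from the `<revision_id>_<directory_name>.py` layout shown in the directory tree above.

```python
import re


def next_revision_id(existing_filenames: list, width: int = 4) -> str:
    """Return the next zero-padded revision id for files named like '0002_name.py'."""
    revisions = [
        int(match.group(1))
        for name in existing_filenames
        if (match := re.match(r"(\d+)_", name))
    ]
    # Start at 1 when no versioned files exist yet.
    return f"{max(revisions, default=0) + 1:0{width}d}"


print(next_revision_id([]))
# > 0001
print(next_revision_id(["0001_music_producer.py", "0002_music_producer.py"]))
# > 0003
```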

Consistently Ensure Prompt Quality with Built-in Validation

We aim to reduce the number of potential errors where feasible by implementing automatic data validation, in particular type safety in prompts.

To accomplish this, our `BasePrompt` class extends Pydantic’s `BaseModel` to ensure reliable error handling, so that you don’t have to build custom validation logic for handling type errors.

Also, documentation and linting for both Mirascope and Pydantic are available in your IDE to make your development workflows more efficient.

A missing argument in a Mirascope class:

Missing argument in a Mirascope class example

Autosuggestion proposal:

Autosuggestion proposal with Mirascope

You can additionally add custom validation, for example, in cases where you’d like the LLM to do certain validation tasks that would be harder to accomplish with just manual coding.

For instance, you could ensure that certain generated content maintains a consistent brand voice by using the LLM to make the determination, and then adding Pydantic’s `AfterValidator` to the extracted output:

from enum import Enum
from typing import Annotated, Type

from mirascope.openai import OpenAIExtractor
from pydantic import AfterValidator, BaseModel, ValidationError

class Label(Enum):
    ON_BRAND = "on brand"
    OFF_BRAND = "off brand"

# Extractor that asks the LLM to label content as on brand or off brand.
class BrandCompliance(OpenAIExtractor[Label]):
    extract_schema: Type[Label] = Label
    prompt_template = "Does the following content adhere to the brand guidelines? {text}."

    text: str

def validate_brand_compliance(content: str) -> str:
    """Check if the content follows the brand guidelines."""
    label = BrandCompliance(text=content).extract()
    assert label == Label.ON_BRAND, "Content did not adhere to brand guidelines."
    return content

# The validator runs after Pydantic's own type validation of `content`.
class BrandCompliantContent(BaseModel):
    content: Annotated[str, AfterValidator(validate_brand_compliance)]

class ContentReviewer(OpenAIExtractor[BrandCompliantContent]):
    extract_schema: Type[BrandCompliantContent] = BrandCompliantContent
    prompt_template = "Please generate content for our new marketing campaign."

try:
    content = ContentReviewer().extract()
except ValidationError as e:
    print(e)
    # > 1 validation error for BrandCompliantContent
    #   content
    #     Assertion failed, Content did not adhere to brand guidelines. [type=assertion_error, input_value="The generated marketing copy...", input_type=str]
    #       For further information visit https://errors.pydantic.dev/2.6/v/assertion_error


In the code above, `BrandCompliance` checks adherence of the content to the company's brand guidelines, and the `validate_brand_compliance` function validates this adherence using an assertion based on the LLM's output.

The `BrandCompliantContent` model uses `AfterValidator` to apply this custom validation logic.

Test and Evaluate Prompts

Effective prompt management involves not only tracking changes to prompts over time, but also testing them for performance and improving them based on those results.

Mirascope integrates with the Weights & Biases library, which is a tool for visualizing and tracking machine learning experiments. 

For example, you can use the `with_weave` decorator to log your runs to `Weave`, a toolkit that logs, debugs, evaluates, and organizes all aspects of LLM workflows, from experimentation to production, as shown below:

import weave

from mirascope.openai import OpenAICall
from mirascope.wandb import with_weave

weave.init("my-project")

@with_weave
class MovieRecommender(OpenAICall):
    prompt_template = "Please recommend some {genre} movies"

    genre: str

recommender = MovieRecommender(genre="sci-fi")
response = recommender.call()  # this will automatically get logged with weave
print(response.content)


Mirascope also offers `WandbCallMixin` to internally call Weights & Biases’ `Trace()` function (using their original Prompts tool) for logging your runs, as well as additional functions for generating content and using it to extract structured data.

Track Prompt Evolution Through Comprehensive Data Logging

Mirascope’s `.dump()` function allows you to track changes in prompts over time and ensure reproducibility of results under similar experimental conditions. You can see exactly what was sent to the model, helping you understand how changes to input values or parameters affect the model's responses.

`.dump()` outputs a data dictionary of prompts, calls, and responses. For example, for a given prompt, the function might output:

{
    "template": "How can I assist you today, {name}?",
    "inputs": {
        "name": "Alice"
    },
    "tags": ["customer_support", "version:002"],
    "call_params": {
        "model": "gpt-4",
        "max_tokens": 150,
        "temperature": 0.7
    },
    "start_time_ms": 1652928064000,
    "end_time_ms": 1652928065000
}


Each prompt iteration might involve tweaks to the prompt structure, changes in the parameters fed into the model, or adjustments in how the data is processed before being sent to the model. The `.dump()` function serializes these changes into a structured format that can be logged and tracked over time.
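Once dumps are plain dictionaries, tracking changes between iterations can be as simple as logging them and comparing keys. The sketch below uses hand-written dictionaries shaped like the output above rather than real `.dump()` results, and the helper `diff_dumps` is hypothetical.

```python
import json


def diff_dumps(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every top-level key that differs."""
    keys = old.keys() | new.keys()
    return {k: (old.get(k), new.get(k)) for k in sorted(keys) if old.get(k) != new.get(k)}


# Illustrative dumps for two iterations of the same prompt.
v1 = {
    "template": "How can I assist you today, {name}?",
    "tags": ["customer_support", "version:0001"],
    "call_params": {"model": "gpt-4", "temperature": 0.7},
}
v2 = {
    "template": "How can I assist you today, {name}?",
    "tags": ["customer_support", "version:0002"],
    "call_params": {"model": "gpt-4", "temperature": 0.4},
}

# One JSON object per line is easy to append to a log file and search later.
log_lines = [json.dumps(d, sort_keys=True) for d in (v1, v2)]

changes = diff_dumps(v1, v2)
for key, (before, after) in changes.items():
    print(f"{key}: {before} -> {after}")
```

Here only `call_params` and `tags` differ, which tells you the template itself was untouched between the two versions.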

You invoke `.dump()` directly from Mirascope’s `BasePrompt` or any of its subclasses (e.g., `BaseCall`) to obtain information such as the prompt template, its tags, and any parameters specific to the model provider’s API call.

An example of calling `.dump()` is shown below:

import os

from mirascope import tags
from mirascope.openai import OpenAICall

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

@tags(["recipe_project", "version:0001"])
class RecipeRecommender(OpenAICall):
    prompt_template = "Can you recommend some recipes that include {ingredient}?"

    ingredient: str


recommender = RecipeRecommender(ingredient="chocolate")
print(recommender.dump())

"""
Output:
{
    "template": "Can you recommend some recipes that include {ingredient}?",
    "inputs": {"api_key": None, "ingredient": "chocolate"},
    "tags": ["recipe_project", "version:0001"],
    "call_params": {"model": "gpt-4"},
    "start_time_ms": None,
    "end_time_ms": None,
}
"""

recommender.call()
print(recommender.dump())

"""
Output:
{
    # ... same as above
    "start_time_ms": 1709847166609.473,
    "end_time_ms": 1709847169424.146,
}
"""


Other functionality includes:

  • Dumping from responses, which contain the start and end times of the response and the call parameters sent to the API.
  • Combining `BasePrompt.dump()` and `response.dump()` as a union of both outputs, which is useful for comprehensive debugging, as you get a full snapshot of both the request and the response in a single view.
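The union in the last point is Python's dictionary merge operator (`|`, available from Python 3.9), as used in the `MusicProducer` example earlier. A small sketch with hand-written dictionaries standing in for the two dumps (the values are illustrative, not real output):

```python
# Stand-ins for BasePrompt.dump() and response.dump(); values are illustrative.
prompt_dump = {
    "template": "Can you recommend some recipes that include {ingredient}?",
    "tags": ["recipe_project", "version:0001"],
    "start_time_ms": None,
}
response_dump = {
    "start_time_ms": 1709847166609.473,
    "output": {"model": "gpt-4"},
}

# `|` builds a new dict; on duplicate keys the right-hand operand wins.
snapshot = prompt_dump | response_dump

print(snapshot["start_time_ms"])
print(sorted(snapshot))
```

Because the right-hand side takes precedence, putting the response dump second means its populated timing fields replace the `None` placeholders from the prompt dump.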

2. LangSmith—Prompt Versioning from a Central Hub

LangSmith homepage: Get your LLM app from prototype to production

LangSmith is a tool for managing and optimizing the performance of chains and intelligent agents in LLM applications. Its parent framework is LangChain, with which it integrates, although you can use LangSmith on its own.

LangSmith offers a hub (LangChain Hub) that serves as a centralized prompt repository, with functionality for archiving and versioning prompts. To use prompts that are saved to LangChain Hub, you typically use the `pull` command, specifying the prompt to download along with its commit hash (version).

The hub is a prompt management environment to which you push prompts and their changes. It also lets you manage and run chains containing the prompts.

You can find more information about how to manage prompts in LangChain Hub in its documentation and on its website.

3. Agenta—Wide Compatibility with LLM App Frameworks

Agenta homepage: Your Collaborative AI Development Platform

Agenta is an open source LLM application development platform that offers a suite of tools for prompt management and evaluation. 

The platform decouples prompts and the model (together known as the configuration, or prompt management system, which is managed on the backend) from the logic of the LLM application. This allows you to separately test different configurations (using JSON-based test sets) without having to modify your application codebase.
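The general idea of separating configuration from application logic can be sketched in plain Python. This is not Agenta's API: the configuration names, the test set, and the functions (`build_request`, `run_test_set`, `fake_send`) are all hypothetical, and `fake_send` stands in for a real LLM call so the sketch runs offline.

```python
import json
from typing import Callable

# Hypothetical configurations, kept as data rather than hard-coded in the application.
CONFIGS = json.loads("""
{
  "concise": {"model": "gpt-4", "temperature": 0.2, "prompt": "Summarize in one sentence: {text}"},
  "detailed": {"model": "gpt-4", "temperature": 0.7, "prompt": "Summarize in detail: {text}"}
}
""")

# Hypothetical JSON-based test set: one dict of inputs per test case.
TEST_SET = json.loads('[{"text": "Prompt versioning keeps experiments reproducible."}]')


def build_request(config: dict, inputs: dict) -> dict:
    """Application logic: turn a configuration plus inputs into a model request."""
    return {
        "model": config["model"],
        "temperature": config["temperature"],
        "prompt": config["prompt"].format(**inputs),
    }


def run_test_set(config: dict, send: Callable[[dict], str]) -> list:
    """Run every test case against one configuration."""
    return [send(build_request(config, case)) for case in TEST_SET]


def fake_send(request: dict) -> str:
    """Stand-in for a real LLM call."""
    return f"[{request['model']} @ {request['temperature']}] {request['prompt']}"


for name, config in CONFIGS.items():
    print(name, run_test_set(config, fake_send))
```

Swapping in a different configuration changes what gets sent to the model without touching `build_request` or anything else in the application code.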

Agenta offers a playground where you can experiment with prompts and applications (which are treated as microservices). You can version each of these combinations, known as application variants, to ease application development.

For more information on how Agenta’s prompt management works, you can consult its user documentation or see details on GitHub.

4. Pezzo—Visual Prompt Management and Testing

Pezzo homepage: The Developer-First AI Platform

Pezzo is an open source LLMOps platform designed for prompt engineering, and offers a number of GUI-based prompt management features.

  • You create prompts by using the Prompt Editor GUI, and specify settings such as temperature and max response length.
  • The platform allows you to test prompts to assess results such as cost, token usage, and completion duration.
  • You can version prompts by committing them and publishing the version to a given environment, like a production environment. Pezzo uses SHA hashes as version numbers.
  • Once versioned, you can revert a prompt to a previous version or view the version history of a prompt.
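Hash-based versioning with publishing to environments can be sketched as follows. This is not Pezzo's actual implementation: the choice of SHA-256, the fields that go into the hash, and the names `commit_prompt` and `publish` are assumptions made for illustration.

```python
import hashlib
import json

commits = {}       # version hash -> committed prompt and settings
environments = {}  # environment name -> currently published version hash


def commit_prompt(content: str, settings: dict) -> str:
    """Store a prompt and return a version id derived from its content and settings."""
    payload = json.dumps({"content": content, "settings": settings}, sort_keys=True)
    version = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    commits[version] = {"content": content, "settings": settings}
    return version


def publish(version: str, environment: str) -> None:
    """Point an environment at a committed version."""
    if version not in commits:
        raise KeyError(f"Unknown version: {version}")
    environments[environment] = version


v1 = commit_prompt("Summarize: {text}", {"temperature": 0.2, "max_tokens": 100})
v2 = commit_prompt("Summarize briefly: {text}", {"temperature": 0.2, "max_tokens": 100})

publish(v2, "production")
publish(v1, "production")  # reverting is just publishing an earlier version again

print(commits[environments["production"]]["content"])
# > Summarize: {text}
```

A side effect of deriving the id from the content is that committing an identical prompt twice yields the same version, in contrast to the sequential numbering described for Mirascope above.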

Pezzo describes its prompt management capabilities in its documentation on its website.

5. PromptHub—For Collaboration Between Tech & Non-Tech Roles

PromptHub homepage: Level up your prompt management

PromptHub is a prompt management platform for teams. It lets you test, collaborate, and deploy prompts, and features built-in prompt versioning, comparison, and approval workflows.

The SaaS platform offers a Git-like prompt versioning system based on SHA hashes, allowing you to commit prompts and open merge requests to collaborate on prompt design, all within a central GUI. It also offers functionality for comparing prompt versions side by side, and approving or rejecting changes.
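PromptHub does this comparison in its GUI. For a rough sense of what comparing two prompt versions involves, Python's standard `difflib` module can produce a similar view in code; the two prompt texts below are made up for illustration.

```python
import difflib

# Two hypothetical versions of the same prompt.
old = "You are a helpful assistant.\nAnswer briefly.\n"
new = "You are a helpful assistant.\nAnswer briefly and cite sources.\n"

diff = list(
    difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="v1",
        tofile="v2",
        lineterm="",
    )
)
print("\n".join(diff))
```

Lines prefixed with `-` exist only in the old version and lines prefixed with `+` only in the new one, which is the information a reviewer needs to approve or reject a change.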

All changes are logged and team members are automatically notified. PromptHub also provides an API letting you access your prompts from any other application.

You can find more information on PromptHub in its documentation and on its website.

Discover the Benefits of Developer-Friendly Prompt Management

Mirascope’s prompt versioning functionality was built from the ground up with software engineering best practices in mind. It offers an accessible interface for you to easily track changes across different versions of prompts to improve collaboration, and to support ongoing prompt experimentation and iteration.

Want to learn more? You can find more Mirascope code samples on both our documentation site and on GitHub.
