Skip to main content

Test LLM applications (Python only)

LangSmith functional tests are assertions and expectations designed to quickly identify obvious bugs and regressions in your AI system. Relative to evaluations, tests typically are designed to be fast and cheap to run, focusing on specific functionality and edge cases. We recommend using LangSmith to track any unit tests, end-to-end integration tests, or other specific assertions that touch an LLM or other non-deterministic part of your AI system. These should run on every commit in your CI pipeline to catch regressions early.

note

@unit currently requires langsmith python version >=0.1.74 (named @unit for versions >=0.1.42). If you are interested in unit testing functionality in TypeScript or other languages, please let us know at support@langchain.dev.

Write a @unit

To write a LangSmith functional test, decorate your test function with @unit. If you want to track the full nested trace of the system or component being tested, you can mark those functions with @traceable. For example:

# my_app/main.py
from langsmith import traceable

@traceable # Optional
def generate_sql(user_query):
# Replace with your SQL generation logic
# e.g., my_llm(my_prompt.format(user_query))
return "SELECT * FROM customers"

Then define your test:

# tests/test_my_app.py
from langsmith import test
from my_app.main import generate_sql

@unit
def test_sql_generation_select_all():
user_query = "Get all users from the customers table"
sql = generate_sql(user_query)
# LangSmith logs any exception raised by `assert` / `pytest.fail` / `raise` / etc.
# as a test failure
assert sql == "SELECT * FROM customers"

Run tests

You can use a standard testing framework such as pytest (docs) to run. For example:

pytest tests/

Each time you run this test suite, LangSmith collects the pass/fail rate and other traces as a new TestSuiteResult, logging the pass rate (1 for pass, 0 for fail) over all the applicable tests.

The test suite syncs to a corresponding dataset named after your package or github repository.

Test Example

Going further

@unit is designed to stay out of your way and works well with familiar pytest features. For example:

Defining inputs as fixtures

Pytest fixtures let you define functions that serve as reusable inputs for your tests. LangSmith automatically syncs any test case inputs defined as fixtures. For example:

import pytest

@pytest.fixture
def user_query():
return "Get all users from the customers table"

@pytest.fixture
def expected_sql():
return "SELECT * FROM customers"

# output_keys indicate which test arguments to save as 'outputs' in the dataset (Optional)
# Otherwise, all arguments are saved as 'inputs'
@unit(output_keys=["expected_sql"])
def test_sql_generation_with_fixture(user_query, expected_sql):
sql = generate_sql(user_query)
assert sql == expected_sql

Parametrizing tests

Parametrizing tests lets you run the same assertions across multiple sets of inputs. Use pytest's parametrize decorator to achieve this. For example:

@unit
@pytest.mark.parametrize(
"user_query, expected_sql",
[
("Get all users from the customers table", "SELECT * FROM customers"),
("Get all users from the orders table", "SELECT * FROM orders"),
],
)
def test_sql_generation_parametrized(user_query, expected_sql):
sql = generate_sql(user_query)
assert sql == expected_sql

Note: as the parametrized list grows, you may consider using evaluate() instead. This parallelizes the evaluation and makes it easier to control individual experiments and the corresponding dataset.

Expectations

LangSmith provides an expect utility to help define expectations about your LLM output. For example:

from langsmith import expect

@unit
def test_sql_generation_select_all():
user_query = "Get all users from the customers table"
sql = generate_sql(user_query)
expect(sql).to_contain("customers")

This will log the binary "expectation" score to the experiment results, additionally asserting that the expectation is met possibly triggering a test failure.

expect also provides "fuzzy match" methods. For example:

@unit
@pytest.mark.parametrize(
"query, expectation",
[
("what's the capital of France?", "Paris"),
],
)
def test_embedding_similarity(query, expectation):
prediction = my_chatbot(query)
expect.embedding_distance(
# This step logs the distance as feedback for this run
prediction=prediction, expectation=expectation
# Adding a matcher (in this case, 'to_be_*"), logs 'expectation' feedback
).to_be_less_than(0.5) # Optional predicate to assert against
expect.edit_distance(
# This computes the normalized Damerau-Levenshtein distance between the two strings
prediction=prediction, expectation=expectation
# If no predicate is provided below, 'assert' isn't called, but the score is still logged
)

This test case will be assigned 4 scores:

  1. The embedding_distance between the prediction and the expectation
  2. The binary expectation score (1 if cosine distance is less than 0.5, 0 if not)
  3. The edit_distance between the prediction and the expectation
  4. The overall test pass/fail score (binary)

The expect utility is modeled off of Jest's expect API, with some off-the-shelf functionality to make it easier to grade your LLMs.

Dry-run mode

If you want to run the tests without syncing the results to LangSmith, you can set LANGCHAIN_TEST_TRACKING=false in your environment.

LANGCHAIN_TEST_TRACKING=false pytest tests/

The tests will run as normal, but the experiment logs will not be sent to LangSmith.

Caching

LLMs on every commit in CI can get expensive. To save time and resources, LangSmith lets you cache results to disk. Any identical inputs will be loaded from the cache so you don't have to call out to your LLM provider unless there are changes to the model, prompt, or retrieved data.

To enable caching, run with LANGCHAIN_TEST_CACHE=/my/cache/path. For example:

LANGCHAIN_TEST_CACHE=tests/cassettes pytest tests/my_llm_tests

All requests will be cached to tests/cassettes and loaded from there on subsequent runs. If you check this in to your repository, your CI will be able to use the cache as well.

Using watch mode

With caching enabled, you can iterate quickly on your tests using watch mode without worrying about unnecessarily hitting your LLM provider. For example, using pytest-watch:

pip install pytest-watch
LANGCHAIN_TEST_CACHE=tests/cassettes ptw tests/my_llm_tests

Explanations

The @unit test decorator converts any test into a parametrized LangSmith example. By default, all tests within a given file will be grouped as a single "test suite" with a corresponding dataset. You can configure which test suite a test belongs to by passing the test_suite_name parameter to @unit.

The following metrics are available off-the-shelf:

FeedbackDescriptionExample
passBinary pass/fail score, 1 for pass, 0 for failassert False # Fails
expectationBinary expectation score, 1 if expectation is met, 0 if notexpect(prediction).against(lambda x: re.search(r"\b[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}\b", x) )
embedding_distanceCosine distance between two embeddingsexpect.embedding_distance(prediction=prediction, expectation=expectation)
edit_distanceEdit distance between two stringsexpect.edit_distance(prediction=prediction, expectation=expectation)

You can also log any arbitrary feeback within a unit test manually using the client.

from langsmith import unit, Client
from langsmith.run_helpers import get_current_run_tree

client = Client()

@unit
def test_foo():
run_tree = get_current_run_tree()
client.create_feedback(run_id=run_tree.id, key="my_custom_feedback", score=1)

Reference

expect

expect makes it easy to make approximate assertions on test results and log scores to LangSmith. Off-the-shelf, it allows you to compute and compare embedding distances, edit distances, and make custom assertions on values.

expect.embedding_distance(prediction, reference, *, config=None)

Compute the embedding distance between the prediction and reference.

This logs the embedding distance to LangSmith and returns a Matcher instance for making assertions on the distance value.

By default, this uses the OpenAI API for computing embeddings.

Parameters

  • prediction (str): The predicted string to compare.
  • reference (str): The reference string to compare against.
  • config (Optional[EmbeddingConfig]): Optional configuration for the embedding distance evaluator. Supported options:
    • encoder: A custom encoder function to encode the list of input strings to embeddings. Defaults to the OpenAI API.
    • metric: The distance metric to use for comparison. Supported values: "cosine", "euclidean", "manhattan", "chebyshev", "hamming".

Returns

A Matcher instance for the embedding distance value.

expect.edit_distance(prediction, reference, *, config=None)

Compute the string distance between the prediction and reference.

This logs the string distance (Damerau-Levenshtein) to LangSmith and returns a Matcher instance for making assertions on the distance value.

This depends on the rapidfuzz package for string distance computation.

Parameters

  • prediction (str): The predicted string to compare.
  • reference (str): The reference string to compare against.
  • config (Optional[EditDistanceConfig]): Optional configuration for the string distance evaluator. Supported options:
    • metric: The distance metric to use for comparison. Supported values: "damerau_levenshtein", "levenshtein", "jaro", "jaro_winkler", "hamming", "indel".
    • normalize_score: Whether to normalize the score between 0 and 1.

Returns

A Matcher instance for the string distance value.

expect.value(value)

Create a Matcher instance for making assertions on the given value.

Parameters

  • value (Any): The value to make assertions on.

Returns

A Matcher instance for the given value.

Matcher

A class for making assertions on expectation values.

to_be_less_than(value)

Assert that the expectation value is less than the given value.

to_be_greater_than(value)

Assert that the expectation value is greater than the given value.

to_be_between(min_value, max_value)

Assert that the expectation value is between the given min and max values.

to_be_approximately(value, precision=2)

Assert that the expectation value is approximately equal to the given value.

to_equal(value)

Assert that the expectation value equals the given value.

to_contain(value)

Assert that the expectation value contains the given value.

against(func)

Assert the expectation value against a custom function.

test API

The @unit decorator is used to mark a function as a test case for LangSmith. It ensures that the necessary example data is created and associated with the test function. The decorated function will be executed as a test case, and the results will be recorded and reported by LangSmith.

@unit(id=None, output_keys=None, client=None, test_suite_name=None)

Create a test case in LangSmith.

Parameters

  • id (Optional[uuid.UUID]): A unique identifier for the test case. If not provided, an ID will be generated based on the test function's module and name.
  • output_keys (Optional[Sequence[str]]): A list of keys to be considered as the output keys for the test case. These keys will be extracted from the test function's inputs and stored as the expected outputs.
  • client (Optional[ls_client.Client]): An instance of the LangSmith client to be used for communication with the LangSmith service. If not provided, a default client will be used.
  • test_suite_name (Optional[str]): The name of the test suite to which the test case belongs. If not provided, the test suite name will be determined based on the environment or the package name.

Environment Variables

  • LANGSMITH_TEST_TRACKING: Set this variable to the path of a directory to enable caching of test results. This is useful for re-running tests without re-executing the code. Requires the 'langsmith[vcr]' package.

Was this page helpful?


You can leave detailed feedback on GitHub.