Create and manage LLM-as-judge evaluators and their versions programmatically. Evaluators use prompt templates with {{variable}} placeholders that reference span or trace attributes to automatically score your LLM application’s outputs.

Key Capabilities

  • Create template-based LLM-as-judge evaluators within a space
  • Version evaluators with commit messages (versions are immutable once created)
  • Retrieve evaluators with their latest or a specific version
  • List, update, and delete evaluators
  • List and retrieve individual evaluator versions

List Evaluators

Evaluator operations are currently in ALPHA. A one-time warning is emitted on first use.
List all evaluators you have access to, with optional filtering by space.
resp = client.evaluators.list(
    space="your-space-name-or-id",  # optional
    name="Relevance",               # optional substring filter
    limit=50,
)

for evaluator in resp.evaluators:
    print(evaluator.id, evaluator.name)
For details on pagination, field introspection, and data conversion (to dict/JSON/DataFrame), see Response Objects.

Create an Evaluator

Create a new evaluator with an initial version. Evaluator names must be unique within the target space.
from arize._generated.api_client.models import TemplateConfig, EvaluatorLlmConfig

evaluator = client.evaluators.create(
    name="Relevance",
    space="your-space-name-or-id",
    commit_message="Initial version",
    description="Scores whether the response is relevant to the query",
    template_config=TemplateConfig(
        name="Relevance",
        template="Is the following response relevant to the query?\nQuery: {{input.value}}\nResponse: {{output.value}}",
        include_explanations=True,
        use_function_calling_if_available=True,
        classification_choices={"relevant": 1, "irrelevant": 0},
        direction="maximize",
        llm_config=EvaluatorLlmConfig(
            ai_integration_id="your-ai-integration-id",
            model_name="gpt-4o",
            invocation_parameters={"temperature": 0},
        ),
    ),
)

print(evaluator.id, evaluator.name)

Template Variables

Template strings use {{variable}} placeholders that reference span or trace attributes (e.g., {{input.value}}, {{output.value}}, {{attributes.my_custom_attr}}).
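For intuition, placeholder resolution can be pictured as a substitution over a flat mapping of span attributes. The `render` helper and the attribute values below are a hypothetical sketch, not part of the Arize client:

```python
import re

def render(template: str, attributes: dict) -> str:
    """Replace each {{path}} placeholder with the matching attribute value.

    Illustrative only: unknown placeholders are left intact rather than
    raising, mirroring a lenient substitution strategy.
    """
    def substitute(match: re.Match) -> str:
        path = match.group(1).strip()
        value = attributes.get(path)
        return str(value) if value is not None else match.group(0)

    return re.sub(r"\{\{(.*?)\}\}", substitute, template)

# Example span attributes a template might reference
span_attributes = {
    "input.value": "What is the capital of France?",
    "output.value": "Paris is the capital of France.",
}

prompt = render(
    "Query: {{input.value}}\nResponse: {{output.value}}",
    span_attributes,
)
print(prompt)
```

Custom attributes follow the same pattern: a placeholder like {{attributes.my_custom_attr}} would simply be another key in the mapping.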

Classification vs. Freeform Output

  • Classification — Provide classification_choices as a dict[str, float] mapping label → numeric score (e.g., {"relevant": 1, "irrelevant": 0}). The evaluator outputs one of these labels along with its score.
  • Freeform — Omit classification_choices. The evaluator produces a numeric score without predefined labels.
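Conceptually, classification mode is a label-to-score lookup: the judge picks one of the predefined labels, and the score is read from the mapping. The `score_for_label` helper below is a hypothetical illustration of that lookup, not part of the client:

```python
# Same shape as the classification_choices argument shown above
classification_choices = {"relevant": 1.0, "irrelevant": 0.0}

def score_for_label(label: str, choices: dict[str, float]) -> float:
    """Map the judge's chosen label to its numeric score."""
    if label not in choices:
        raise ValueError(f"Label {label!r} is not one of {sorted(choices)}")
    return choices[label]

# The LLM judge emits one of the predefined labels; both the label
# and the corresponding score are recorded.
print(score_for_label("relevant", classification_choices))
```

With direction="maximize", higher scores are treated as better, so ordering the mapping's values to match your notion of quality matters.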

Get an Evaluator

Retrieve an evaluator by name or ID. By default the latest version is returned. When using a name, provide space to disambiguate.
evaluator = client.evaluators.get(evaluator="your-evaluator-name-or-id")

print(evaluator.id, evaluator.name)
print(evaluator.version)

Get a Specific Version

evaluator = client.evaluators.get(
    evaluator="your-evaluator-name-or-id",
    version_id="specific-version-id",
)

Update an Evaluator

Update an evaluator’s metadata (name and/or description). To change the template configuration, create a new version instead.
evaluator = client.evaluators.update(
    evaluator="your-evaluator-name-or-id",
    name="Relevance v2",
    description="Updated description",
)

print(evaluator)

Delete an Evaluator

Delete an evaluator and all of its versions. This operation is irreversible and returns no response body.
client.evaluators.delete(evaluator="your-evaluator-name-or-id")

print("Evaluator deleted successfully")

Manage Versions

Evaluator versions are immutable once created. To change the template configuration, create a new version — it becomes the latest version immediately.

List Versions

List all versions for an evaluator.
resp = client.evaluators.list_versions(
    evaluator="your-evaluator-name-or-id",
    limit=50,
)

for version in resp.evaluator_versions:
    print(version.id, version.commit_message)
For details on pagination, field introspection, and data conversion (to dict/JSON/DataFrame), see Response Objects.

Get a Version

Retrieve a specific evaluator version by its ID.
version = client.evaluators.get_version(version_id="your-version-id")

print(version.id, version.commit_message)

Create a New Version

Add a new version to an existing evaluator. The new version becomes the latest immediately.
from arize._generated.api_client.models import TemplateConfig, EvaluatorLlmConfig

version = client.evaluators.create_version(
    evaluator="your-evaluator-name-or-id",
    commit_message="Improved prompt for edge cases",
    template_config=TemplateConfig(
        name="Relevance",
        template="Rate the relevance of the response on a scale of 0 to 1.\nQuery: {{input.value}}\nResponse: {{output.value}}",
        include_explanations=True,
        use_function_calling_if_available=True,
        classification_choices={"relevant": 1, "irrelevant": 0},
        direction="maximize",
        llm_config=EvaluatorLlmConfig(
            ai_integration_id="your-ai-integration-id",
            model_name="gpt-4o",
            invocation_parameters={"temperature": 0},
        ),
    ),
)

print(version.id)
Learn more: Online Evaluations Documentation