The experiments client methods are currently in BETA. The API may change without notice. A one-time warning is emitted on first use.
Track and evaluate changes to prompts, models, and retrieval strategies. Run experiments with automatic tracing and evaluation.

Key Capabilities

  • Automatic tracing of all LLM calls during experiments
  • Concurrent execution for faster evaluation
  • Dry-run mode for testing without logging
  • Built-in evaluator support
  • Compare experiments side-by-side in the UI

List Experiments

List all experiments, optionally filtered by dataset or space.
resp = client.experiments.list(
    dataset="dataset-name-or-id",   # optional
    space="your-space-name-or-id",  # optional
    limit=50,
)

for experiment in resp.experiments:
    print(experiment.id, experiment.name)
For details on pagination, field introspection, and data conversion (to dict/JSON/DataFrame), see Response Objects.

Create an Experiment

Log pre-computed experiment results to Arize. Use this when you’ve already executed your experiment elsewhere and want to record the results. Unlike run(), this method does not execute the task; it only logs existing results.
from arize.experiments import (
    ExperimentTaskFieldNames,
    EvaluationResultFieldNames,
)

experiment_runs = [
    {
        "example_id": "ex-1",
        "output": "Paris is the capital of France",
        "latency_ms": 245,
        "correctness_score": 1.0,
        "correctness_label": "correct",
    },
    {
        "example_id": "ex-2",
        "output": "William Shakespeare wrote Romeo and Juliet",
        "latency_ms": 198,
        "correctness_score": 1.0,
        "correctness_label": "correct",
    },
]

task_fields = ExperimentTaskFieldNames(
    example_id="example_id",
    output="output",
)

evaluator_columns = {
    "Correctness": EvaluationResultFieldNames(
        score="correctness_score",
        label="correctness_label",
    )
}

experiment = client.experiments.create(
    name="pre-computed-experiment",
    dataset="dataset-name-or-id",
    experiment_runs=experiment_runs,
    task_fields=task_fields,
    evaluator_columns=evaluator_columns,
)
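Because create() only logs what you give it, a column missing from a run dictionary is easy to overlook. The following is a minimal sketch of a pre-flight check, assuming the column names used in the example above. missing_columns is a hypothetical helper written for illustration; it is not part of the Arize SDK.

```python
from typing import Any


def missing_columns(
    runs: list[dict[str, Any]], required: list[str]
) -> dict[int, list[str]]:
    """Map each run index to the required columns it lacks (hypothetical helper)."""
    problems: dict[int, list[str]] = {}
    for index, run in enumerate(runs):
        absent = [column for column in required if column not in run]
        if absent:
            problems[index] = absent
    return problems


# Column names mirror the task_fields and evaluator_columns mappings above.
required_columns = ["example_id", "output", "correctness_score", "correctness_label"]

sample_runs = [
    {
        "example_id": "ex-1",
        "output": "Paris",
        "correctness_score": 1.0,
        "correctness_label": "correct",
    },
    {"example_id": "ex-2", "output": "Shakespeare"},  # evaluation columns missing
]

# An empty dict means every run carries all required columns.
print(missing_columns(sample_runs, required_columns))
```

Running a check like this before calling create() surfaces incomplete rows locally instead of after the upload.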

Get an Experiment

Retrieve experiment details and metadata by name or ID. When using a name, provide dataset and optionally space to disambiguate.
experiment = client.experiments.get(
    experiment="experiment-name-or-id",
    dataset="dataset-name-or-id",  # required when using a name
)

print(experiment)

Delete an Experiment

Delete an experiment by name or ID. This operation is irreversible, and the call returns no value.
client.experiments.delete(
    experiment="experiment-name-or-id",
    dataset="dataset-name-or-id",  # required when using a name
)

print("Experiment deleted successfully")

Run an Experiment

Execute a task function across your dataset examples with automatic evaluation, then log the results to Arize. High-level flow:
  1. Resolve the dataset and download examples (cached if enabled)
  2. Execute the task and evaluators with configurable concurrency
  3. Upload results to Arize (unless in dry-run mode)
# Define your task
import openai

def answer_question(dataset_row):
    invention = dataset_row.get("attributes.input.value")  # example: "Telephone"
    openai_client = openai.OpenAI()

    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Who invented {invention}?"}],
        max_tokens=20,
    )

    return response.choices[0].message.content

# Define evaluators (optional)
from arize.experiments import EvaluationResult

def is_correct(output, dataset_row):
    expected = dataset_row.get("attributes.output.value")
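    # Guard against a missing expected value or an empty model output,
    # either of which would make the substring check raise a TypeError.
    correct = bool(expected) and bool(output) and expected in output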
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation="Evaluator explanation here"
    )

# Run an experiment
experiment, experiment_df = client.experiments.run(
    name="prompt-v2-experiment",
    dataset="dataset-name-or-id",
    task=answer_question,
    evaluators=[is_correct],
)

print(f"Experiment: {experiment}")
print(f"Results DataFrame shape: {experiment_df.shape}")
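Before spending API calls on a full run, it can help to exercise the evaluator's matching logic against a hand-written row. This is a sketch that assumes the same "attributes.output.value" column as above. score_output is a hypothetical stand-in that returns a plain (score, label) tuple instead of an EvaluationResult, so it runs without the SDK installed.

```python
def score_output(output, dataset_row):
    """Return (score, label) using the substring check from is_correct."""
    expected = dataset_row.get("attributes.output.value")
    # Treat a missing expected value or an empty output as incorrect
    # rather than letting the substring check raise.
    correct = bool(expected) and bool(output) and expected in output
    return int(correct), "correct" if correct else "incorrect"


sample_row = {
    "attributes.input.value": "Telephone",
    "attributes.output.value": "Alexander Graham Bell",
}

print(score_output("Alexander Graham Bell invented the telephone.", sample_row))
print(score_output("Thomas Edison", sample_row))
print(score_output(None, sample_row))
```

Note that a substring check is strict about wording: an answer such as "A. G. Bell" would be scored as incorrect.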

Dry Run Mode

Execute your experiment locally without logging results to Arize. Use this to test your task and evaluators before committing to a full run.
experiment, experiment_df = client.experiments.run(
    ...,
    dry_run=True,  # Test locally without logging
    dry_run_count=10,  # Only run on first 10 examples
)

# Note: experiment is None in dry-run mode
print(f"Results DataFrame shape: {experiment_df.shape}")

Concurrency Control

Set concurrency to control how many examples run in parallel; higher values can shorten the overall run.
experiment, experiment_df = client.experiments.run(
    ...,
    concurrency=10,  # Run 10 examples in parallel
)
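The effect of the concurrency setting can be pictured with the standard library's thread pool. This only illustrates applying a task to several examples at once; it is not how the SDK schedules work internally. run_concurrently and fake_task are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_task(dataset_row):
    """Stand-in for a real task; a real one would call an LLM here."""
    return f"answer for {dataset_row['id']}"


def run_concurrently(task, rows, concurrency=3):
    """Apply task to every row with at most `concurrency` worker threads.

    Executor.map returns results in input order, regardless of which
    worker finishes first.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(task, rows))


demo_rows = [{"id": f"ex-{i}"} for i in range(1, 6)]
demo_outputs = run_concurrently(fake_task, demo_rows, concurrency=3)
print(demo_outputs)
```

When the task calls an external model provider, a high concurrency value may run into that provider's rate limits, so it is usually worth increasing it gradually.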

Error Handling

Set exit_on_error=True to stop execution on the first error encountered.
experiment, experiment_df = client.experiments.run(
    ...,
    exit_on_error=True,  # Stop on first error
)
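The difference between the two behaviors can be sketched in plain Python. This assumes that, without the flag, a failing example is recorded and the remaining examples still run; the text above implies this but does not state it. run_tasks and flaky_task are hypothetical and do not reflect the SDK's internals.

```python
def run_tasks(task, rows, exit_on_error=False):
    """Run task over rows; re-raise the first failure or record it and continue."""
    outputs, errors = [], []
    for row in rows:
        try:
            outputs.append(task(row))
        except Exception as exc:
            if exit_on_error:
                raise  # stop immediately and surface the original error
            outputs.append(None)  # keep outputs aligned with rows
            errors.append((row["id"], repr(exc)))
    return outputs, errors


def flaky_task(row):
    """Fails on one specific example to simulate a task error."""
    if row["id"] == "ex-2":
        raise ValueError("simulated failure")
    return row["id"].upper()


error_rows = [{"id": "ex-1"}, {"id": "ex-2"}, {"id": "ex-3"}]

task_outputs, task_errors = run_tasks(flaky_task, error_rows)
print(task_outputs)
print(task_errors)
```

Stopping early is mostly useful while developing a task, when a single failure usually means every example would fail for the same reason.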

OpenTelemetry Tracing

Set the global OpenTelemetry tracer provider for the experiment run.
experiment, experiment_df = client.experiments.run(
    ...,
    set_global_tracer_provider=True,  # Enable global OTel tracing
)

List Experiment Runs

Retrieve individual runs from an experiment with pagination support. Pass all=True to fetch all runs via Flight (ignores limit).
resp = client.experiments.list_runs(
    experiment="experiment-name-or-id",
    dataset="dataset-name-or-id",  # required when using a name
    limit=100,
)

for run in resp.experiment_runs:
    print(run)
For details on pagination, field introspection, and data conversion (to dict/JSON/DataFrame), see Response Objects.

Append Experiment Runs

Append new runs to an existing experiment. Runs are inserted in input order. Provide between 1 and 1000 runs per request. Each run must include example_id (an existing dataset example) and output; additional user-defined fields (e.g. latency_ms, model) are allowed.
new_runs = [
    {
        "example_id": "ex-3",
        "output": "Marie Curie won two Nobel Prizes",
        "latency_ms": 312,
    },
]

result = client.experiments.append_runs(
    experiment="experiment-name-or-id",
    dataset="dataset-name-or-id",  # required when using a name
    experiment_runs=new_runs,
)

print(result.run_ids)
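With more than 1000 runs, the list has to be split across several requests. The following sketch checks the two required fields and yields batches that respect the limit stated above. batch_runs is a hypothetical helper, not an SDK function.

```python
from typing import Any, Iterator

MAX_RUNS_PER_REQUEST = 1000  # per-request limit stated above


def batch_runs(
    runs: list[dict[str, Any]], batch_size: int = MAX_RUNS_PER_REQUEST
) -> Iterator[list[dict[str, Any]]]:
    """Yield consecutive batches after checking each run has example_id and output."""
    if not 1 <= batch_size <= MAX_RUNS_PER_REQUEST:
        raise ValueError("batch_size must be between 1 and 1000")
    for index, run in enumerate(runs):
        missing = {"example_id", "output"} - set(run)
        if missing:
            raise ValueError(f"run {index} is missing {sorted(missing)}")
    for start in range(0, len(runs), batch_size):
        yield runs[start:start + batch_size]


many_runs = [
    {"example_id": f"ex-{i}", "output": f"answer {i}"} for i in range(2500)
]
batch_sizes = [len(batch) for batch in batch_runs(many_runs)]
print(batch_sizes)
```

Each yielded batch could then be passed as experiment_runs to a separate append_runs call; because batches are produced in input order, the overall insertion order is preserved.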

Annotate Experiment Runs

Write human annotations to a batch of runs in an experiment. Annotations are upserted by annotation config name for each run; submitting the same name for the same run overwrites the previous value. Up to 1000 runs may be annotated per request. This method returns None on success.
from arize.experiments.types import AnnotateRecordInput, AnnotationInput

client.experiments.annotate_runs(
    experiment="experiment-name-or-id",
    dataset="dataset-name-or-id",  # optional, used to resolve experiment by name
    space="your-space-name-or-id",  # optional, used to resolve dataset by name
    annotations=[
        AnnotateRecordInput(
            record_id="your-run-id",
            values=[
                AnnotationInput(name="accuracy", label="correct", score=1.0),
                AnnotationInput(name="notes", text="Well-structured output"),
            ],
        ),
    ],
)
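The upsert behavior described above can be modeled with a dictionary keyed by run ID and annotation name. This is only an illustration of the last-write-wins semantics using plain dicts; upsert_annotations is hypothetical and says nothing about how Arize stores annotations.

```python
def upsert_annotations(store, run_id, values):
    """Merge values into store[run_id], keyed by annotation name (last write wins)."""
    per_run = store.setdefault(run_id, {})
    for value in values:
        per_run[value["name"]] = value
    return store


annotation_store = {}
upsert_annotations(
    annotation_store,
    "run-1",
    [
        {"name": "accuracy", "label": "correct", "score": 1.0},
        {"name": "notes", "text": "Well-structured output"},
    ],
)

# Re-submitting "accuracy" for the same run replaces the earlier value,
# while the untouched "notes" annotation is kept.
upsert_annotations(
    annotation_store,
    "run-1",
    [{"name": "accuracy", "label": "incorrect", "score": 0.0}],
)

print(annotation_store["run-1"]["accuracy"]["label"])
print(sorted(annotation_store["run-1"]))
```

In practice this means a correction can be sent as a new annotate_runs call with the same annotation name, without deleting the earlier value first.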
Learn more: Experiments Documentation