Create evaluation tasks that score spans in a project continuously or on demand, or that evaluate examples in a dataset using your LLM-as-judge evaluators.
Key Capabilities
- Create project-based tasks that run continuously against live spans
- Create dataset-based tasks that evaluate experiment results
- Trigger on-demand task runs with custom data windows
- Poll task runs until completion with configurable timeout
- Cancel in-progress runs
- List and filter task runs by status
The tasks operations are currently in ALPHA. A one-time warning is emitted on first use.
List Tasks
List tasks you have access to, with optional filtering by space, project, dataset, or type.
resp = client.tasks.list(
    space="your-space-name-or-id",  # optional
    limit=50,
)
for task in resp.tasks:
    print(task.id, task.name)
Filter by task type:
resp = client.tasks.list(
    space="your-space-name-or-id",
    task_type="template_evaluation",
)
Valid values for task_type are "template_evaluation" and "code_evaluation".
For details on pagination, field introspection, and data conversion (to dict/JSON/DataFrame), see Response Objects.
Get a Task
Retrieve a task by name or ID. When using a name, provide space to disambiguate.
task = client.tasks.get(
    task="your-task-name-or-id",
    space="your-space-name-or-id",  # required when using a name
)
print(task.id, task.name)
Create a Task
Create a new evaluation task. Tasks can target either a project (live spans) or a dataset (experiment results).
Project-Based Task
A project-based task continuously evaluates incoming spans. Set is_continuous=True to run the task on every new span, or False to run it only on demand.
from arize._generated.api_client.models import TasksCreateRequestEvaluatorsInner

task = client.tasks.create(
    name="Relevance Monitor",
    task_type="template_evaluation",
    project="your-project-name-or-id",
    evaluators=[
        TasksCreateRequestEvaluatorsInner(
            evaluator_id="your-evaluator-id",
        ),
    ],
    is_continuous=True,
    sampling_rate=0.1,  # Evaluate 10% of spans
)
print(task.id)
Dataset-Based Task
A dataset-based task evaluates examples from one or more experiments. At least one entry in experiment_ids is required.
task = client.tasks.create(
    name="Experiment Evaluation",
    task_type="template_evaluation",
    dataset="your-dataset-name-or-id",
    experiment_ids=["experiment-id-1", "experiment-id-2"],
    evaluators=[
        TasksCreateRequestEvaluatorsInner(
            evaluator_id="your-evaluator-id",
        ),
    ],
    is_continuous=False,
)
print(task.id)
Column Mappings and Filters
Each evaluator in the task can have its own column mappings (to map template variables to span attribute names) and a per-evaluator query filter.
task = client.tasks.create(
    name="Custom Relevance",
    task_type="template_evaluation",
    project="your-project-name-or-id",
    evaluators=[
        TasksCreateRequestEvaluatorsInner(
            evaluator_id="your-evaluator-id",
            column_mappings={"user_query": "input.value"},
            query_filter="status_code = 'OK'",
        ),
    ],
    query_filter="latency_ms < 5000",  # Task-level filter (AND-ed with evaluator filter)
    is_continuous=True,
)
Parameter reference:
| Parameter | Type | Description |
|---|---|---|
| name | str | Task name. Must be unique within the space. |
| task_type | str | "template_evaluation" or "code_evaluation". |
| evaluators | list | List of evaluators to attach. At least one is required. |
| project | str | Target project name or ID. Required when dataset is not provided. |
| dataset | str | Target dataset name or ID. Required when project is not provided. |
| space | str | Space name or ID used to disambiguate name-based resolution for project and dataset. |
| experiment_ids | list[str] | Required (at least one) when dataset is provided. |
| sampling_rate | float | Fraction of spans to evaluate (0–1). Project-based tasks only. |
| is_continuous | bool | True to run on every new span; False for on-demand only. |
| query_filter | str | Task-level SQL-style filter applied to all evaluators. |
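The target-selection rules in the table (exactly one of project or dataset; experiment_ids only with dataset; sampling_rate only for project-based tasks, between 0 and 1) can be sketched as a client-side check. This helper is hypothetical, not part of the SDK:

```python
def validate_create_args(project=None, dataset=None,
                         experiment_ids=None, sampling_rate=None):
    """Mirror the create() target rules from the parameter reference."""
    if bool(project) == bool(dataset):
        raise ValueError("Provide exactly one of project or dataset.")
    if dataset and not experiment_ids:
        raise ValueError("experiment_ids (at least one) is required "
                         "for dataset-based tasks.")
    if sampling_rate is not None:
        if project is None:
            raise ValueError("sampling_rate applies to project-based "
                             "tasks only.")
        if not 0 <= sampling_rate <= 1:
            raise ValueError("sampling_rate must be between 0 and 1.")


validate_create_args(project="my-project", sampling_rate=0.1)  # passes
```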
Task Runs
Trigger a Run
Trigger an on-demand run for a task. The run starts in "pending" status.
from datetime import datetime

run = client.tasks.trigger_run(
    task="your-task-name-or-id",
    data_start_time=datetime(2024, 1, 1),
    data_end_time=datetime(2024, 2, 1),
)
print(run.id, run.status) # e.g. "run-abc123", "pending"
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| task | str | required | Task name or ID to trigger. |
| space | str | None | Space name or ID used to disambiguate the task lookup. Recommended when resolving by name. |
| data_start_time | datetime | None | Start of the data window to evaluate. |
| data_end_time | datetime | now | End of the data window. Defaults to the current time. |
| max_spans | int | 10000 | Maximum number of spans to process. |
| override_evaluations | bool | False | Re-evaluate data that already has labels. |
| experiment_ids | list[str] | None | Experiment IDs to run against (dataset-based tasks only). |
List Runs
List runs for a task with optional status filtering.
resp = client.tasks.list_runs(
    task="your-task-name-or-id",
    limit=20,
)
for run in resp.task_runs:
    print(run.id, run.status)
Filter to only completed runs:
resp = client.tasks.list_runs(
    task="your-task-name-or-id",
    status="completed",
)
Valid status values: "pending", "running", "completed", "failed", "cancelled".
Get a Run
Retrieve a specific run by its ID.
run = client.tasks.get_run(run_id="your-run-id")
print(run.id, run.status)
Cancel a Run
Cancel a run that is currently "pending" or "running".
run = client.tasks.cancel_run(run_id="your-run-id")
print(run.status) # "cancelled"
Wait for a Run
Poll a run until it reaches a terminal state ("completed", "failed", or "cancelled").
run = client.tasks.wait_for_run(
    run_id="your-run-id",
    poll_interval=5,  # Check every 5 seconds (default)
    timeout=600,      # Give up after 10 minutes (default)
)
print(run.status)  # "completed", "failed", or "cancelled"
Raises TimeoutError if the run does not complete within timeout seconds.
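The polling behavior described above can be sketched as a generic loop. This is an illustration of the documented semantics, not the SDK's actual implementation; it assumes only that get_run is a zero-argument callable returning an object with a status attribute:

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_terminal(get_run, poll_interval=5, timeout=600):
    """Poll get_run() until the run reaches a terminal status.

    Raises TimeoutError if no terminal status is seen within timeout
    seconds, matching the documented wait_for_run behavior.
    """
    deadline = time.monotonic() + timeout
    while True:
        run = get_run()
        if run.status in TERMINAL_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Run still {run.status!r} after {timeout} seconds"
            )
        time.sleep(poll_interval)
```

For example, wait_for_terminal(lambda: client.tasks.get_run(run_id=run.id)) would reproduce the wait_for_run loop against a live run.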
End-to-End: Trigger and Wait
# Trigger an on-demand run
run = client.tasks.trigger_run(task="your-task-name-or-id")

# Block until the run finishes
run = client.tasks.wait_for_run(run_id=run.id)

if run.status == "completed":
    print("Task run completed successfully")
elif run.status == "failed":
    print("Task run failed")
Learn more: Online Evaluations Documentation