
What is a dataset

In Arize, a dataset is the fixed set of examples you rerun in experiments to compare changes to your app over time. It gives you a stable benchmark, so you can tell whether a prompt, model, or pipeline update actually improved results or introduced regressions.
[Image: Datasets page in Arize AX showing a list of datasets with names, row counts, created by, and timestamps]

What to include

A useful dataset blends typical examples that represent everyday traffic, edge cases the app has struggled with (ambiguous inputs, long contexts, unusual formats), and known failures pulled from traces, evaluator results, or reviewer feedback. Without typical examples, you optimize for edge cases and regress on the common path; without failures, you can’t prove a fix actually holds.

Dataset types

You’ll also see datasets described by their source or purpose. These labels overlap and shift as a dataset matures:
  • Regression: Examples where the app has already failed. Use these to verify a fix holds and doesn’t quietly reintroduce the bug.
  • Golden: Inputs with hand-labeled expected outputs — a stable benchmark for comparing prompt and model changes.
  • Synthetic: Generated examples that mimic real inputs. Useful when production data is thin, sensitive, or missing the edge cases you want to stress-test.
A regression set becomes part of a golden dataset once you label the expected output for each row. Collect failures first, label them as you go, and fold in typical traffic so the benchmark isn’t just past bugs.

Dataset row schema

Each row can include input messages, expected outputs, metadata, or any other columns your task function needs. Trace-sourced rows follow the OpenInference convention (e.g., attributes.input.value). CSVs and inline examples use your own column names. Keep them consistent across sources.
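To make the schema concrete, here is a minimal sketch in Python with pandas of rows that mix a trace-sourced, OpenInference-named column with custom ones. Only attributes.input.value comes from the convention above; expected_output and source are illustrative column names, not anything Arize requires.

import pandas as pd

# Two example rows: one pulled from a trace, one written by hand.
# "attributes.input.value" follows the OpenInference convention;
# "expected_output" and "source" are illustrative column names.
rows = pd.DataFrame(
    [
        {
            "attributes.input.value": "How do I reset my password?",
            "expected_output": "Walk the user through the reset-password flow.",
            "source": "trace",
        },
        {
            "attributes.input.value": "Cancel my plan but keep my invoices",
            "expected_output": "Explain cancellation and data-retention options.",
            "source": "manual-edge-case",
        },
    ]
)
print(rows.head())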

Common dataset row shapes

The labels above describe why a row belongs in the dataset. The row itself should match what your task function reads. Common patterns include:
Key-value rows. Use this when the task needs multiple fields such as an input, retrieved context, and an expected output.
Input | Context | Output
What is Paul Graham known for? | Paul Graham is an investor, entrepreneur, and computer scientist known for... | Paul Graham is known for co-founding Y Combinator...
Prompt-completion pairs. Use this for the simplest single-turn completion or classification cases.
Input | Output
"do you have to have two license plates in ontario" | "True"
Messages or chat rows. Use this when your task expects multi-message inputs or outputs.
{
  "input": {
    "messages": [{"role": "system", "content": "You are an expert SQL assistant"}]
  },
  "output": {
    "messages": [{"role": "assistant", "content": "SELECT * FROM users;"}]
  }
}
Choose the shape that matches your task function and keep it consistent within a dataset version.
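As a sketch of what matching the shape to the task function means, the hypothetical task below reads a key-value row. The row is assumed to arrive as a dict with input and context columns, and my_app stands in for your own pipeline; the exact signature your experiment runner expects may differ.

from typing import Dict


def my_app(question: str, context: str = "") -> str:
    # Placeholder for your prompt, model, or pipeline call.
    return f"Answer to: {question}"


def task(dataset_row: Dict) -> str:
    # Read only the columns this dataset version guarantees to exist.
    question = dataset_row["input"]
    context = dataset_row.get("context", "")
    return my_app(question=question, context=context)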

Creating a dataset

Pick the tool you work in. Each tab covers the trace-based, file-upload, and synthetic paths where they apply.
The Arize skills plugin wires dataset and trace workflows into your coding agent through the ax CLI.
From traces. Combine arize-trace with arize-dataset. Try:
  • “Export error spans from the last 7 days in my production-chatbot project and create a dataset called error-regression-v1.”
  • “Find spans where annotation.hallucination.label = 'yes' over the past 14 days and save them as hallucination-examples.”
From a local file. Point the arize-dataset skill at a CSV, JSON, JSONL, or Parquet file you already have; a Python sketch of this path follows the list below. Try:
  • “Create a dataset called billing-qa-v1 from ./data/billing_qa.csv in my support space.”
  • “Append the rows in new_edge_cases.jsonl to my existing edge-cases dataset.”
Generate synthetic rows. Have the agent draft examples for you. Try:
  • “Generate 50 synthetic billing support tickets with query and expected_category fields, then save as support-synthetic-v1.”
  • “Draft 20 adversarial inputs targeting prompt injection for my chat agent and save as adversarial-v1.”
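If you would rather create the dataset from code than through the agent, here is a minimal sketch of the file-upload path using the Arize Python datasets client. Treat the import paths, the client constructor, and the create_dataset arguments as assumptions to verify against your installed SDK version; the key, space ID, file path, and dataset name are placeholders.

import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

# Placeholder credentials and identifiers -- substitute your own.
client = ArizeDatasetsClient(developer_key="YOUR_DEVELOPER_KEY")

# Load the local file (CSV shown; JSON, JSONL, or Parquet load the same way via pandas).
df = pd.read_csv("./data/billing_qa.csv")

# Create the dataset in your space; returns the new dataset's ID.
dataset_id = client.create_dataset(
    space_id="YOUR_SPACE_ID",
    dataset_name="billing-qa-v1",
    dataset_type=GENERATIVE,
    data=df,
)
print(dataset_id)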
[Image: Coding agent running Arize skills via the ax CLI to create datasets from traces and generated examples]

Managing your dataset

Add, edit, export, or delete rows as the app evolves. Datasets are versioned, and appended rows land in the latest version in place.
Use the arize-dataset skill to append, export, or inspect datasets without leaving your editor; a code-level sketch of the export path follows the list below. Try asking your agent:
  • “Append the rows in new_examples.csv to my support-regression dataset.”
  • “Export the latest version of my support-tickets dataset so I can review it offline.”
  • “Show me the schema and the first five rows of my support-qa-v1 dataset.”
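For the export prompt above, a rough code equivalent with the same datasets client is sketched below. get_dataset and its parameters (it may take a dataset ID rather than a name, depending on SDK version) are assumptions to verify, as is the DataFrame return type.

from arize.experimental.datasets import ArizeDatasetsClient

client = ArizeDatasetsClient(developer_key="YOUR_DEVELOPER_KEY")  # placeholder key

# Pull the latest version of the dataset (assumed to return a pandas DataFrame).
df = client.get_dataset(
    space_id="YOUR_SPACE_ID",
    dataset_name="support-tickets",
)

# Inspect the schema and the first rows, then save a copy for offline review.
print(df.dtypes)
print(df.head())
df.to_csv("support-tickets-latest.csv", index=False)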
[Image: Coding agent running the arize-dataset skill via the ax CLI to append new examples to an existing dataset without leaving the editor]

Auto add to dataset

Once the dataset exists, set up rules that automatically add spans when they match your criteria. Auto-add rules keep the dataset current with what’s actually happening in production, without manual curation.

From evaluator labels

After you’ve set up an evaluator on a project, add a post-processing step that routes spans to a dataset based on the evaluator’s result. See Create evaluators for evaluator setup, then edit the evaluator configuration for your task.
[Image: Task configuration page in Arize AX showing the evaluator selection dropdown]
Select Auto Add Spans to Dataset, then specify which eval labels should trigger the addition. For example, all spans where Correctness is Incorrect, or any span where the eval label is not null.
[Image: Evaluator configuration panel in Arize AX with the 'Auto Add Spans to Dataset' option selected and filter criteria entered]

From filter criteria

You can also auto-add spans that match basic filter criteria without an evaluator, such as high token counts, latency above a threshold, or a specific tool call. Use this when the signal is structural rather than labeled.

Next step

Your dataset is in place. Now measure whether prompt, model, or pipeline changes actually improve your AI.

Set up an experiment

Define your baseline, decide what to change, and choose Playground or code.

Further reading