Compare Experiments: two Playground runs side by side with per-row latency, token counts, model outputs, and evaluator labels

### On a dataset

When modifying a prompt in the **Playground**, you can test the new prompt across a dataset of examples to confirm that performance improves on challenging examples without regressing on core business use cases.

1. Follow [**Build a dataset**](/ax/improve/build-a-dataset) to upload your dataset to **Arize AX**.
2. Go back to the **Prompt Playground** and choose your dataset from the **Select a Dataset** dropdown.
3. **Load a prompt** from **Prompt Hub** using the **Select a template from Prompt Hub** dropdown, **or** fill in a new prompt (see [Build a prompt](/ax/improve/build-a-prompt)). Include variables from your dataset in the prompt inside curly braces (for example `{destination}`).
4. **Attach an evaluator** to score each row: [code-based](/ax/evaluate/evaluators/code-evaluations) for deterministic checks, or [**LLM-as-a-Judge**](/ax/evaluate/evaluators/llm-as-a-judge) for qualitative scoring.
5. Click **Run**.
6. Open **View Experiment** to compare results and look for patterns.

For each row, the Playground substitutes the row's values into the template, calls your model, and runs every attached [evaluator](/ax/evaluate/evaluators) on the generated output.
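The per-row flow described above can be sketched in plain Python. This is only an illustration of the idea, not the Arize AX API: `fake_model`, `mentions_destination`, `run_experiment`, the sample rows, and the template text are all hypothetical, and Python's `str.format` stands in for the Playground's own variable substitution, which may behave differently.

```python
# Illustrative sketch only: none of these names come from Arize AX.

# Hypothetical dataset rows; each column name doubles as a template variable.
dataset = [
    {"destination": "Lisbon"},
    {"destination": "Kyoto"},
]

# Hypothetical prompt template using a curly-brace variable.
template = "Plan a one-day itinerary for {destination}."


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return f"Itinerary: {prompt}"


def mentions_destination(row: dict, output: str) -> str:
    """A code-based evaluator: a deterministic label for one row."""
    return "pass" if row["destination"] in output else "fail"


def run_experiment(rows, prompt_template, model, evaluators):
    """Fill the template, call the model, and score the output for each row."""
    results = []
    for row in rows:
        prompt = prompt_template.format(**row)  # substitute row values
        output = model(prompt)                  # generate an output
        labels = {name: fn(row, output) for name, fn in evaluators.items()}
        results.append({"prompt": prompt, "output": output, "labels": labels})
    return results


results = run_experiment(
    dataset,
    template,
    fake_model,
    {"mentions_destination": mentions_destination},
)

for result in results:
    print(result["prompt"], result["labels"])
```

Each entry in `results` pairs a filled-in prompt with the model output and one label per evaluator, which is roughly the shape of a row in an experiment comparison view.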

Prompt Playground with a hub prompt template, topic variable in the user message, haiku topics dataset selected, and per-row values in Input Variables