# Evaluation Levels — Span, Trace, and Session

> Span, trace, and session evaluations answer different shapes of question. Picking the right level is the first design decision when building an evaluator.

Every evaluator runs at a **level**: span, trace, or session. Each level exposes a different slice of data to the evaluator. Pick a level that is too narrow and the evaluator can't see enough data to judge; pick one that is too wide and it sees more than it needs and loses focus.

This page covers the conceptual distinction. For the language of spans, traces, and sessions themselves, see [Signals, spans, traces, and sessions](/ax/concepts/otel-openinference/signals) in the OpenTelemetry concepts section.

# The three levels

| Level       | Scope                                                                                | The shape of question it answers                  |
| :---------- | :----------------------------------------------------------------------------------- | :------------------------------------------------ |
| **Span**    | One unit of work — a single LLM call, tool invocation, or retrieval.                 | *Was* ***this*** *thing done correctly?*          |
| **Trace**   | The full tree of spans for one request, root to leaf.                                | *Was the* ***sequence*** *of work the right one?* |
| **Session** | A collection of traces sharing a `session.id` — typically a multi-turn conversation. | *Did the* ***conversation*** *go well overall?*   |

The level you pick is determined by where the data you need to judge actually lives. If everything you need to answer the question is in one span, use a span evaluator. If you need to look across multiple spans within a single request, use a trace evaluator. If you need to look across multiple requests in the same conversation, use a session evaluator.

# Span evaluators

Span evaluators ask questions about a single unit of work. They are the most common kind and the easiest to reason about — the evaluator sees one span, has access to that span's attributes, and emits a score.

Typical questions a span evaluator answers:

* Was the right tool selected for this call?
* Were the tool parameters extracted correctly?
* Was this individual LLM response factually correct?
* Was this retrieval relevant to the query that triggered it?
* Did this guardrail decision make sense?

The data they read lives entirely under one span's attributes — `attributes.llm.input_messages`, `attributes.llm.output_messages`, `attributes.tool.name`, `attributes.retrieval.documents`, and so on. See [Semantic conventions](/ax/concepts/otel-openinference/semantic-conventions) for the full attribute namespace.
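As a rough sketch of the idea, a span evaluator can be thought of as a function that takes one span and returns a score. The example below is illustrative only: the flat dictionary standing in for a span, the `tool_selection_eval` function, and the `expected_tool` argument are assumptions made for this sketch, not part of any Arize API. A production evaluator would more likely hand the same attributes to an LLM judge.

```python
from typing import Any


def tool_selection_eval(span: dict[str, Any], expected_tool: str) -> dict[str, Any]:
    """Score one tool span using only that span's own attributes.

    Hypothetical helper: the span is modeled as a flat dict keyed by
    attribute path, which is an assumption of this sketch.
    """
    actual_tool = span.get("attributes.tool.name")
    correct = actual_tool == expected_tool
    return {
        "label": "correct" if correct else "incorrect",
        "score": 1.0 if correct else 0.0,
        "explanation": f"expected {expected_tool!r}, got {actual_tool!r}",
    }


# Everything the evaluator needs is on this single span.
span = {"attributes.tool.name": "search_flights"}
result = tool_selection_eval(span, expected_tool="search_flights")
```

Nothing outside the span is consulted, which is what makes this a span-level question.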

# Trace evaluators

Trace evaluators ask questions about the *shape* of a request — how the application got from the user's input to the final output. They see every span in the trace and judge the whole path.

Typical questions a trace evaluator answers:

* Was the order of tool calls correct?
* Was the agent's trajectory efficient, or did it loop?
* Did the application skip a step it should have taken?
* Was the chain of LLM calls, retrievals, and tool calls coherent?

Trace evaluators are what you reach for when individual spans look fine but the way they fit together doesn't. A span-level "tool call correctness" eval might pass on every tool call individually while the agent calls tools in a nonsensical order. No span evaluator can catch that, because the ordering is only visible across spans; a trace evaluator can.
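To make the tool-order case concrete, here is a minimal sketch of a code-based trace check. It assumes a trace is available as a list of flat span dictionaries with a `start_time` field and an `attributes.tool.name` key; those shapes, the `tool_order_eval` function, and the example tool names are hypothetical and chosen only for illustration.

```python
from typing import Any


def tool_order_eval(
    trace_spans: list[dict[str, Any]], expected_order: list[str]
) -> dict[str, Any]:
    """Check that the expected tools were called in the expected relative order.

    Hypothetical helper: span shape and field names are assumptions.
    """
    # Sort by start time so the list reflects execution order.
    ordered = sorted(trace_spans, key=lambda s: s["start_time"])
    # Keep only spans that carry a tool name (skips LLM and retrieval spans).
    called = [s["attributes.tool.name"] for s in ordered if "attributes.tool.name" in s]
    # Subsequence test: each expected tool must appear after the previous one.
    remaining = iter(called)
    in_order = all(tool in remaining for tool in expected_order)
    return {
        "label": "in_order" if in_order else "out_of_order",
        "score": 1.0 if in_order else 0.0,
        "explanation": f"tools called: {called}",
    }


# Each tool call may be valid on its own, but booking happens before searching.
trace_spans = [
    {"start_time": 0, "name": "llm_call"},
    {"start_time": 1, "attributes.tool.name": "book_flight"},
    {"start_time": 2, "attributes.tool.name": "search_flights"},
]
result = tool_order_eval(trace_spans, expected_order=["search_flights", "book_flight"])
```

The check needs every tool span in the request at once, so it cannot be expressed as a span evaluator.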

# Session evaluators

Session evaluators look at a whole conversation. They see every trace that shares a `session.id` — typically every turn of a multi-turn chat — and judge the conversation as a unit.

Typical questions a session evaluator answers:

* Is the conversation coherent across turns?
* Did the assistant maintain a consistent tone?
* Did the conversation reach resolution, or did it stall?
* Did the user appear frustrated by the end?

Session-level questions are the ones individual traces can't answer because they require the *history*. A single turn of a chat might be perfectly fine on its own and still be unhelpful in the context of the previous five turns.

Sessions only exist when your application is instrumented to set `session.id`. See [OpenInference context managers](/ax/concepts/otel-openinference/context-managers) for how that gets set.
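The grouping a session evaluator relies on can be sketched as follows. This is an illustration under stated assumptions, not how Arize implements it: each trace is modeled as a dictionary with hypothetical `session.id`, `start_time`, `input`, and `output` fields, and `group_by_session` and `build_transcript` are made-up helper names.

```python
from collections import defaultdict
from typing import Any


def group_by_session(traces: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Group traces by session.id and order each group by start time."""
    sessions: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for trace in traces:
        session_id = trace.get("session.id")
        if session_id is None:
            # Traces without a session.id cannot take part in session evaluation.
            continue
        sessions[session_id].append(trace)
    for turns in sessions.values():
        turns.sort(key=lambda t: t["start_time"])
    return dict(sessions)


def build_transcript(turns: list[dict[str, Any]]) -> str:
    """Flatten the ordered turns into the text a judge would read."""
    lines = []
    for turn in turns:
        lines.append(f"user: {turn['input']}")
        lines.append(f"assistant: {turn['output']}")
    return "\n".join(lines)


traces = [
    {"session.id": "s1", "start_time": 2, "input": "That didn't work.", "output": "Try resetting it."},
    {"session.id": "s1", "start_time": 1, "input": "My router is down.", "output": "Try restarting it."},
    {"start_time": 3, "input": "Hello", "output": "Hi"},  # not instrumented with session.id
]
sessions = group_by_session(traces)
transcript = build_transcript(sessions["s1"])
```

The uninstrumented trace is dropped from every group, which is why setting `session.id` is a prerequisite for this level.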

# Picking the level

A simple decision tree:

1. Can you answer the question by looking at one span in isolation? → **Span**
2. Do you need to compare spans within the same request? → **Trace**
3. Do you need to look across multiple requests in the same conversation? → **Session**
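The decision tree above can be written down as a small function. This is only a sketch of the reasoning; the `Level` enum and `pick_level` function are hypothetical names, with the enum values borrowed from the `span`, `trace`, and `session` granularity values.

```python
from enum import Enum


class Level(Enum):
    SPAN = "span"
    TRACE = "trace"
    SESSION = "session"


def pick_level(needs_multiple_spans: bool, needs_multiple_requests: bool) -> Level:
    """Return the narrowest level that still contains the data the question needs."""
    if needs_multiple_requests:
        # The question spans several requests in one conversation.
        return Level.SESSION
    if needs_multiple_spans:
        # The question compares spans inside a single request.
        return Level.TRACE
    # One span in isolation is enough.
    return Level.SPAN


tool_correctness = pick_level(needs_multiple_spans=False, needs_multiple_requests=False)
agent_trajectory = pick_level(needs_multiple_spans=True, needs_multiple_requests=False)
conversation_tone = pick_level(needs_multiple_spans=True, needs_multiple_requests=True)
```

The function only widens the scope when the question demands it, which mirrors the advice to pick the narrowest level that works.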

The level isn't a property of the evaluator template — it's a property of the **evaluator task**. The underlying knob is the `--data-granularity` flag (values: `span`, `trace`, `session`). All three are first-class options.

A few subtleties to watch for:

* **Scope creep at design time.** It's tempting to use a session evaluator "to be safe" when a span evaluator would do. Wider scope means more tokens passed to the judge, higher cost, and lower signal because the judge has to pick the needle out of more hay. Pick the narrowest level that can answer the question.
* **Mixing levels for one application.** Most production applications need evaluators at multiple levels — tool-correctness at the span level, agent-trajectory at the trace level, conversation-tone at the session level. They run in parallel and don't interfere with each other.
* **Cost and latency scale with level.** A span evaluator might see a few hundred tokens of context. A session evaluator might see tens of thousands. This matters when you're picking the judge model and the sampling rate.
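A back-of-the-envelope calculation shows how level and sampling rate interact. All figures below are invented for illustration (they are not measurements), and `judge_tokens_per_day` is a hypothetical helper.

```python
def judge_tokens_per_day(units_per_day: int, tokens_per_unit: int, sampling_rate: float) -> int:
    """Estimate the judge input tokens consumed per day at a given level.

    units_per_day: spans, traces, or sessions produced per day
    tokens_per_unit: context tokens the judge reads for one unit
    sampling_rate: fraction of units that are actually evaluated (0.0 to 1.0)
    """
    return round(units_per_day * sampling_rate * tokens_per_unit)


# Assumed workload: many small spans versus far fewer, much larger sessions.
span_tokens = judge_tokens_per_day(units_per_day=100_000, tokens_per_unit=300, sampling_rate=0.1)
session_tokens = judge_tokens_per_day(units_per_day=2_000, tokens_per_unit=20_000, sampling_rate=0.1)
```

With these assumed numbers, the session evaluator consumes more judge tokens than the span evaluator despite running on fifty times fewer units, so the sampling rate and judge model are worth setting per level.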

***

## Next step

Levels tell you *where* to look. The next decision is what kind of evaluator does the looking:

<Card title="Next: Evaluator Types" icon="arrow-right" href="/ax/concepts/evaluators/evaluator-types" />