# Create a dataset

> Create datasets from CSV, spans, code, or synthetic generation

There are four ways to load data into a dataset:

1. [Create a dataset from CSV](/ax/develop/datasets/how-to-datasets#create-a-dataset-from-csv)
2. [Create a dataset from your spans](/ax/develop/datasets/how-to-datasets#create-a-dataset-from-your-spans)
3. [Create a dataset with code](/ax/develop/datasets/how-to-datasets#create-a-dataset-with-code)
4. [Create a synthetic dataset](/ax/develop/datasets/how-to-datasets#create-a-synthetic-dataset)

# Create a dataset from CSV

You can upload a CSV file as a dataset in Arize AX. The columns in the file can then be accessed in experiments or in the prompt playground.

<Frame>
  <video
    src="https://storage.googleapis.com/arize-phoenix-assets/assets/videos/create-dataset-csv.mp4"
    width="100%"
    height="100%"
    style={{
  display: 'block',
  objectFit: 'fill',
  backgroundColor: 'transparent'
}}
    controls
    autoPlay
    muted
    loop
  />
</Frame>
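
If you prepare the CSV in code before uploading it, a minimal sketch using Python's standard `csv` module might look like the following. The file name `inventions.csv` and the column names `input` and `expected_output` are illustrative assumptions, not names that Arize AX requires.

```python
import csv

# Hypothetical example rows; the column names are illustrative only.
rows = [
    {"input": "Telephone", "expected_output": "Alexander Graham Bell"},
    {"input": "Light Bulb", "expected_output": "Thomas Edison"},
]

# Write a header row followed by one line per example.
with open("inventions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected_output"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the header and row count before uploading.
with open("inventions.csv", newline="", encoding="utf-8") as f:
    check = list(csv.DictReader(f))

print(list(check[0].keys()), len(check))
```

Reading the file back is a quick way to catch a missing header or a stray column before you upload it.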

***

# Create a dataset from your spans

Arize AX supports adding spans from your projects to datasets. The trace data from an application with errors or faulty evals can become fuel for ongoing development. You can use our [tracing filters](/ax/observe/tracing/how-to-query-traces/filter-traces) or ✨[AI Search](/ax/alyx) to curate your dataset.

<Frame>
  <video
    src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/create_dataset_spans.mp4"
    width="100%"
    height="100%"
    style={{
  display: 'block',
  objectFit: 'fill',
  backgroundColor: 'transparent'
}}
    controls
    autoPlay
    muted
    loop
  />
</Frame>

***

# Create a dataset with code

If you'd like to create your datasets programmatically, you can use our [clients](https://arize-client-python.readthedocs.io/en/latest/llm-api/datasets.html) to create, update, and delete datasets.

To start, let's install the packages we need:

<CodeGroup>
  ```bash Python SDK v8 theme={null}
  pip install --pre arize pandas
  ```

  ```bash Python SDK v7 theme={null}
  pip install "arize[Datasets]" pandas
  ```
</CodeGroup>
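
Because the snippets on this page differ between SDK v7 and v8, it can help to confirm which version of `arize` is installed. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


arize_version = installed_version("arize")
if arize_version is None:
    print("arize is not installed")
else:
    # The first dotted component is the major version, for example "7" or "8".
    major = arize_version.split(".")[0]
    print(f"arize {arize_version} (major version {major})")
```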

You can get your API key by navigating to the "Settings" page.

<Frame>
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/datasets-get-api-key.png" alt="" />
</Frame>

Let's set up the Arize Dataset Client to create or update a dataset. See the [Python client overview](/api-clients/python/overview) for the API reference.

<CodeGroup>
  ```python Python SDK v8 theme={null}
  from arize import ArizeClient

  client = ArizeClient(api_key="your-arize-api-key")
  ```

  ```python Python SDK v7 theme={null}
  from arize.experimental.datasets import ArizeDatasetsClient

  client = ArizeDatasetsClient(api_key="your-arize-api-key")
  ```
</CodeGroup>
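
To avoid hardcoding credentials in a script, you could read them from environment variables instead. This is a minimal sketch, not part of the Arize SDK: the helper `require_env` is hypothetical, and the variable names `ARIZE_API_KEY` and `ARIZE_SPACE_ID` are naming assumptions for this example.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this script")
    return value


# Usage (variable names are assumptions for this sketch):
#   api_key = require_env("ARIZE_API_KEY")
#   space_id = require_env("ARIZE_SPACE_ID")
#   client = ArizeClient(api_key=api_key)  # v8 client from the snippet above
```

Failing early with a descriptive error is usually easier to debug than an authentication failure later in the script.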

You can create many different kinds of datasets. The examples below are ordered from simplest to most complex.

<Tabs>
  <Tab title="Simple dataset">
    This is a simple dataset with just string values for the columns.

    <CodeGroup>
      ```python Python SDK v8 theme={null}
      import pandas as pd

      # Example dataset
      inventions_dataset = pd.DataFrame({
          "attributes.input.value": ["Telephone", "Light Bulb"],
          "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
      })

      dataset = client.datasets.create(
          space_id="your-arize-space-id",
          name="test_invention_dataset",
          examples=inventions_dataset,
      )
      dataset_id = dataset.id
      ```

      ```python Python SDK v7 theme={null}
      import pandas as pd
      from arize.experimental.datasets.utils.constants import GENERATIVE

      # Example dataset
      inventions_dataset = pd.DataFrame({
          "attributes.input.value": ["Telephone", "Light Bulb"],
          "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
      })

      dataset_id = client.create_dataset(
          space_id="your-arize-space-id",
          dataset_name="test_invention_dataset",
          dataset_type=GENERATIVE,
          data=inventions_dataset
      )
      ```
    </CodeGroup>
  </Tab>

  <Tab title="Dataset with prompt template & variables">
    Datasets in Arize AX support flexible columns, so you can also add the prompt template and its variables to each row.

    In this example, we are setting `attributes.llm.prompt_template.variables`. We are using the [OpenInference](https://github.com/Arize-ai/openinference) semantic conventions and Arize AX will automatically import these as input variables.

    <CodeGroup>
      ```python Python SDK v8 theme={null}
      import pandas as pd
      import json

      PROMPT_TEMPLATE = """
      You are an expert in the history of technological inventions.
      Identify the individual or organization that created the following invention.

      Invention: {invention}
      """

      data = [
          {
              "attributes.llm.prompt_template.template": PROMPT_TEMPLATE,
              "attributes.llm.prompt_template.variables": json.dumps({
                  "invention": "Telephone",
              }),
              "attributes.output.value": "Alexander Graham Bell"
          }
      ]

      df = pd.DataFrame(data)

      dataset = client.datasets.create(
          space_id="your-arize-space-id",
          name="prompt_invention_dataset",
          examples=df,
      )
      dataset_id = dataset.id
      ```

      ```python Python SDK v7 theme={null}
      import pandas as pd
      import json
      from arize.experimental.datasets.utils.constants import GENERATIVE

      PROMPT_TEMPLATE = """
      You are an expert in the history of technological inventions.
      Identify the individual or organization that created the following invention.

      Invention: {invention}
      """

      data = [
          {
              "attributes.llm.prompt_template.template": PROMPT_TEMPLATE,
              "attributes.llm.prompt_template.variables": json.dumps({
                  "invention": "Telephone",
              }),
              "attributes.output.value": "Alexander Graham Bell"
          }
      ]

      df = pd.DataFrame(data)

      dataset_id = client.create_dataset(
          space_id="your-arize-space-id",
          dataset_name="prompt_invention_dataset",
          dataset_type=GENERATIVE,
          data=df
      )
      ```
    </CodeGroup>
  </Tab>
</Tabs>
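
When you build many rows with a prompt template, it can be useful to check that each row's variables match the placeholders in the template before uploading. The sketch below assumes the template uses Python `str.format`-style placeholders with simple names, as in the example above; the helpers `template_fields` and `build_row` are hypothetical and not part of the Arize SDK.

```python
import json
from string import Formatter


def template_fields(template: str) -> set:
    """Return the named placeholders used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def build_row(template: str, variables: dict, expected_output: str) -> dict:
    """Build one dataset row, checking that variables match the template placeholders."""
    fields = template_fields(template)
    provided = set(variables)
    missing = fields - provided
    extra = provided - fields
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return {
        "attributes.llm.prompt_template.template": template,
        # Variables are stored as a JSON string, as in the tab example above.
        "attributes.llm.prompt_template.variables": json.dumps(variables),
        "attributes.output.value": expected_output,
    }


TEMPLATE = "Identify who created the following invention.\n\nInvention: {invention}"

rows = [
    build_row(TEMPLATE, {"invention": "Telephone"}, "Alexander Graham Bell"),
    build_row(TEMPLATE, {"invention": "Light Bulb"}, "Thomas Edison"),
]
```

The resulting `rows` list can be passed to `pd.DataFrame(rows)` and uploaded in the same way as the examples in the tabs above.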

***

# Create a synthetic dataset

In some cases, the data you have might not be enough to cover all the scenarios you want to test. This is where you can use **Alyx** for Synthetic Dataset Generation:

* **Suggested Prompt:** “Generate a synthetic dataset of 20 examples that cover...”
* **Use When:** You need labeled examples to test, fine-tune, or evaluate prompts without relying on real user data.
* **Description:** Creates artificial examples that mimic real-world scenarios, enabling faster experimentation.

You can save your generated examples as a dataset and test them directly in the playground.

<Frame>
  <video
    src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/generate-synthetic-dataset.mp4"
    width="100%"
    height="100%"
    style={{
  display: 'block',
  objectFit: 'fill',
  backgroundColor: 'transparent'
}}
    controls
    autoPlay
    muted
    loop
  />
</Frame>
