# CI/CD with experiments

Setting up CI/CD pipelines for LLM applications helps you maintain control as they evolve. As in traditional software, automated testing is crucial for catching issues early. With Arize, you can create [experiments](/ax/develop/datasets-and-experiments/run-experiment) that automatically validate a change (a tweak to a prompt, model, or function) against a curated dataset using your preferred evaluation method. These tests can be integrated with [GitHub Actions](/ax/develop/datasets-and-experiments/ci-cd-for-automated-experiments/github-action-basics) or [GitLab CI/CD](/ax/develop/datasets-and-experiments/ci-cd-for-automated-experiments/gitlab-ci-cd-basics) so that they run automatically when you push a change, giving you confidence in your updates without manual testing.

# Setting Up an Automated Experiment

This guide walks you through setting up an automated experiment: preparing your experiment file, defining the task and evaluator, and running the experiment.

<Info>
  To test locally, install the dependencies: `pip install -q arize==7.19.0 arize-phoenix==4.21.0 nest_asyncio packaging openai 'gql[all]'`
</Info>

## 1. Define the Experiment File

The experiment file organizes all components necessary for conducting your experiment. It includes sections for the dataset, task, and evaluator.

**Dataset**

The first step is to set up and retrieve your [dataset](/ax/develop/datasets#datasets):

<CodeGroup>
  ```python Python SDK v8 theme={null}
  from arize import ArizeClient

  client = ArizeClient(api_key="your-arize-api-key")

  # Get the current dataset metadata
  dataset = client.datasets.get(dataset_id=dataset_id)

  # Get the dataset examples
  examples_response = client.datasets.list_examples(
      dataset_id=dataset_id,
      dataset_version_id=dataset_version_id,  # Optional, defaults to latest
  )
  dataset_df = examples_response.to_df()
  ```

  ```python Python SDK v7 theme={null}
  from arize.experimental.datasets import ArizeDatasetsClient

  arize_client = ArizeDatasetsClient(developer_key="your-arize-api-key")

  # Get the current dataset version
  dataset = arize_client.get_dataset(
      space_id="your-arize-space-id", dataset_id=dataset_id, dataset_version="2024-08-11 23:01:04"
  )
  ```
</CodeGroup>
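
In a CI job, the identifiers used above usually come from environment variables rather than hard-coded strings. The following is a minimal sketch, assuming the variable names `ARIZE_API_KEY`, `SPACE_ID`, and `DATASET_ID` (substitute whatever your pipeline defines). `load_ci_config` is a hypothetical helper, not part of the Arize SDK:

```python
import os

# Assumed variable names; adjust to match your CI/CD settings.
REQUIRED_VARS = ("ARIZE_API_KEY", "SPACE_ID", "DATASET_ID")

def load_ci_config(env=None):
    """Return the required settings, failing fast if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing before any API call is made keeps a misconfigured pipeline from reporting a confusing authentication or lookup error later in the run.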

**Task**

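Define the task that your model needs to perform. Typically, the task replicates the LLM functionality you want to test. In this example, the focus is on whether the router selected the correct function, so the task returns the tool call:

```python theme={null}
import json

from openai import OpenAI

# OpenAI client used by the task, named separately from the Arize `client` above.
# Assumption: OPENAI_API_KEY, TASK_MODEL, and avail_tools are defined elsewhere
# in your experiment file.
openai_client = OpenAI(api_key=OPENAI_API_KEY)

def task(example):
    ## You can import directly from your repo to automatically grab the latest version
    from prompt_func.search.search_router import ROUTER_TEMPLATE
    print("running task")
    # Template variables recorded with this dataset example
    # (parsed here; this example does not use them further)
    prompt_vars = json.loads(
        example.dataset_row["attributes.llm.prompt_template.variables"]
    )

    response = openai_client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": ROUTER_TEMPLATE},
        ],
        tools=avail_tools,
    )
    # A list of tool calls, or None if the model did not call a tool
    tool_response = response.choices[0].message.tool_calls
    return tool_response

def run_task(example):
    return task(example)
```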
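
The `tool_calls` value returned by the task is a list of SDK objects (or `None` when the model calls no tool), which can be awkward to store or compare. One option is to flatten it to a JSON string before returning it. The following is a minimal sketch, assuming each tool call exposes `function.name` and `function.arguments` as in the OpenAI Python SDK. `serialize_tool_calls` is a hypothetical helper:

```python
import json

def serialize_tool_calls(tool_calls) -> str:
    """Flatten tool calls into a deterministic JSON string."""
    calls = []
    for call in tool_calls or []:  # tool_calls is None when no tool was selected
        raw_args = call.function.arguments or "{}"
        try:
            # Arguments arrive as a JSON-encoded string
            args = json.loads(raw_args)
        except json.JSONDecodeError:
            # Keep the raw text if the model produced invalid JSON
            args = raw_args
        calls.append({"name": call.function.name, "arguments": args})
    return json.dumps(calls, sort_keys=True)
```

Sorting keys keeps the string stable across runs, so two outputs with the same function and arguments compare equal.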

**Evaluator**

An evaluator serves as the measure of success for your experiment. You can define multiple evaluators, ranging from [LLM-based judges](/ax/evaluate/evaluators/llm-as-a-judge) to [code-based evaluations](/ax/evaluate/evaluators/code-evaluations). The evaluator is central to testing and validating the outcomes of your experiment:

<CodeGroup>
  ```python Python SDK v8 theme={null}
  import pandas as pd
  from phoenix.evals import llm_classify, OpenAIModel
  from arize.experiments import EvaluationResult

  def function_selection(output, dataset_row, **kwargs) -> EvaluationResult:
      print("evaluating outputs")
      expected_output = dataset_row["attributes.llm.output_messages"]
      df_in = pd.DataFrame(
          {"selected_output": output, "expected_output": expected_output}, index=[0]
      )
      rails = ["incorrect", "correct"]
      expect_df = llm_classify(
          dataframe=df_in,
          template=EVALUATOR_TEMPLATE,
          model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
          rails=rails,
          provide_explanation=True,
      )

      label = expect_df["label"][0]
      score = 1 if label == "correct" else 0
      explanation = expect_df["explanation"][0]
      return EvaluationResult(score=score, label=label, explanation=explanation)
  ```

  ```python Python SDK v7 theme={null}
  import pandas as pd
  from phoenix.evals import llm_classify, OpenAIModel
  from arize.experimental.datasets.experiments.evaluators.base import EvaluationResult

  def function_selection(output, dataset_row, **kwargs) -> EvaluationResult:
      print("evaluating outputs")
      expected_output = dataset_row["attributes.llm.output_messages"]
      df_in = pd.DataFrame(
          {"selected_output": output, "expected_output": expected_output}, index=[0]
      )
      rails = ["incorrect", "correct"]
      expect_df = llm_classify(
          dataframe=df_in,
          template=EVALUATOR_TEMPLATE,
          model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
          rails=rails,
          provide_explanation=True,
      )

      label = expect_df["label"][0]
      score = 1 if label == "correct" else 0
      explanation = expect_df["explanation"][0]
      return EvaluationResult(score=score, label=label, explanation=explanation)
  ```
</CodeGroup>
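
Not every check needs an LLM judge. A code-based comparison can be cheaper and deterministic when the expected answer is a known function name. The following is a minimal sketch, assuming both the task output and the expected value are plain strings containing the function name. `score_function_name` is a hypothetical helper, and its result would still need to be wrapped in the `EvaluationResult` shown above:

```python
def score_function_name(output: str, expected: str):
    """Return (score, label, explanation) for an exact, case-insensitive match."""
    matched = output.strip().lower() == expected.strip().lower()
    label = "correct" if matched else "incorrect"
    explanation = f"expected {expected!r}, got {output!r}"
    return (1 if matched else 0), label, explanation
```

The labels mirror the `rails` used by the LLM-based evaluator, so both kinds of evaluator report results on the same scale.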

### Run the Experiment

Configure and run your experiment with `client.experiments.run` (SDK v8) or `run_experiment` (SDK v7):

<CodeGroup>
  ```python Python SDK v8 theme={null}
  experiment, experiment_df = client.experiments.run(
      name="Your_Experiment_Name",
      dataset_id=dataset_id,
      task=run_task,
      evaluators=[function_selection],
  )
  ```

  ```python Python SDK v7 theme={null}
  experiment = arize_client.run_experiment(
      space_id="your-arize-space-id",
      dataset_id=dataset_id,
      task=run_task,
      evaluators=[function_selection],
      experiment_name="Your_Experiment_Name"
  )
  ```
</CodeGroup>

### Advanced Experiment Management

You can retrieve information about existing experiments using a GraphQL query. This is useful for tracking experiment history and performance.

```python theme={null}
from gql import Client, gql
from gql.transport.requests import RequestsHTTPTransport

def fetch_experiment_details(gql_client, dataset_id):
    experiments_query = gql(
        """
        query getExperimentDetails($DatasetId:ID!){
        node(id: $DatasetId) {
            ... on Dataset {
            name
            experiments(first: 1){
                edges{
                node{
                    name
                    createdAt
                    evaluationScoreMetrics{
                        name
                        meanScore
                    }
                }
                }
            }
            }
        }
        }
        """
    )

    params = {"DatasetId": dataset_id}
    response = gql_client.execute(experiments_query, params)
    experiments = response["node"]["experiments"]["edges"]
    
    experiments_list = []
    for experiment in experiments:
        node = experiment["node"]
        experiment_name = node["name"]
        for metric in node["evaluationScoreMetrics"]:
            experiments_list.append([
                experiment_name,
                metric["name"],
                metric["meanScore"]
            ])
    
    return experiments_list
```

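This function returns a list of `[experiment_name, metric_name, mean_score]` rows, one per evaluation metric. Because the query requests `experiments(first: 1)`, only the first experiment returned for the dataset is included.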
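
A CI script typically needs just one metric's score from that list before applying a threshold. The following is a minimal sketch, assuming rows of the form `[experiment_name, metric_name, mean_score]` as built by `fetch_experiment_details`. `find_mean_score` is a hypothetical helper, and the sample row and the metric name `function_selection` are made up for illustration:

```python
def find_mean_score(experiments_list, metric_name):
    """Return the mean score for metric_name, or None if it is absent."""
    for _experiment_name, name, mean_score in experiments_list:
        if name == metric_name:
            return mean_score
    return None

# Illustrative data only; real rows come from fetch_experiment_details.
rows = [["AI Search V1.1", "function_selection", 0.82]]
score = find_mean_score(rows, "function_selection")
```

Returning `None` for a missing metric lets the caller decide whether that should fail the pipeline.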

**Determine Experiment Success**

You can use the mean score from an experiment to automatically determine if it passed or failed:

```python theme={null}
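import sys

def determine_experiment_success(experiment_result):
    # Pass only when the mean score is strictly above 0.7
    success = experiment_result > 0.7
    # Exit code 0 passes the CI job; any non-zero code fails it
    sys.exit(0 if success else 1)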
```

This function exits with code 0 if the experiment is successful (score > 0.7) or code 1 if it fails.
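
Because `sys.exit` ends the process, it can help to keep the pass/fail decision in a separate function that is easy to test, and to treat a missing score as a failure. The following is a minimal sketch. `experiment_passed`, `exit_with_status`, and the `threshold` parameter are illustrative names, with the default mirroring the 0.7 cutoff above:

```python
import sys

def experiment_passed(mean_score, threshold=0.7):
    """True only when a score exists and is strictly above the threshold."""
    return mean_score is not None and mean_score > threshold

def exit_with_status(mean_score, threshold=0.7):
    # A non-zero exit code marks the CI job as failed.
    sys.exit(0 if experiment_passed(mean_score, threshold) else 1)
```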

**Auto-increment Experiment Names**

To ensure unique experiment names, you can automatically increment the version number:

```python theme={null}
import re

def increment_experiment_name(experiment_name):
    ## example name: AI Search V1.1
    match = re.search(r"V(\d+)\.(\d+)", experiment_name)
    if not match:
        return experiment_name

    major, minor = map(int, match.groups())
    new_version = f"V{major}.{minor + 1}"
    return re.sub(r"V\d+\.\d+", new_version, experiment_name)
```
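
For reference, here is how this function behaves on a few names. The snippet repeats the definition so that it runs on its own; the example names are illustrative:

```python
import re

def increment_experiment_name(experiment_name):
    match = re.search(r"V(\d+)\.(\d+)", experiment_name)
    if not match:
        return experiment_name
    major, minor = map(int, match.groups())
    return re.sub(r"V\d+\.\d+", f"V{major}.{minor + 1}", experiment_name)

print(increment_experiment_name("AI Search V1.1"))  # AI Search V1.2
print(increment_experiment_name("AI Search V1.9"))  # AI Search V1.10
print(increment_experiment_name("AI Search"))       # AI Search
```

Two behaviors are worth noting: the minor version never rolls over into the major version, and a name without a `V<major>.<minor>` suffix is returned unchanged, so it will not be unique on the next run.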

## 2. Define Workflow (CI/CD) File

### GitHub Actions

* Workflow files are stored in the `.github/workflows` directory of your repository.
* Workflow files use YAML syntax and have a `.yml` extension.

**Example Workflow File:**

```yaml theme={null}
name: AI Search - Correctness Check

on:
  push:
    paths:
      - 'copilot/search/**'


jobs:
  run-script:
    runs-on: ubuntu-latest
    env:
      OPENAI_KEY: ${{ secrets.OPENAI_KEY }}  

    steps:
    - name: Checkout repository
      uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.10'

    - name: Install dependencies
      run: |
        pip install -q arize==7.36.0 arize-phoenix==4.29.0 nest_asyncio packaging openai 'gql[all]'
    - name: Run script
      run: python ./copilot/experiments/ai_search_test.py
```

### GitLab CI/CD

GitLab CI/CD pipelines are defined in a `.gitlab-ci.yml` file stored in the root of your repository. You can use YAML syntax to define your pipeline.

**Example `.gitlab-ci.yml` File:**

```yaml theme={null}
stages:
  - test

variables:
  # These variables need to be defined in GitLab CI/CD settings
  # The $ syntax is how GitLab references variables
  OPENAI_API_KEY: $OPENAI_API_KEY
  ARIZE_API_KEY: $ARIZE_API_KEY
  SPACE_ID: $SPACE_ID
  DATASET_ID: $DATASET_ID

llm-experiment-job:
  stage: test
  image: python:3.10
  # The 'only' directive specifies when this job should run
  # This will run for merge requests that change files in copilot/search
  only:
    refs:
      - merge_requests
    changes:
      - copilot/search/**/*
  script:
    - pip install -q arize==7.36.0 arize-phoenix==4.29.0 nest_asyncio packaging openai 'gql[all]'
    - python ./copilot/experiments/ai_search_test.py
  artifacts:
    paths:
      - experiment_results.json
    expire_in: 1 week
```
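
The `artifacts` section only uploads `experiment_results.json` if the script actually writes that file. The following is a minimal sketch of doing so at the end of the experiment script, assuming results are available as `[experiment_name, metric_name, mean_score]` rows like those returned by `fetch_experiment_details`. `write_results` is a hypothetical helper, and the JSON layout is an illustrative choice:

```python
import json
from pathlib import Path

def write_results(rows, path="experiment_results.json"):
    """Write one JSON object per metric so the CI job can archive it."""
    payload = [
        {"experiment": experiment, "metric": metric, "mean_score": mean_score}
        for experiment, metric, mean_score in rows
    ]
    Path(path).write_text(json.dumps(payload, indent=2))
    return payload
```

Write the file before calling `sys.exit`, since the exit ends the script immediately.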