5. Tutorial 5: Run Your First Experiment

Note

Tutorial (15-20 minutes)

This is a hands-on tutorial that takes you step-by-step through running your first experiment with HoneyHive. You’ll create a working example and see results in the dashboard.

5.1. What You’ll Learn

By the end of this tutorial, you’ll know how to:

  • Run an experiment with evaluate()

  • Structure test data with inputs and ground truths

  • Create evaluators to automatically score outputs

  • View metrics and scores in HoneyHive dashboard

  • Compare different versions of your function

5.2. What You’ll Build

A complete question-answering experiment with automated evaluation. You’ll:

  1. Create a baseline QA function

  2. Test it against a dataset

  3. Add evaluators to automatically score outputs

  4. Compare baseline vs improved version using metrics

  5. View results and metrics in HoneyHive dashboard

5.3. Prerequisites

Before starting this tutorial, you should:

  • Complete Set Up Your First Tracer

  • Have Python 3.11 or higher installed

  • Have a HoneyHive API key

  • Basic familiarity with Python dictionaries

If you haven’t set up the SDK yet, go back to Tutorial 1.

5.4. Step 1: Install and Setup

First, create a new Python file for this tutorial:

touch my_first_experiment.py

Add the necessary imports and setup:

# my_first_experiment.py
import os
from typing import Any, Dict
from honeyhive.experiments import evaluate

# Set your API key
os.environ["HH_API_KEY"] = "your-api-key-here"
os.environ["HH_PROJECT"] = "experiments-tutorial"

Tip

Store your API key in a .env file instead of hardcoding it. See Production Deployment Guide for production best practices.
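If you'd rather not add a dependency, the idea behind a .env file is simple enough to sketch with the standard library alone. This loader is an illustrative assumption, not part of the HoneyHive SDK; in practice the python-dotenv package (`load_dotenv()`) is the usual choice:

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader (sketch): reads KEY=VALUE lines, skips comments."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault: real environment variables win over .env entries
        os.environ.setdefault(key.strip(), value.strip())

load_env_file()  # looks for ./.env; silently does nothing if absent
```

With a `.env` containing `HH_API_KEY=...` and `HH_PROJECT=...`, the two `os.environ` assignments above become unnecessary.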

5.5. Step 2: Define Your Function

Create a simple function that answers questions. This will be the function we test in our experiment:

def answer_question(datapoint: Dict[str, Any]) -> Dict[str, Any]:
    """Answer a trivia question.

    This is the function we'll test in our experiment.

    Args:
        datapoint: Contains 'inputs' with the question

    Returns:
        Dictionary with the answer

    Note:
        The evaluation function can also accept a 'tracer' parameter if you need
        to access the tracer instance within your function for manual tracing:

        def answer_question(datapoint: Dict[str, Any], tracer: HoneyHiveTracer) -> Dict[str, Any]:
            # Use tracer for custom spans, enrichment, etc.
            pass
    """
    inputs = datapoint.get("inputs", {})
    question = inputs.get("question", "")

    # Simple logic: check for keywords
    # (In real use, you'd call an LLM here)
    if "capital" in question.lower() and "france" in question.lower():
        answer = "Paris"
    elif "2+2" in question:
        answer = "4"
    elif "color" in question.lower() and "sky" in question.lower():
        answer = "blue"
    else:
        answer = "I don't know"

    return {
        "answer": answer,
        "confidence": "high" if answer != "I don't know" else "low"
    }

Note

This example uses simple logic for demonstration. In a real experiment, you’d call an LLM API (OpenAI, Anthropic, etc.) inside this function.

5.6. Step 3: Create Your Test Dataset

Define a dataset with questions and expected answers:

dataset = [
    {
        "inputs": {
            "question": "What is the capital of France?"
        },
        "ground_truth": {
            "answer": "Paris",
            "category": "geography"
        }
    },
    {
        "inputs": {
            "question": "What is 2+2?"
        },
        "ground_truth": {
            "answer": "4",
            "category": "math"
        }
    },
    {
        "inputs": {
            "question": "What color is the sky?"
        },
        "ground_truth": {
            "answer": "blue",
            "category": "science"
        }
    }
]

Understanding the Structure:

  • inputs: What your function receives

  • ground_truth: The expected correct answers (used for evaluation)
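Because both evaluate() and your evaluators read these two keys, a malformed datapoint fails late and confusingly. A quick stdlib-only shape check before a run can catch this early (illustrative sketch, not a HoneyHive API; the two datapoints here are sample data):

```python
# Validate that every datapoint carries the keys the experiment expects.
dataset = [
    {"inputs": {"question": "What is 2+2?"}, "ground_truth": {"answer": "4"}},
    {"inputs": {"question": "What color is the sky?"}, "ground_truth": {"answer": "blue"}},
]

for i, datapoint in enumerate(dataset):
    assert isinstance(datapoint.get("inputs"), dict), f"datapoint {i}: 'inputs' must be a dict"
    assert isinstance(datapoint.get("ground_truth"), dict), f"datapoint {i}: 'ground_truth' must be a dict"

print("Dataset structure OK")
```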

5.7. Step 4: Run Your Experiment

Now run the experiment:

result = evaluate(
    function=answer_question,
    dataset=dataset,
    name="qa-baseline-v1",
    verbose=True  # Show progress
)





print(f"\n✅ Experiment complete!")
print(f"📊 Run ID: {result.run_id}")
print(f"📈 Status: {result.status}")

Run it:

python my_first_experiment.py

Expected Output:

Processing datapoint 1/3...
Processing datapoint 2/3...
Processing datapoint 3/3...

✅ Experiment complete!
📊 Run ID: run_abc123...
📈 Status: completed

5.8. Step 5: View Results in Dashboard

  1. Go to HoneyHive Experiments Dashboard

  2. Navigate to your project: experiments-tutorial

  3. Find your run: qa-baseline-v1

  4. Click to view:

     • Session traces for each question

     • Function outputs

     • Ground truths

     • Session metadata

What You’ll See:

  • 3 sessions (one per datapoint)

  • Each session shows inputs and outputs

  • Ground truths displayed for comparison

  • Session names include your experiment name

5.9. Step 6: Add Evaluators for Automated Scoring

Viewing results manually is helpful, but let’s add evaluators to automatically score our function’s outputs:

def exact_match_evaluator(
    outputs: Dict[str, Any],
    inputs: Dict[str, Any],
    ground_truth: Dict[str, Any]
) -> float:
    """Check if answer exactly matches ground truth.

    Args:
        outputs: Function's output (from answer_question)
        inputs: Original inputs (not used here)
        ground_truth: Expected outputs

    Returns:
        1.0 if exact match, 0.0 otherwise
    """
    actual_answer = outputs.get("answer", "").lower().strip()
    expected_answer = ground_truth.get("answer", "").lower().strip()

    return 1.0 if actual_answer == expected_answer else 0.0


def confidence_evaluator(
    outputs: Dict[str, Any],
    inputs: Dict[str, Any],
    ground_truth: Dict[str, Any]
) -> float:
    """Check if confidence is appropriate.

    Returns:
        1.0 if high confidence, 0.5 if low confidence
    """
    confidence = outputs.get("confidence", "low")
    return 1.0 if confidence == "high" else 0.5

Understanding Evaluators:

  • Input: Receives (outputs, inputs, ground_truth)

  • Output: Returns a score (typically 0.0 to 1.0)

  • Purpose: Automated quality assessment

  • Runs: After function executes, for each datapoint
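Because an evaluator is just a plain function of (outputs, inputs, ground_truth), you can spot-check it locally before wiring it into evaluate(). A quick sanity check (restating exact_match_evaluator from above so the snippet runs standalone):

```python
from typing import Any, Dict

def exact_match_evaluator(
    outputs: Dict[str, Any],
    inputs: Dict[str, Any],
    ground_truth: Dict[str, Any],
) -> float:
    """Same evaluator as above: 1.0 on a case-insensitive exact match."""
    actual = outputs.get("answer", "").lower().strip()
    expected = ground_truth.get("answer", "").lower().strip()
    return 1.0 if actual == expected else 0.0

# Case and surrounding whitespace are normalized away, so this scores 1.0
score = exact_match_evaluator(
    outputs={"answer": " PARIS "},
    inputs={"question": "What is the capital of France?"},
    ground_truth={"answer": "Paris"},
)
print(score)  # 1.0
```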

5.10. Step 7: Run Experiment with Evaluators

Now run the experiment with evaluators:

result = evaluate(
    function=answer_question,
    dataset=dataset,
    evaluators=[exact_match_evaluator, confidence_evaluator],  # Added!
    name="qa-baseline-with-metrics-v1",
    verbose=True
)

print(f"\n✅ Experiment complete!")
print(f"📊 Run ID: {result.run_id}")
print(f"📈 Status: {result.status}")

# Access metrics
if result.metrics:
    print(f"\n📊 Aggregated Metrics:")
    # Metrics stored in model_extra for Pydantic v2
    extra_fields = getattr(result.metrics, "model_extra", None) or {}
    for metric_name, metric_value in extra_fields.items():
        print(f"   {metric_name}: {metric_value:.2f}")

Expected Output:

Processing datapoint 1/3...
Processing datapoint 2/3...
Processing datapoint 3/3...
Running evaluators...

✅ Experiment complete!
📊 Run ID: run_xyz789...
📈 Status: completed

📊 Aggregated Metrics:
   exact_match_evaluator: 1.00
   confidence_evaluator: 1.00

5.11. Step 8: View Metrics in Dashboard

Go back to the HoneyHive dashboard:

  1. Find your new run: qa-baseline-with-metrics-v1

  2. Click to view details

  3. You’ll now see:

     • Metrics tab: Aggregated scores

     • Per-datapoint metrics: Individual scores

     • Metric trends: Compare across runs

What You’ll See:

  • Exact match score: 100% (3/3 correct)

  • Confidence score: 100% (all high confidence)

  • Metrics visualized as charts

  • Per-session metrics in session details

5.12. Step 9: Test an Improvement

Let’s test an improved version WITH evaluators:

def answer_question_improved(datapoint: Dict[str, Any]) -> Dict[str, Any]:
    """Improved version with better logic."""
    inputs = datapoint.get("inputs", {})
    question = inputs.get("question", "").lower()

    # More sophisticated keyword matching
    answers = {
        "capital of france": "Paris",
        "2+2": "4",
        "color of the sky": "blue",
        "color is the sky": "blue"
    }

    # Check each pattern
    for pattern, ans in answers.items():
        if all(word in question for word in pattern.split()):
            return {"answer": ans, "confidence": "high"}

    return {"answer": "I don't know", "confidence": "low"}

# Run improved version WITH EVALUATORS
result_v2 = evaluate(
    function=answer_question_improved,
    dataset=dataset,
    evaluators=[exact_match_evaluator, confidence_evaluator],  # Same evaluators!
    name="qa-improved-with-metrics-v1",
    verbose=True
)

print(f"\n✅ Improved version complete!")
print(f"📊 Run ID: {result_v2.run_id}")

# Compare metrics
if result_v2.metrics:
    print(f"\n📊 Metrics:")
    extra_fields = getattr(result_v2.metrics, "model_extra", None) or {}
    for metric_name, metric_value in extra_fields.items():
        print(f"   {metric_name}: {metric_value:.2f}")

Now you have TWO runs to compare!

Compare in the Dashboard OR via API:

Note

HoneyHive vs HoneyHiveTracer: HoneyHiveTracer (used in previous tutorials) handles tracing and observability. HoneyHive is the API client for managing HoneyHive resources like experiment results, datasets, and projects.

# Option 1: View comparison in HoneyHive dashboard (visual)
# Go to: https://app.honeyhive.ai/evaluate → Select runs → Click Compare

# Option 2: Programmatic comparison via API
from honeyhive.experiments import compare_runs
from honeyhive import HoneyHive

client = HoneyHive(api_key=os.environ["HH_API_KEY"])
comparison = compare_runs(
    client=client,
    new_run_id=result_v2.run_id,
    old_run_id=result.run_id
)

print(f"\nProgrammatic Comparison:")
print(f"Common datapoints: {comparison.common_datapoints}")
print(f"Improved metrics: {comparison.list_improved_metrics()}")
print(f"Degraded metrics: {comparison.list_degraded_metrics()}")

# Access detailed metric deltas
for metric_name, delta in comparison.metric_deltas.items():
    old_val = delta.get("old_aggregate", 0)
    new_val = delta.get("new_aggregate", 0)
    change = new_val - old_val
    print(f"{metric_name}: {old_val:.2f} → {new_val:.2f} ({change:+.2f})")

Tip

Use both approaches:

  • Dashboard for visual exploration and sharing with team

  • API for automated decision-making and CI/CD pipelines
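For the CI/CD case, the essence of an automated gate is comparing aggregate scores per metric and failing the pipeline when anything degraded. A stdlib-only sketch of that logic (the helper name and the metric values here are made up for illustration; in practice you would feed it aggregates obtained from compare_runs as shown above):

```python
# Classify each metric shared by two runs as improved or degraded.
def classify_metric_deltas(old: dict, new: dict) -> tuple:
    improved, degraded = [], []
    for name in sorted(old.keys() & new.keys()):
        delta = new[name] - old[name]
        if delta > 0:
            improved.append(name)
        elif delta < 0:
            degraded.append(name)
    return improved, degraded

old_metrics = {"exact_match_evaluator": 0.67, "confidence_evaluator": 1.00}
new_metrics = {"exact_match_evaluator": 1.00, "confidence_evaluator": 0.83}
improved, degraded = classify_metric_deltas(old_metrics, new_metrics)
print(improved)  # ['exact_match_evaluator']
print(degraded)  # ['confidence_evaluator']

# A CI gate would then fail the build when `degraded` is non-empty,
# e.g. raise SystemExit(1).
```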

5.13. What You’ve Learned

Congratulations! You’ve:

✅ Created your first evaluation function

✅ Structured test data with inputs and ground truths

✅ Created evaluators to automatically score outputs

✅ Run experiments with evaluate() and evaluators

✅ Viewed results and metrics in HoneyHive dashboard

✅ Compared runs using both dashboard and API

Key Concepts:

  • Evaluation Function: Your application logic under test

  • Dataset: Test cases with inputs and ground truths

  • Evaluators: Automated scoring functions

  • Metrics: Quantitative measurements of quality

  • Comparison: Compare runs via dashboard (visual) or API (programmatic)

5.14. Next Steps

Now that you understand the basics, you’re ready to move on to more advanced experiment workflows.

5.15. Complete Code

Here’s the complete code from this tutorial:

# my_first_experiment.py
import os
from typing import Any, Dict
from honeyhive.experiments import evaluate


os.environ["HH_API_KEY"] = "your-api-key-here"
os.environ["HH_PROJECT"] = "experiments-tutorial"


def answer_question(datapoint: Dict[str, Any]) -> Dict[str, Any]:
    """Answer a trivia question."""
    inputs = datapoint.get("inputs", {})
    question = inputs.get("question", "")


    if "capital" in question.lower() and "france" in question.lower():
        answer = "Paris"
    elif "2+2" in question:
        answer = "4"
    elif "color" in question.lower() and "sky" in question.lower():
        answer = "blue"
    else:
        answer = "I don't know"


    return {"answer": answer, "confidence": "high" if answer != "I don't know" else "low"}


dataset = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "ground_truth": {"answer": "Paris"}
    },
    {
        "inputs": {"question": "What is 2+2?"},
        "ground_truth": {"answer": "4"}
    },
    {
        "inputs": {"question": "What color is the sky?"},
        "ground_truth": {"answer": "blue"}
    }
]


# Define evaluators
def exact_match_evaluator(
    outputs: Dict[str, Any],
    inputs: Dict[str, Any],
    ground_truth: Dict[str, Any]
) -> float:
    """Check if answer exactly matches ground truth."""
    actual = outputs.get("answer", "").lower().strip()
    expected = ground_truth.get("answer", "").lower().strip()
    return 1.0 if actual == expected else 0.0


def confidence_evaluator(
    outputs: Dict[str, Any],
    inputs: Dict[str, Any],
    ground_truth: Dict[str, Any]
) -> float:
    """Check if confidence is appropriate."""
    confidence = outputs.get("confidence", "low")
    return 1.0 if confidence == "high" else 0.5


# Run experiment with evaluators
result = evaluate(
    function=answer_question,
    dataset=dataset,
    evaluators=[exact_match_evaluator, confidence_evaluator],
    name="qa-baseline-with-metrics-v1",
    verbose=True
)


print(f"\n✅ Experiment complete! Run ID: {result.run_id}")


# Print metrics
if result.metrics:
    print(f"\n📊 Metrics:")
    extra_fields = getattr(result.metrics, "model_extra", None) or {}
    for metric_name, metric_value in extra_fields.items():
        print(f"   {metric_name}: {metric_value:.2f}")