5. Tutorial 5: Run Your First Experiment

Note

Tutorial (15-20 minutes)

This is a hands-on tutorial that takes you step-by-step through running your first experiment with HoneyHive. You’ll create a working example and see results in the dashboard.

5.1. What You’ll Learn

By the end of this tutorial, you’ll know how to:

  • Run an experiment with evaluate()

  • Structure test data with inputs and ground truths

  • Create evaluators to automatically score outputs

  • View metrics and scores in HoneyHive dashboard

  • Compare different versions of your function

5.2. What You’ll Build

A complete question-answering experiment with automated evaluation. You’ll:

  1. Create a baseline QA function

  2. Test it against a dataset

  3. Add evaluators to automatically score outputs

  4. Compare baseline vs improved version using metrics

  5. View results and metrics in HoneyHive dashboard

5.3. Prerequisites

Before starting this tutorial, you should:

  • Complete Set Up Your First Tracer

  • Have Python 3.11 or higher installed

  • Have a HoneyHive API key

  • Basic familiarity with Python dictionaries

If you haven’t set up the SDK yet, go back to Tutorial 1.

5.4. Step 1: Install and Setup

First, create a new Python file for this tutorial:

touch my_first_experiment.py

Add the necessary imports and setup:

# my_first_experiment.py
import os
from typing import Any, Dict
from honeyhive.experiments import evaluate

# Set your API key
os.environ["HH_API_KEY"] = "your-api-key-here"
os.environ["HH_PROJECT"] = "experiments-tutorial"

Tip

Store your API key in a .env file instead of hardcoding it. See Production Deployment Guide for production best practices.
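If you'd rather not add a dependency, the idea behind a .env file is simple enough to sketch with the standard library alone. This loader is an illustrative assumption, not part of the HoneyHive SDK; in practice the python-dotenv package (`load_dotenv()`) is the usual choice:

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader (sketch): reads KEY=VALUE lines, skips comments."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault: real environment variables win over .env entries
        os.environ.setdefault(key.strip(), value.strip())

load_env_file()  # looks for ./.env; silently does nothing if absent
```

With a `.env` containing `HH_API_KEY=...` and `HH_PROJECT=...`, the two `os.environ` assignments above become unnecessary.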

5.5. Step 2: Define Your Function

Create a simple function that answers questions. This will be the function we test in our experiment:

def answer_question(datapoint: Dict[str, Any]) -> Dict[str, Any]:
    """Answer a trivia question.

    This is the function we'll test in our experiment.

    Args:
        datapoint: Contains 'inputs' with the question

    Returns:
        Dictionary with the answer

    Note:
        The evaluation function can also accept a 'tracer' parameter if you need
        to access the tracer instance within your function for manual tracing:

        def answer_question(datapoint: Dict[str, Any], tracer: HoneyHiveTracer) -> Dict[str, Any]:
            # Use tracer for custom spans, enrichment, etc.
            pass
    """
    inputs = datapoint.get("inputs", {})
    question = inputs.get("question", "")

    # Simple logic: check for keywords
    # (In real use, you'd call an LLM here)
    if "capital" in question.lower() and "france" in question.lower():
        answer = "Paris"
    elif "2+2" in question:
        answer = "4"
    elif "color" in question.lower() and "sky" in question.lower():
        answer = "blue"
    else:
        answer = "I don't know"

    return {
        "answer": answer,
        "confidence": "high" if answer != "I don't know" else "low"
    }

Note

This example uses simple logic for demonstration. In a real experiment, you’d call an LLM API (OpenAI, Anthropic, etc.) inside this function.

5.6. Step 3: Create Your Test Dataset

Define a dataset with questions and expected answers:

dataset = [
    {
        "inputs": {
            "question": "What is the capital of France?"
        },
        "ground_truth": {
            "answer": "Paris",
            "category": "geography"
        }
    },
    {
        "inputs": {
            "question": "What is 2+2?"
        },
        "ground_truth": {
            "answer": "4",
            "category": "math"
        }
    },
    {
        "inputs": {
            "question": "What color is the sky?"
        },
        "ground_truth": {
            "answer": "blue",
            "category": "science"
        }
    }
]

Understanding the Structure:

  • inputs: What your function receives

  • ground_truth: The expected correct answers (used for evaluation)
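Because both evaluate() and your evaluators read these two keys, a malformed datapoint fails late and confusingly. A quick stdlib-only shape check before a run can catch this early (illustrative sketch, not a HoneyHive API; the two datapoints here are sample data):

```python
# Validate that every datapoint carries the keys the experiment expects.
dataset = [
    {"inputs": {"question": "What is 2+2?"}, "ground_truth": {"answer": "4"}},
    {"inputs": {"question": "What color is the sky?"}, "ground_truth": {"answer": "blue"}},
]

for i, datapoint in enumerate(dataset):
    assert isinstance(datapoint.get("inputs"), dict), f"datapoint {i}: 'inputs' must be a dict"
    assert isinstance(datapoint.get("ground_truth"), dict), f"datapoint {i}: 'ground_truth' must be a dict"

print("Dataset structure OK")
```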

5.7. Step 4: Run Your Experiment

Now run the experiment:

result = evaluate(
    function=answer_question,
    dataset=dataset,
    name="qa-baseline-v1",
    verbose=True  # Show progress
)





print(f"\n✅ Experiment complete!")
print(f"📊 Run ID: {result.run_id}")
print(f"📈 Status: {result.status}")

Run it:

python my_first_experiment.py

Expected Output:

Processing datapoint 1/3...
Processing datapoint 2/3...
Processing datapoint 3/3...

✅ Experiment complete!
📊 Run ID: run_abc123...
📈 Status: completed

5.8. Step 5: View Results in Dashboard

  1. Go to HoneyHive Experiments Dashboard

  2. Navigate to your project: experiments-tutorial

  3. Find your run: qa-baseline-v1

  4. Click to view:

     • Session traces for each question

     • Function outputs

     • Ground truths

     • Session metadata

What You’ll See:

  • 3 sessions (one per datapoint)

  • Each session shows inputs and outputs

  • Ground truths displayed for comparison

  • Session names include your experiment name

5.9. Step 6: Add Evaluators for Automated Scoring

Viewing results manually is helpful, but let’s add evaluators to automatically score our function’s outputs:

def exact_match_evaluator(
    outputs: Dict[str, Any],
    inputs: Dict[str, Any],
    ground_truth: Dict[str, Any]
) -> float:
    """Check if answer exactly matches ground truth.

    Args:
        outputs: Function's output (from answer_question)
        inputs: Original inputs (not used here)
        ground_truth: Expected outputs

    Returns:
        1.0 if exact match, 0.0 otherwise
    """
    actual_answer = outputs.get("answer", "").lower().strip()
    expected_answer = ground_truth.get("answer", "").lower().strip()

    return 1.0 if actual_answer == expected_answer else 0.0


def confidence_evaluator(
    outputs: Dict[str, Any],
    inputs: Dict[str, Any],
    ground_truth: Dict[str, Any]
) -> float:
    """Check if confidence is appropriate.

    Returns:
        1.0 if high confidence, 0.5 if low confidence
    """
    confidence = outputs.get("confidence", "low")
    return 1.0 if confidence == "high" else 0.5

Understanding Evaluators:

  • Input: Receives (outputs, inputs, ground_truth)

  • Output: Returns a score (typically 0.0 to 1.0)

  • Purpose: Automated quality assessment

  • Runs: After function executes, for each datapoint
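Because an evaluator is just a plain function of (outputs, inputs, ground_truth), you can spot-check it locally before wiring it into evaluate(). A quick sanity check (restating exact_match_evaluator from above so the snippet runs standalone):

```python
from typing import Any, Dict

def exact_match_evaluator(
    outputs: Dict[str, Any],
    inputs: Dict[str, Any],
    ground_truth: Dict[str, Any],
) -> float:
    """Same evaluator as above: 1.0 on a case-insensitive exact match."""
    actual = outputs.get("answer", "").lower().strip()
    expected = ground_truth.get("answer", "").lower().strip()
    return 1.0 if actual == expected else 0.0

# Case and surrounding whitespace are normalized away, so this scores 1.0
score = exact_match_evaluator(
    outputs={"answer": " PARIS "},
    inputs={"question": "What is the capital of France?"},
    ground_truth={"answer": "Paris"},
)
print(score)  # 1.0
```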

5.10. Step 7: Run Experiment with Evaluators

Now run the experiment with evaluators:

result = evaluate(
    function=answer_question,
    dataset=dataset,
    evaluators=[exact_match_evaluator, confidence_evaluator],  # Added!
    name="qa-baseline-with-metrics-v1",
    verbose=True
)

print(f"\n✅ Experiment complete!")
print(f"📊 Run ID: {result.run_id}")
print(f"📈 Status: {result.status}")

# Access metrics
if result.metrics:
    print(f"\n📊 Aggregated Metrics:")
    # Metrics stored in model_extra for Pydantic v2
    extra_fields = getattr(result.metrics, "model_extra", None) or {}
    for metric_name, metric_value in extra_fields.items():
        print(f"   {metric_name}: {metric_value:.2f}")

Expected Output:

Processing datapoint 1/3...
Processing datapoint 2/3...
Processing datapoint 3/3...
Running evaluators...

✅ Experiment complete!
📊 Run ID: run_xyz789...
📈 Status: completed

📊 Aggregated Metrics:
   exact_match_evaluator: 1.00
   confidence_evaluator: 1.00

5.11. Step 8: View Metrics in Dashboard

Go back to the HoneyHive dashboard:

  1. Find your new run: qa-baseline-with-metrics-v1

  2. Click to view details

  3. You’ll now see:

     • Metrics tab: Aggregated scores

     • Per-datapoint metrics: Individual scores

     • Metric trends: Compare across runs

What You’ll See:

  • Exact match score: 100% (3/3 correct)

  • Confidence score: 100% (all high confidence)

  • Metrics visualized as charts

  • Per-session metrics in session details

5.12. Step 9: Test an Improvement

Let’s test an improved version WITH evaluators:

def answer_question_improved(datapoint: Dict[str, Any]) -> Dict[str, Any]:
    """Improved version with better logic."""
    inputs = datapoint.get("inputs", {})
    question = inputs.get("question", "").lower()

    # More sophisticated keyword matching
    answers = {
        "capital of france": "Paris",
        "2+2": "4",
        "color of the sky": "blue",
        "color is the sky": "blue"
    }

    # Check each pattern
    for pattern, ans in answers.items():
        if all(word in question for word in pattern.split()):
            return {"answer": ans, "confidence": "high"}

    return {"answer": "I don't know", "confidence": "low"}

# Run improved version WITH EVALUATORS
result_v2 = evaluate(
    function=answer_question_improved,
    dataset=dataset,
    evaluators=[exact_match_evaluator, confidence_evaluator],  # Same evaluators!
    name="qa-improved-with-metrics-v1",
    verbose=True
)

print(f"\n✅ Improved version complete!")
print(f"📊 Run ID: {result_v2.run_id}")

# Compare metrics
if result_v2.metrics:
    print(f"\n📊 Metrics:")
    extra_fields = getattr(result_v2.metrics, "model_extra", None) or {}
    for metric_name, metric_value in extra_fields.items():
        print(f"   {metric_name}: {metric_value:.2f}")

Now you have TWO runs to compare!

Compare in the Dashboard OR via API:

Note

HoneyHive vs HoneyHiveTracer: HoneyHiveTracer (used in previous tutorials) handles tracing and observability. HoneyHive is the API client for managing HoneyHive resources like experiment results, datasets, and projects.

# Option 1: View comparison in HoneyHive dashboard (visual)
# Go to: https://app.honeyhive.ai/evaluate → Select runs → Click Compare

# Option 2: Programmatic comparison via API
from honeyhive.experiments import compare_runs
from honeyhive import HoneyHive

client = HoneyHive(api_key=os.environ["HH_API_KEY"])
comparison = compare_runs(
    client=client,
    new_run_id=result_v2.run_id,
    old_run_id=result.run_id
)

print(f"\nProgrammatic Comparison:")
print(f"Common datapoints: {comparison.common_datapoints}")
print(f"Improved metrics: {comparison.list_improved_metrics()}")
print(f"Degraded metrics: {comparison.list_degraded_metrics()}")

# Access detailed metric deltas
for metric_name, delta in comparison.metric_deltas.items():
    old_val = delta.get("old_aggregate", 0)
    new_val = delta.get("new_aggregate", 0)
    change = new_val - old_val
    print(f"{metric_name}: {old_val:.2f} → {new_val:.2f} ({change:+.2f})")

Tip

Use both approaches:

  • Dashboard for visual exploration and sharing with team

  • API for automated decision-making and CI/CD pipelines
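For the CI/CD case, the essence of an automated gate is comparing aggregate scores per metric and failing the pipeline when anything degraded. A stdlib-only sketch of that logic (the helper name and the metric values here are made up for illustration; in practice you would feed it aggregates obtained from compare_runs as shown above):

```python
# Classify each metric shared by two runs as improved or degraded.
def classify_metric_deltas(old: dict, new: dict) -> tuple:
    improved, degraded = [], []
    for name in sorted(old.keys() & new.keys()):
        delta = new[name] - old[name]
        if delta > 0:
            improved.append(name)
        elif delta < 0:
            degraded.append(name)
    return improved, degraded

old_metrics = {"exact_match_evaluator": 0.67, "confidence_evaluator": 1.00}
new_metrics = {"exact_match_evaluator": 1.00, "confidence_evaluator": 0.83}
improved, degraded = classify_metric_deltas(old_metrics, new_metrics)
print(improved)  # ['exact_match_evaluator']
print(degraded)  # ['confidence_evaluator']

# A CI gate would then fail the build when `degraded` is non-empty,
# e.g. raise SystemExit(1).
```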

5.13. What You’ve Learned

Congratulations! You’ve:

✅ Created your first evaluation function

✅ Structured test data with inputs and ground truths

✅ Created evaluators to automatically score outputs

✅ Run experiments with evaluate() and evaluators

✅ Viewed results and metrics in HoneyHive dashboard

✅ Compared runs using both dashboard and API

Key Concepts:

  • Evaluation Function: Your application logic under test

  • Dataset: Test cases with inputs and ground truths

  • Evaluators: Automated scoring functions

  • Metrics: Quantitative measurements of quality

  • Comparison: Compare runs via dashboard (visual) or API (programmatic)

5.14. Next Steps

Now that you understand the basics, you’re ready to move on to more advanced experiment workflows.

5.15. Complete Code

Here’s the complete code from this tutorial:

# my_first_experiment.py
import os
from typing import Any, Dict
from honeyhive.experiments import evaluate


os.environ["HH_API_KEY"] = "your-api-key-here"
os.environ["HH_PROJECT"] = "experiments-tutorial"


def answer_question(datapoint: Dict[str, Any]) -> Dict[str, Any]:
    """Answer a trivia question."""
    inputs = datapoint.get("inputs", {})
    question = inputs.get("question", "")


    if "capital" in question.lower() and "france" in question.lower():
        answer = "Paris"
    elif "2+2" in question:
        answer = "4"
    elif "color" in question.lower() and "sky" in question.lower():
        answer = "blue"
    else:
        answer = "I don't know"


    return {"answer": answer, "confidence": "high" if answer != "I don't know" else "low"}


dataset = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "ground_truth": {"answer": "Paris"}
    },
    {
        "inputs": {"question": "What is 2+2?"},
        "ground_truth": {"answer": "4"}
    },
    {
        "inputs": {"question": "What color is the sky?"},
        "ground_truth": {"answer": "blue"}
    }
]


# Define evaluators
def exact_match_evaluator(
    outputs: Dict[str, Any],
    inputs: Dict[str, Any],
    ground_truth: Dict[str, Any]
) -> float:
    """Check if answer exactly matches ground truth."""
    actual = outputs.get("answer", "").lower().strip()
    expected = ground_truth.get("answer", "").lower().strip()
    return 1.0 if actual == expected else 0.0


def confidence_evaluator(
    outputs: Dict[str, Any],
    inputs: Dict[str, Any],
    ground_truth: Dict[str, Any]
) -> float:
    """Check if confidence is appropriate."""
    confidence = outputs.get("confidence", "low")
    return 1.0 if confidence == "high" else 0.5


# Run experiment with evaluators
result = evaluate(
    function=answer_question,
    dataset=dataset,
    evaluators=[exact_match_evaluator, confidence_evaluator],
    name="qa-baseline-with-metrics-v1",
    verbose=True
)


print(f"\n✅ Experiment complete! Run ID: {result.run_id}")


# Print metrics
if result.metrics:
    print(f"\n📊 Metrics:")
    extra_fields = getattr(result.metrics, "model_extra", None) or {}
    for metric_name, metric_value in extra_fields.items():
        print(f"   {metric_name}: {metric_value:.2f}")