Experiments Module

Complete API reference for the HoneyHive experiments framework - evaluate LLM outputs, compare models, and analyze performance at scale.

Note

The experiments module replaces the deprecated evaluation module with improved architecture, better tracer integration, and backend-powered aggregation.

Overview

The experiments module provides a comprehensive framework for:

  • Automated Evaluation: Run custom evaluators against LLM outputs

  • Dataset Management: Support for both external and HoneyHive-managed datasets

  • Results Analysis: Backend-aggregated metrics and comparison tools

  • A/B Testing: Compare multiple experiment runs with detailed metrics

Quick Start

Basic Experiment

from honeyhive.experiments import evaluate, evaluator

@evaluator
def accuracy_check(outputs, inputs, ground_truth):
    """Check if output matches expected result."""
    return {
        "score": 1.0 if outputs == ground_truth else 0.0,
        "passed": outputs == ground_truth
    }

# Run the experiment (my_llm_function is your application code under test)
result = evaluate(
    function=my_llm_function,
    dataset=[
        {"inputs": {"query": "What is 2+2?"}, "ground_truth": {"answer": "4"}},
        {"inputs": {"query": "What is 3+3?"}, "ground_truth": {"answer": "6"}},
    ],
    evaluators=[accuracy_check],
    api_key="your-api-key",
    project="your-project",
    name="accuracy-test"
)

print(f"Success: {result.success}")
print(f"Passed: {result.passed}/{result.passed + result.failed}")

Module Contents

Core Functions

Primary functions for running experiments and managing execution.

Evaluators

Decorator-based evaluator system for defining custom quality checks.

Results

Functions for retrieving and comparing experiment results.

Data Models

Pydantic models for experiment runs, results, and comparisons.

Utilities

Helper functions for dataset preparation and ID generation.
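
For reference, all of the names used in the examples on this page come from the imports below (a summary of what the sections that follow use, not an exhaustive list of the module's exports):

# Names referenced throughout this page
from honeyhive import HoneyHiveTracer
from honeyhive.experiments import (
    evaluate,        # run an experiment over a dataset
    evaluator,       # decorator for synchronous evaluators
    aevaluator,      # decorator for asynchronous evaluators
    get_run_result,  # retrieve aggregated results for a run
    compare_runs,    # compare two experiment runs
)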

Key Concepts

Experiments vs Traces

Traces capture what happened during execution (spans, events, timing).

Experiments evaluate how well it happened (quality, accuracy, performance).

They work together:

from honeyhive import HoneyHiveTracer
from honeyhive.experiments import evaluate, evaluator

# Tracer captures execution details
tracer = HoneyHiveTracer(api_key="key", project="project")

# Evaluator assesses quality (calculate_quality is your own scoring helper)
@evaluator
def quality_check(outputs, inputs, ground_truth):
    return {"score": calculate_quality(outputs, ground_truth)}

# evaluate() runs the function with both tracing and evaluation
result = evaluate(
    function=traced_llm_call,
    dataset=test_cases,
    evaluators=[quality_check],
    api_key="key",
    project="project"
)

External vs Managed Datasets

External Datasets - Your own test data:

# SDK generates EXT- prefixed IDs
result = evaluate(
    function=my_function,
    dataset=[
        {"inputs": {...}, "ground_truth": {...}},
        {"inputs": {...}, "ground_truth": {...}},
    ],
    # ... other params
)

Managed Datasets - Stored in HoneyHive:

# Reference existing dataset by ID
result = evaluate(
    function=my_function,
    dataset_id="dataset-abc-123",
    # ... other params
)

Evaluator Architecture

Evaluators use a modern decorator-based approach rather than class inheritance:

from honeyhive.experiments import evaluator, aevaluator

@evaluator
def sync_evaluator(outputs, inputs, ground_truth):
    """Synchronous evaluator."""
    return {"score": 0.9}

@aevaluator
async def async_evaluator(outputs, inputs, ground_truth):
    """Asynchronous evaluator."""
    result = await external_api_call(outputs)
    return {"score": result.score}

Aggregation & Comparison

The backend handles aggregation automatically:

from honeyhive.experiments import get_run_result, compare_runs

# Get aggregated results (client is an existing HoneyHive client instance)
result = get_run_result(client, run_id="run-123")
print(f"Average score: {result.metrics.get_metric('accuracy')}")

# Compare two runs
comparison = compare_runs(
    client=client,
    new_run_id="run-new",
    old_run_id="run-old"
)

print(f"Common datapoints: {comparison.common_datapoints}")
print(f"Improved metrics: {comparison.list_improved_metrics()}")
print(f"Degraded metrics: {comparison.list_degraded_metrics()}")

Migration from evaluation Module

The evaluation module is deprecated. Migrate to experiments:

Import Changes

# OLD
from honeyhive.evaluation import evaluate, BaseEvaluator

# NEW
from honeyhive.experiments import evaluate, evaluator

Evaluator Pattern Changes

# OLD - Class-based
class MyEvaluator(BaseEvaluator):
    def evaluate(self, inputs, outputs, ground_truth):
        return {"score": 0.9}

# NEW - Decorator-based
@evaluator
def my_evaluator(outputs, inputs, ground_truth):
    return {"score": 0.9}

Function Signature Changes

# OLD
evaluate(
    inputs=inputs,
    outputs=outputs,
    evaluators=[my_evaluator]
)

# NEW
evaluate(
    function=my_function,
    dataset=dataset,
    evaluators=[my_evaluator],
    api_key="key",
    project="project"
)

See Also