Experiments Module

Complete API reference for the HoneyHive experiments framework - evaluate LLM outputs, compare models, and analyze performance at scale.

Note

The experiments module replaces the deprecated evaluation module with improved architecture, better tracer integration, and backend-powered aggregation.

Overview

The experiments module provides a comprehensive framework for:

  • Automated Evaluation: Run custom evaluators against LLM outputs

  • Dataset Management: Support for both external and HoneyHive-managed datasets

  • Results Analysis: Backend-aggregated metrics and comparison tools

  • A/B Testing: Compare multiple experiment runs with detailed metrics

Quick Start

Basic Experiment

from honeyhive.experiments import evaluate, evaluator

@evaluator
def accuracy_check(outputs, inputs, ground_truth):
    """Check if output matches expected result."""
    return {
        "score": 1.0 if outputs == ground_truth else 0.0,
        "passed": outputs == ground_truth
    }

# Run the experiment (my_llm_function is your application code under test)
result = evaluate(
    function=my_llm_function,
    dataset=[
        {"inputs": {"query": "What is 2+2?"}, "ground_truth": {"answer": "4"}},
        {"inputs": {"query": "What is 3+3?"}, "ground_truth": {"answer": "6"}},
    ],
    evaluators=[accuracy_check],
    api_key="your-api-key",
    project="your-project",
    name="accuracy-test"
)

print(f"Success: {result.success}")
print(f"Passed: {result.passed}/{result.passed + result.failed}")

Module Contents

Core Functions

Primary functions for running experiments and managing execution.

Evaluators

Decorator-based evaluator system for defining custom quality checks.

Results

Functions for retrieving and comparing experiment results.

Data Models

Pydantic models for experiment runs, results, and comparisons.

Utilities

Helper functions for dataset preparation and ID generation.
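
For reference, all of the names used in the examples on this page come from the imports below (a summary of what the sections that follow use, not an exhaustive list of the module's exports):

# Names referenced throughout this page
from honeyhive import HoneyHiveTracer
from honeyhive.experiments import (
    evaluate,        # run an experiment over a dataset
    evaluator,       # decorator for synchronous evaluators
    aevaluator,      # decorator for asynchronous evaluators
    get_run_result,  # retrieve aggregated results for a run
    compare_runs,    # compare two experiment runs
)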

Key Concepts

Experiments vs Traces

Traces capture what happened during execution (spans, events, timing).

Experiments evaluate how well it happened (quality, accuracy, performance).

They work together:

from honeyhive import HoneyHiveTracer
from honeyhive.experiments import evaluate, evaluator

# Tracer captures execution details
tracer = HoneyHiveTracer(api_key="key", project="project")

# Evaluator assesses quality (calculate_quality is your own scoring helper)
@evaluator
def quality_check(outputs, inputs, ground_truth):
    return {"score": calculate_quality(outputs, ground_truth)}

# evaluate() runs the function with both tracing and evaluation
result = evaluate(
    function=traced_llm_call,
    dataset=test_cases,
    evaluators=[quality_check],
    api_key="key",
    project="project"
)

External vs Managed Datasets

External Datasets - Your own test data:

# SDK generates EXT- prefixed IDs
result = evaluate(
    function=my_function,
    dataset=[
        {"inputs": {...}, "ground_truth": {...}},
        {"inputs": {...}, "ground_truth": {...}},
    ],
    # ... other params
)

Managed Datasets - Stored in HoneyHive:

# Reference existing dataset by ID
result = evaluate(
    function=my_function,
    dataset_id="dataset-abc-123",
    # ... other params
)

Evaluator Architecture

Evaluators use a modern decorator-based approach rather than class inheritance:

from honeyhive.experiments import evaluator, aevaluator

@evaluator
def sync_evaluator(outputs, inputs, ground_truth):
    """Synchronous evaluator."""
    return {"score": 0.9}

@aevaluator
async def async_evaluator(outputs, inputs, ground_truth):
    """Asynchronous evaluator."""
    result = await external_api_call(outputs)
    return {"score": result.score}

Aggregation & Comparison

The backend handles aggregation automatically:

from honeyhive.experiments import get_run_result, compare_runs

# Get aggregated results (client is an existing HoneyHive client instance)
result = get_run_result(client, run_id="run-123")
print(f"Average score: {result.metrics.get_metric('accuracy')}")

# Compare two runs
comparison = compare_runs(
    client=client,
    new_run_id="run-new",
    old_run_id="run-old"
)

print(f"Common datapoints: {comparison.common_datapoints}")
print(f"Improved metrics: {comparison.list_improved_metrics()}")
print(f"Degraded metrics: {comparison.list_degraded_metrics()}")

Migration from evaluation Module

The evaluation module is deprecated. Migrate to experiments:

Import Changes

# OLD
from honeyhive.evaluation import evaluate, BaseEvaluator

# NEW
from honeyhive.experiments import evaluate, evaluator

Evaluator Pattern Changes

# OLD - Class-based
class MyEvaluator(BaseEvaluator):
    def evaluate(self, inputs, outputs, ground_truth):
        return {"score": 0.9}

# NEW - Decorator-based
@evaluator
def my_evaluator(outputs, inputs, ground_truth):
    return {"score": 0.9}

Function Signature Changes

# OLD
evaluate(
    inputs=inputs,
    outputs=outputs,
    evaluators=[my_evaluator]
)

# NEW
evaluate(
    function=my_function,
    dataset=dataset,
    evaluators=[my_evaluator],
    api_key="key",
    project="project"
)

See Also