Running Experiments
How do I run experiments to test my LLM application?
Use the evaluate() function to run your application across a dataset and track results.
What’s the simplest way to run an experiment?
Three-Step Pattern
Changed in version 1.0: Function signature changed from (inputs, ground_truth) to (datapoint: Dict[str, Any]).
from typing import Any, Dict
from honeyhive.experiments import evaluate
# Step 1: Define your function
def my_llm_app(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Your application logic.
Args:
datapoint: Contains 'inputs' and 'ground_truth'
Returns:
Dictionary with your function's outputs
"""
inputs = datapoint.get("inputs", {})
result = call_llm(inputs["prompt"])
return {"answer": result}
# Step 2: Create dataset
dataset = [
{
"inputs": {"prompt": "What is AI?"},
"ground_truth": {"answer": "Artificial Intelligence..."}
}
]
# Step 3: Run experiment
result = evaluate(
function=my_llm_app,
dataset=dataset,
api_key="your-api-key",
project="your-project",
name="My Experiment v1"
)
print(f"✅ Run ID: {result.run_id}")
print(f"✅ Status: {result.status}")
Important
Think of Your Evaluation Function as a Scaffold
The evaluation function’s job is to take datapoints from your dataset and convert them into the right format to invoke your main AI processing functions. It’s a thin adapter layer that:
Extracts inputs from the datapoint
Calls your actual application logic (call_llm, process_query, rag_pipeline, etc.)
Returns the results in a format that evaluators can use
Keep the evaluation function simple - the real logic lives in your application functions.
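For example, a minimal adapter might look like the sketch below, where rag_pipeline is a placeholder for your own application code (it is not part of the SDK):
from typing import Any, Dict

def eval_adapter(datapoint: Dict[str, Any]) -> Dict[str, Any]:
    """Thin scaffold: map the datapoint onto your app, then repackage the result."""
    inputs = datapoint.get("inputs", {})
    # Delegate to the real application logic (placeholder function)
    answer, sources = rag_pipeline(
        query=inputs["query"],
        top_k=inputs.get("top_k", 5),
    )
    # Return everything evaluators will need to score this run
    return {"answer": answer, "sources": sources}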
How should I structure my test data?
Use inputs + ground_truth Pattern
Each datapoint in your dataset should have:
{
"inputs": {
# Parameters passed to your function
"query": "user question",
"context": "additional info",
"model": "gpt-4"
},
"ground_truth": {
# Expected outputs (optional but recommended)
"answer": "expected response",
"category": "classification",
"score": 0.95
}
}
Complete Example:
dataset = [
{
"inputs": {
"question": "What is the capital of France?",
"language": "English"
},
"ground_truth": {
"answer": "Paris",
"confidence": "high"
}
},
{
"inputs": {
"question": "What is 2+2?",
"language": "English"
},
"ground_truth": {
"answer": "4",
"confidence": "absolute"
}
}
]
What signature must my function have?
Accept datapoint Parameter (v1.0)
Changed in version 1.0: Function signature changed from (inputs, ground_truth) to (datapoint: Dict[str, Any]).
Your function MUST accept a datapoint parameter, and can optionally accept a tracer parameter:
from typing import Any, Dict
from honeyhive import HoneyHiveTracer
# Option 1: Basic signature (datapoint only)
def my_function(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Your evaluation function.
Args:
datapoint: Dictionary with 'inputs' and 'ground_truth' keys
Returns:
dict: Your function's output
"""
# Extract inputs and ground_truth
inputs = datapoint.get("inputs", {})
ground_truth = datapoint.get("ground_truth", {})
# Access input parameters
user_query = inputs.get("question")
language = inputs.get("language", "English")
# ground_truth available but typically not used in function
# (used by evaluators for scoring)
# Your logic
result = process_query(user_query, language)
# Return dict
return {"answer": result, "metadata": {...}}
# Option 2: With tracer parameter (for advanced tracing)
def my_function_with_tracer(
datapoint: Dict[str, Any],
tracer: HoneyHiveTracer # Optional - auto-injected by evaluate()
) -> Dict[str, Any]:
"""Evaluation function with tracer access.
Args:
datapoint: Dictionary with 'inputs' and 'ground_truth' keys
tracer: HoneyHiveTracer instance (optional, auto-provided)
Returns:
dict: Your function's output
"""
inputs = datapoint.get("inputs", {})
# Use tracer for enrichment
tracer.enrich_session(metadata={"user_id": inputs.get("user_id")})
result = process_query(inputs["question"])
return {"answer": result}
Important
Required Parameters:
Accept datapoint: Dict[str, Any] as the first parameter (required)
Optional Parameters:
Accept tracer: HoneyHiveTracer as the second parameter (optional - auto-injected by evaluate())
Requirements:
Extract inputs with datapoint.get("inputs", {})
Extract ground_truth with datapoint.get("ground_truth", {})
Return value should be a dictionary
Type hints are strongly recommended
Backward Compatibility (Deprecated):
Deprecated since version 1.0: The old (inputs, ground_truth) signature is deprecated but still supported
for backward compatibility. It will be removed in v2.0.
# ⚠️ Deprecated: Old signature (still works in v1.0)
def old_style_function(inputs, ground_truth):
# This still works but will be removed in v2.0
return {"output": inputs["query"]}
# ✅ Recommended: New signature (v1.0+)
def new_style_function(datapoint: Dict[str, Any]) -> Dict[str, Any]:
inputs = datapoint.get("inputs", {})
return {"output": inputs["query"]}
Can I use async functions with evaluate()?
Added in version 1.0: The evaluate() function now supports async functions.
Yes! Async functions are fully supported.
If your application uses async operations (like async LLM clients), you can pass an async function directly to evaluate(). Async functions are automatically detected and executed correctly.
from typing import Any, Dict
from honeyhive import HoneyHiveTracer
from honeyhive.experiments import evaluate
# Option 1: Basic async function
async def my_async_function(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Async evaluation function.
Args:
datapoint: Dictionary with 'inputs' and 'ground_truth' keys
Returns:
dict: Your function's output
"""
inputs = datapoint.get("inputs", {})
# Use async operations (e.g., async LLM client)
result = await async_llm_call(inputs["prompt"])
return {"answer": result}
# Option 2: Async function with tracer parameter
async def my_async_function_with_tracer(
datapoint: Dict[str, Any],
tracer: HoneyHiveTracer
) -> Dict[str, Any]:
"""Async evaluation function with tracer access.
Args:
datapoint: Dictionary with 'inputs' and 'ground_truth' keys
tracer: HoneyHiveTracer instance (auto-injected)
Returns:
dict: Your function's output
"""
inputs = datapoint.get("inputs", {})
# Use tracer for enrichment
tracer.enrich_session(metadata={"async": True})
# Use async operations
result = await async_llm_call(inputs["prompt"])
return {"answer": result}
# Run experiment with async function - works the same as sync!
result = evaluate(
function=my_async_function,
dataset=dataset,
api_key="your-api-key",
project="your-project",
name="Async Experiment v1"
)
Note
How it works:
Async functions are automatically detected using asyncio.iscoroutinefunction()
Each datapoint is processed in a separate thread using ThreadPoolExecutor
Async functions are executed with asyncio.run() inside each worker thread
Both sync and async functions work seamlessly with the optional tracer parameter (a sketch of this dispatch pattern follows after this note)
When to use async functions:
When using async LLM clients (e.g., openai.AsyncOpenAI)
When making concurrent API calls within your function
When your existing application code is already async
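The following is an illustrative sketch of the dispatch pattern described in the note above, not the SDK's actual implementation; it shows how sync and async evaluation functions can be run uniformly from worker threads:
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

def _run_one(func: Callable, datapoint: Dict[str, Any]) -> Dict[str, Any]:
    """Execute one datapoint, handling sync and async functions alike."""
    if asyncio.iscoroutinefunction(func):
        # Each worker thread runs its own event loop via asyncio.run()
        return asyncio.run(func(datapoint))
    return func(datapoint)

def _run_all(func: Callable, dataset: List[Dict[str, Any]], max_workers: int = 10) -> List[Dict[str, Any]]:
    """Fan datapoints out across a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda dp: _run_one(func, dp), dataset))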
How do I use ground_truth from datapoints in my experiments?
Client-Side vs Server-Side Evaluators
The ground_truth from your datapoints can be used by evaluators to measure quality. Choose between client-side or server-side evaluation based on your architecture.
Client-Side Evaluators (Recommended)
Pass data down to the evaluation function so it’s available for client-side evaluators:
from typing import Any, Dict
from honeyhive.experiments import evaluate
def my_llm_app(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Evaluation function that passes through data for evaluators."""
inputs = datapoint.get("inputs", {})
ground_truth = datapoint.get("ground_truth", {})
# Call your LLM
result = call_llm(inputs["prompt"])
# Return outputs AND pass through ground_truth for evaluators
return {
"answer": result,
"ground_truth": ground_truth, # Make available to evaluators
"intermediate_steps": [...] # Any other data for evaluation
}
# Your evaluator receives both the output and datapoint context
def accuracy_evaluator(output: Dict[str, Any], datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Client-side evaluator with access to ground truth."""
predicted = output["answer"]
expected = output["ground_truth"]["answer"] # From evaluation function output
is_correct = predicted.lower() == expected.lower()
return {
"score": 1.0 if is_correct else 0.0,
"metadata": {"predicted": predicted, "expected": expected}
}
# Run evaluation with client-side evaluator
result = evaluate(
function=my_llm_app,
dataset=dataset,
evaluators=[accuracy_evaluator],
name="Accuracy Test"
)
Note
When to Use Client-Side Evaluators
Simple, self-contained evaluation logic
Evaluators that need access to intermediate steps
When you can easily pass data through the evaluation function
Faster feedback (no roundtrip to HoneyHive)
Server-Side Evaluators
For complex applications where it’s hard to pass intermediate steps, use enrich_session() to bring data up to the session level:
from typing import Any, Dict
from honeyhive import HoneyHiveTracer
from honeyhive.experiments import evaluate
def complex_app(datapoint: Dict[str, Any], tracer: HoneyHiveTracer) -> Dict[str, Any]:
"""Complex app with hard-to-pass intermediate steps."""
inputs = datapoint.get("inputs", {})
# Step 1: Document retrieval (deep in call stack)
docs = retrieve_documents(inputs["query"])
# Step 2: LLM call (deep in another function)
result = generate_answer(inputs["query"], docs)
# Instead of threading data through complex call stacks,
# use enrich_session to make it available at session level
tracer.enrich_session(
outputs={
"answer": result,
"retrieved_docs": docs,
"doc_count": len(docs)
},
metadata={
"ground_truth": datapoint.get("ground_truth", {}),
"experiment_version": "v2"
}
)
return {"answer": result}
# Run evaluation - use server-side evaluators in HoneyHive dashboard
result = evaluate(
function=complex_app,
dataset=dataset,
name="Complex App Evaluation"
)
# Then configure server-side evaluators in HoneyHive to compare
# session.outputs.answer against session.metadata.ground_truth.answer
Note
When to Use Server-Side Evaluators
Complex, nested application architectures
Intermediate steps are hard to pass through function calls
Need to evaluate data from multiple spans/sessions together
Want centralized evaluation logic in HoneyHive dashboard
Decision Matrix:
| Scenario | Use Client-Side | Use Server-Side |
|---|---|---|
| Simple function | ✅ Easy to pass data | ❌ Overkill |
| Complex nested calls | ❌ Hard to thread data | ✅ Use enrich_session |
| Evaluation speed | ✅ Faster (local) | ⚠️ Slower (API roundtrip) |
| Centralized logic | ❌ In code | ✅ In dashboard |
| Team collaboration | ⚠️ Requires code changes | ✅ No code changes needed |
How do I enrich sessions or spans during evaluation?
Added in version 1.0: You can now receive a tracer parameter in your evaluation function.
Use the tracer Parameter for Advanced Tracing
If your function needs to enrich sessions or use the tracer instance,
add a tracer parameter to your function signature:
from typing import Any, Dict
from honeyhive import HoneyHiveTracer
from honeyhive.experiments import evaluate
def my_function(
datapoint: Dict[str, Any],
tracer: HoneyHiveTracer # Optional tracer parameter
) -> Dict[str, Any]:
"""Function with tracer access.
Args:
datapoint: Test data with 'inputs' and 'ground_truth'
tracer: HoneyHiveTracer instance (auto-injected)
Returns:
Function outputs
"""
inputs = datapoint.get("inputs", {})
# Enrich the session with metadata
tracer.enrich_session(
metadata={"experiment_version": "v2", "user_id": "test-123"}
)
# Call your application logic - enrich_span happens inside
result = process_query(inputs["query"], tracer)
return {"answer": result}
def process_query(query: str, tracer: HoneyHiveTracer) -> str:
"""Application logic that enriches spans.
Call enrich_span from within your actual processing functions,
not directly in the evaluation function.
"""
# Do some processing
result = call_llm(query)
# Enrich the span with metrics from within this function
tracer.enrich_span(
metrics={"processing_time": 0.5, "token_count": 150},
metadata={"model": "gpt-4", "temperature": 0.7}
)
return result
# The tracer is automatically provided by evaluate()
result = evaluate(
function=my_function,
dataset=dataset,
name="experiment-v1"
)
Important
The tracer parameter is optional - only add it if needed
The tracer is automatically injected by evaluate()
Use it to call enrich_session() or access the tracer instance
Each datapoint gets its own tracer instance (multi-instance architecture)
Without tracer parameter (simpler):
def simple_function(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Function without tracer access."""
inputs = datapoint.get("inputs", {})
return {"answer": process_query(inputs["query"])}
How do I trace third-party library calls in my evaluation?
Added in version 1.0: The evaluate() function now supports the instrumentors parameter.
Use the instrumentors Parameter for Automatic Tracing
If your evaluation function uses third-party libraries (OpenAI, Anthropic, Google ADK, LangChain, etc.), you can automatically trace their calls by passing instrumentor factory functions:
from typing import Any, Dict
import openai
from honeyhive.experiments import evaluate
from openinference.instrumentation.openai import OpenAIInstrumentor
def my_function(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Evaluation function using OpenAI."""
inputs = datapoint.get("inputs", {})
# OpenAI calls will be automatically traced
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": inputs["prompt"]}]
)
return {"answer": response.choices[0].message.content}
# Pass instrumentor factories - each datapoint gets its own instance
result = evaluate(
function=my_function,
dataset=dataset,
instrumentors=[lambda: OpenAIInstrumentor()], # Factory function
name="openai-traced-experiment"
)
Important
Why Factory Functions?
The instrumentors parameter accepts factory functions (callables that return instrumentor instances), not instrumentor instances directly. This ensures each datapoint gets its own isolated instrumentor instance, preventing trace routing issues in concurrent processing.
Correct: instrumentors=[lambda: OpenAIInstrumentor()]
Incorrect: instrumentors=[OpenAIInstrumentor()]
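To see why this matters: each call to the factory produces a distinct instrumentor object, so no state is shared across concurrent datapoints. A quick illustrative check (not SDK code):
factory = lambda: OpenAIInstrumentor()
a, b = factory(), factory()
assert a is not b  # two independent instances, one per datapoint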
Multiple Instrumentors:
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.langchain import LangChainInstrumentor
result = evaluate(
function=my_function,
dataset=dataset,
instrumentors=[
lambda: OpenAIInstrumentor(),
lambda: LangChainInstrumentor(),
],
name="multi-instrumented-experiment"
)
Google ADK Example:
from openinference.instrumentation.google_adk import GoogleADKInstrumentor
from google.adk.agents import Agent
from google.adk.runners import Runner
async def run_adk_agent(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Run Google ADK agent - calls are automatically traced."""
inputs = datapoint.get("inputs", {})
agent = Agent(name="my_agent", model="gemini-2.0-flash", ...)
runner = Runner(agent=agent, ...)
# ADK agent calls will be traced
response = await runner.run_async(...)
return {"response": response}
result = evaluate(
function=run_adk_agent,
dataset=dataset,
instrumentors=[lambda: GoogleADKInstrumentor()],
name="adk-agent-evaluation"
)
Note
How it works:
Each datapoint gets its own tracer instance (multi-instance architecture)
For each datapoint, the SDK creates fresh instrumentor instances from your factories
Instrumentors are configured with the datapoint’s tracer provider via instrumentor.instrument(tracer_provider=tracer.provider)
This ensures all traces from that datapoint are routed to the correct session (see the conceptual sketch after this note)
Supported Instrumentors:
Any OpenInference-compatible instrumentor works with this pattern:
openinference.instrumentation.openai.OpenAIInstrumentor
openinference.instrumentation.anthropic.AnthropicInstrumentor
openinference.instrumentation.google_adk.GoogleADKInstrumentor
openinference.instrumentation.langchain.LangChainInstrumentor
openinference.instrumentation.llama_index.LlamaIndexInstrumentor
And many more…
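Conceptually, the per-datapoint wiring described above looks roughly like the sketch below. This is not the SDK's actual code; make_tracer_for and run_user_function are hypothetical placeholders used only to show the order of operations:
for datapoint in dataset:
    tracer = make_tracer_for(datapoint)        # hypothetical helper: one tracer per datapoint
    for factory in instrumentor_factories:
        instrumentor = factory()               # fresh instance from your factory
        instrumentor.instrument(tracer_provider=tracer.provider)
    run_user_function(datapoint, tracer)       # hypothetical helper: invokes your evaluation function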
My experiments are too slow on large datasets
Use max_workers for Parallel Processing
# Slow: Sequential processing (default)
result = evaluate(
function=my_function,
dataset=large_dataset, # 1000 items
api_key="your-api-key",
project="your-project"
)
# Takes: ~1000 seconds if each item takes 1 second
# Fast: Parallel processing
result = evaluate(
function=my_function,
dataset=large_dataset, # 1000 items
max_workers=20, # Process 20 items simultaneously
api_key="your-api-key",
project="your-project"
)
# Takes: ~50 seconds (20x faster)
Choosing max_workers:
# Conservative (good for API rate limits)
max_workers=5
# Balanced (good for most cases)
max_workers=10
# Aggressive (fast but watch rate limits)
max_workers=20
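If you want a starting point rather than a guess, a rough sizing heuristic (our assumption, not an SDK rule) is to keep the steady-state request rate under your provider's rate limit:
rpm_limit = 600        # requests per minute allowed by your LLM provider
avg_latency_s = 2.0    # average seconds to process one datapoint
# With N workers each taking avg_latency_s, throughput ≈ 60 * N / avg_latency_s requests per minute
max_safe_workers = int(rpm_limit * avg_latency_s / 60)  # ≈ 20 here
max_workers = min(20, max_safe_workers)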
How do I avoid hardcoding credentials?
Use Environment Variables
import os
# Set environment variables
os.environ["HH_API_KEY"] = "your-api-key"
os.environ["HH_PROJECT"] = "your-project"
# Now you can omit api_key and project
result = evaluate(
function=my_function,
dataset=dataset,
name="Experiment v1"
)
Or use a .env file:
# .env file
HH_API_KEY=your-api-key
HH_PROJECT=your-project
HH_SOURCE=dev # Optional: environment identifier
from dotenv import load_dotenv
load_dotenv()
# Credentials loaded automatically
result = evaluate(
function=my_function,
dataset=dataset,
name="Experiment v1"
)
How should I name my experiments?
Use Descriptive, Versioned Names
# ❌ Bad: Generic names
name="test"
name="experiment"
name="run1"
# ✅ Good: Descriptive names
name="gpt-3.5-baseline-v1"
name="improved-prompt-v2"
name="rag-with-reranking-v1"
name="production-candidate-2024-01-15"
Naming Convention:
# Format: {change-description}-{version}
evaluate(
function=baseline_function,
dataset=dataset,
name="gpt-3.5-baseline-v1",
api_key="your-api-key",
project="your-project"
)
evaluate(
function=improved_function,
dataset=dataset,
name="gpt-4-improved-v1", # Easy to compare
api_key="your-api-key",
project="your-project"
)
How do I access experiment results in code?
Use the Returned EvaluationResult Object
result = evaluate(
function=my_function,
dataset=dataset,
api_key="your-api-key",
project="your-project"
)
# Access run information
print(f"Run ID: {result.run_id}")
print(f"Status: {result.status}")
print(f"Dataset ID: {result.dataset_id}")
# Access session IDs (one per datapoint)
print(f"Session IDs: {result.session_ids}")
# Access evaluation data
print(f"Results: {result.data}")
# Export to JSON
result.to_json() # Saves to {suite_name}.json
I want to see what’s happening during evaluation
Enable Verbose Output
result = evaluate(
function=my_function,
dataset=dataset,
verbose=True, # Show progress
api_key="your-api-key",
project="your-project"
)
# Output:
# Processing datapoint 1/10...
# Processing datapoint 2/10...
# ...
Show me a complete real-world example
Question Answering Pipeline (v1.0)
from typing import Any, Dict
from honeyhive.experiments import evaluate
import openai
import os
# Setup
os.environ["HH_API_KEY"] = "your-honeyhive-key"
os.environ["HH_PROJECT"] = "qa-system"
os.environ["OPENAI_API_KEY"] = "your-openai-key"  # picked up by openai.OpenAI()
# Define function to test
def qa_pipeline(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Answer questions using GPT-4.
Args:
datapoint: Contains 'inputs' and 'ground_truth'
Returns:
Dictionary with answer, model, and token count
"""
client = openai.OpenAI()
inputs = datapoint.get("inputs", {})
question = inputs["question"]
context = inputs.get("context", "")
prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0.0
)
return {
"answer": response.choices[0].message.content,
"model": "gpt-4",
"tokens": response.usage.total_tokens
}
# Create test dataset
dataset = [
{
"inputs": {
"question": "What is machine learning?",
"context": "ML is a subset of AI"
},
"ground_truth": {
"answer": "Machine learning is a subset of artificial intelligence..."
}
},
{
"inputs": {
"question": "What is deep learning?",
"context": "DL uses neural networks"
},
"ground_truth": {
"answer": "Deep learning uses neural networks..."
}
}
]
# Run experiment
result = evaluate(
function=qa_pipeline,
dataset=dataset,
name="qa-gpt4-baseline-v1",
max_workers=5,
verbose=True
)
print(f"✅ Experiment complete!")
print(f"📊 Run ID: {result.run_id}")
print(f"🔗 View in dashboard: https://app.honeyhive.ai/projects/qa-system")
See Also
Creating Evaluators - Add metrics to your experiments
Using Datasets in Experiments - Use datasets from HoneyHive UI
Comparing Experiments - Compare multiple experiment runs
Core Functions - Complete evaluate() API reference