Running Experiments
How do I run experiments to test my LLM application?
Use the evaluate() function to run your application across a dataset and track results.
What’s the simplest way to run an experiment?
Three-Step Pattern
Changed in version 1.0: Function signature changed from (inputs, ground_truth) to (datapoint: Dict[str, Any]).
from typing import Any, Dict
from honeyhive.experiments import evaluate
# Step 1: Define your function
def my_llm_app(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Your application logic.
Args:
datapoint: Contains 'inputs' and 'ground_truth'
Returns:
Dictionary with your function's outputs
"""
inputs = datapoint.get("inputs", {})
result = call_llm(inputs["prompt"])
return {"answer": result}
# Step 2: Create dataset
dataset = [
{
"inputs": {"prompt": "What is AI?"},
"ground_truth": {"answer": "Artificial Intelligence..."}
}
]
# Step 3: Run experiment
result = evaluate(
function=my_llm_app,
dataset=dataset,
api_key="your-api-key",
project="your-project",
name="My Experiment v1"
)
print(f"✅ Run ID: {result.run_id}")
print(f"✅ Status: {result.status}")
Important
Think of Your Evaluation Function as a Scaffold
The evaluation function’s job is to take datapoints from your dataset and convert them into the right format to invoke your main AI processing functions. It’s a thin adapter layer that:
Extracts inputs from the datapoint
Calls your actual application logic (call_llm, process_query, rag_pipeline, etc.)
Returns the results in a format that evaluators can use
Keep the evaluation function simple - the real logic lives in your application functions.
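For example, a minimal adapter might look like the sketch below, where rag_pipeline is a placeholder for your own application code (it is not part of the SDK):
from typing import Any, Dict

def eval_adapter(datapoint: Dict[str, Any]) -> Dict[str, Any]:
    """Thin scaffold: map the datapoint onto your app, then repackage the result."""
    inputs = datapoint.get("inputs", {})
    # Delegate to the real application logic (placeholder function)
    answer, sources = rag_pipeline(
        query=inputs["query"],
        top_k=inputs.get("top_k", 5),
    )
    # Return everything evaluators will need to score this run
    return {"answer": answer, "sources": sources}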
How should I structure my test data?
Use inputs + ground_truth Pattern
Each datapoint in your dataset should have:
{
"inputs": {
# Parameters passed to your function
"query": "user question",
"context": "additional info",
"model": "gpt-4"
},
"ground_truth": {
# Expected outputs (optional but recommended)
"answer": "expected response",
"category": "classification",
"score": 0.95
}
}
Complete Example:
dataset = [
{
"inputs": {
"question": "What is the capital of France?",
"language": "English"
},
"ground_truth": {
"answer": "Paris",
"confidence": "high"
}
},
{
"inputs": {
"question": "What is 2+2?",
"language": "English"
},
"ground_truth": {
"answer": "4",
"confidence": "absolute"
}
}
]
What signature must my function have?
Accept datapoint Parameter (v1.0)
Changed in version 1.0: Function signature changed from (inputs, ground_truth) to (datapoint: Dict[str, Any]).
Your function MUST accept a datapoint parameter, and can optionally accept a tracer parameter:
from typing import Any, Dict
from honeyhive import HoneyHiveTracer
# Option 1: Basic signature (datapoint only)
def my_function(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Your evaluation function.
Args:
datapoint: Dictionary with 'inputs' and 'ground_truth' keys
Returns:
dict: Your function's output
"""
# Extract inputs and ground_truth
inputs = datapoint.get("inputs", {})
ground_truth = datapoint.get("ground_truth", {})
# Access input parameters
user_query = inputs.get("question")
language = inputs.get("language", "English")
# ground_truth available but typically not used in function
# (used by evaluators for scoring)
# Your logic
result = process_query(user_query, language)
# Return dict
return {"answer": result, "metadata": {...}}
# Option 2: With tracer parameter (for advanced tracing)
def my_function_with_tracer(
datapoint: Dict[str, Any],
tracer: HoneyHiveTracer # Optional - auto-injected by evaluate()
) -> Dict[str, Any]:
"""Evaluation function with tracer access.
Args:
datapoint: Dictionary with 'inputs' and 'ground_truth' keys
tracer: HoneyHiveTracer instance (optional, auto-provided)
Returns:
dict: Your function's output
"""
inputs = datapoint.get("inputs", {})
# Use tracer for enrichment
tracer.enrich_session(metadata={"user_id": inputs.get("user_id")})
result = process_query(inputs["question"])
return {"answer": result}
Important
Required Parameters:
Accept datapoint: Dict[str, Any] as the first parameter (required)
Optional Parameters:
Accept tracer: HoneyHiveTracer as the second parameter (optional - auto-injected by evaluate())
Requirements:
Extract inputs with datapoint.get("inputs", {})
Extract ground_truth with datapoint.get("ground_truth", {})
Return value should be a dictionary
Type hints are strongly recommended
Backward Compatibility (Deprecated):
Deprecated since version 1.0: The old (inputs, ground_truth) signature is deprecated but still supported
for backward compatibility. It will be removed in v2.0.
# ⚠️ Deprecated: Old signature (still works in v1.0)
def old_style_function(inputs, ground_truth):
# This still works but will be removed in v2.0
return {"output": inputs["query"]}
# ✅ Recommended: New signature (v1.0+)
def new_style_function(datapoint: Dict[str, Any]) -> Dict[str, Any]:
inputs = datapoint.get("inputs", {})
return {"output": inputs["query"]}
Can I use async functions with evaluate()?
Added in version 1.0: The evaluate() function now supports async functions.
Yes! Async functions are fully supported.
If your application uses async operations (like async LLM clients), you can pass an async function directly to evaluate(). Async functions are automatically detected and executed correctly.
from typing import Any, Dict
from honeyhive import HoneyHiveTracer
from honeyhive.experiments import evaluate
# Option 1: Basic async function
async def my_async_function(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Async evaluation function.
Args:
datapoint: Dictionary with 'inputs' and 'ground_truth' keys
Returns:
dict: Your function's output
"""
inputs = datapoint.get("inputs", {})
# Use async operations (e.g., async LLM client)
result = await async_llm_call(inputs["prompt"])
return {"answer": result}
# Option 2: Async function with tracer parameter
async def my_async_function_with_tracer(
datapoint: Dict[str, Any],
tracer: HoneyHiveTracer
) -> Dict[str, Any]:
"""Async evaluation function with tracer access.
Args:
datapoint: Dictionary with 'inputs' and 'ground_truth' keys
tracer: HoneyHiveTracer instance (auto-injected)
Returns:
dict: Your function's output
"""
inputs = datapoint.get("inputs", {})
# Use tracer for enrichment
tracer.enrich_session(metadata={"async": True})
# Use async operations
result = await async_llm_call(inputs["prompt"])
return {"answer": result}
# Run experiment with async function - works the same as sync!
result = evaluate(
function=my_async_function,
dataset=dataset,
api_key="your-api-key",
project="your-project",
name="Async Experiment v1"
)
Note
How it works:
Async functions are automatically detected using asyncio.iscoroutinefunction()
Each datapoint is processed in a separate thread using ThreadPoolExecutor
Async functions are executed with asyncio.run() inside each worker thread
Both sync and async functions work seamlessly with the optional tracer parameter (a sketch of this dispatch pattern follows after this note)
When to use async functions:
When using async LLM clients (e.g., openai.AsyncOpenAI)
When making concurrent API calls within your function
When your existing application code is already async
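The following is an illustrative sketch of the dispatch pattern described in the note above, not the SDK's actual implementation; it shows how sync and async evaluation functions can be run uniformly from worker threads:
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

def _run_one(func: Callable, datapoint: Dict[str, Any]) -> Dict[str, Any]:
    """Execute one datapoint, handling sync and async functions alike."""
    if asyncio.iscoroutinefunction(func):
        # Each worker thread runs its own event loop via asyncio.run()
        return asyncio.run(func(datapoint))
    return func(datapoint)

def _run_all(func: Callable, dataset: List[Dict[str, Any]], max_workers: int = 10) -> List[Dict[str, Any]]:
    """Fan datapoints out across a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda dp: _run_one(func, dp), dataset))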
How do I use ground_truth from datapoints in my experiments?
Client-Side vs Server-Side Evaluators
The ground_truth from your datapoints can be used by evaluators to measure quality. Choose between client-side or server-side evaluation based on your architecture.
Client-Side Evaluators (Recommended)
Pass data down to the evaluation function so it’s available for client-side evaluators:
from typing import Any, Dict
from honeyhive.experiments import evaluate
def my_llm_app(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Evaluation function that passes through data for evaluators."""
inputs = datapoint.get("inputs", {})
ground_truth = datapoint.get("ground_truth", {})
# Call your LLM
result = call_llm(inputs["prompt"])
# Return outputs AND pass through ground_truth for evaluators
return {
"answer": result,
"ground_truth": ground_truth, # Make available to evaluators
"intermediate_steps": [...] # Any other data for evaluation
}
# Your evaluator receives both the output and datapoint context
def accuracy_evaluator(output: Dict[str, Any], datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Client-side evaluator with access to ground truth."""
predicted = output["answer"]
expected = output["ground_truth"]["answer"] # From evaluation function output
is_correct = predicted.lower() == expected.lower()
return {
"score": 1.0 if is_correct else 0.0,
"metadata": {"predicted": predicted, "expected": expected}
}
# Run evaluation with client-side evaluator
result = evaluate(
function=my_llm_app,
dataset=dataset,
evaluators=[accuracy_evaluator],
name="Accuracy Test"
)
Note
When to Use Client-Side Evaluators
Simple, self-contained evaluation logic
Evaluators that need access to intermediate steps
When you can easily pass data through the evaluation function
Faster feedback (no roundtrip to HoneyHive)
Server-Side Evaluators
For complex applications where it’s hard to pass intermediate steps, use enrich_session() to bring data up to the session level:
from typing import Any, Dict
from honeyhive import HoneyHiveTracer
from honeyhive.experiments import evaluate
def complex_app(datapoint: Dict[str, Any], tracer: HoneyHiveTracer) -> Dict[str, Any]:
"""Complex app with hard-to-pass intermediate steps."""
inputs = datapoint.get("inputs", {})
# Step 1: Document retrieval (deep in call stack)
docs = retrieve_documents(inputs["query"])
# Step 2: LLM call (deep in another function)
result = generate_answer(inputs["query"], docs)
# Instead of threading data through complex call stacks,
# use enrich_session to make it available at session level
tracer.enrich_session(
outputs={
"answer": result,
"retrieved_docs": docs,
"doc_count": len(docs)
},
metadata={
"ground_truth": datapoint.get("ground_truth", {}),
"experiment_version": "v2"
}
)
return {"answer": result}
# Run evaluation - use server-side evaluators in HoneyHive dashboard
result = evaluate(
function=complex_app,
dataset=dataset,
name="Complex App Evaluation"
)
# Then configure server-side evaluators in HoneyHive to compare
# session.outputs.answer against session.metadata.ground_truth.answer
Note
When to Use Server-Side Evaluators
Complex, nested application architectures
Intermediate steps are hard to pass through function calls
Need to evaluate data from multiple spans/sessions together
Want centralized evaluation logic in HoneyHive dashboard
Decision Matrix:
| Scenario | Use Client-Side | Use Server-Side |
|---|---|---|
| Simple function | ✅ Easy to pass data | ❌ Overkill |
| Complex nested calls | ❌ Hard to thread data | ✅ Use enrich_session |
| Evaluation speed | ✅ Faster (local) | ⚠️ Slower (API roundtrip) |
| Centralized logic | ❌ In code | ✅ In dashboard |
| Team collaboration | ⚠️ Requires code changes | ✅ No code changes needed |
How do I enrich sessions or spans during evaluation?
Added in version 1.0: You can now receive a tracer parameter in your evaluation function.
Use the tracer Parameter for Advanced Tracing
If your function needs to enrich sessions or use the tracer instance,
add a tracer parameter to your function signature:
from typing import Any, Dict
from honeyhive import HoneyHiveTracer
from honeyhive.experiments import evaluate
def my_function(
datapoint: Dict[str, Any],
tracer: HoneyHiveTracer # Optional tracer parameter
) -> Dict[str, Any]:
"""Function with tracer access.
Args:
datapoint: Test data with 'inputs' and 'ground_truth'
tracer: HoneyHiveTracer instance (auto-injected)
Returns:
Function outputs
"""
inputs = datapoint.get("inputs", {})
# Enrich the session with metadata
tracer.enrich_session(
metadata={"experiment_version": "v2", "user_id": "test-123"}
)
# Call your application logic - enrich_span happens inside
result = process_query(inputs["query"], tracer)
return {"answer": result}
def process_query(query: str, tracer: HoneyHiveTracer) -> str:
"""Application logic that enriches spans.
Call enrich_span from within your actual processing functions,
not directly in the evaluation function.
"""
# Do some processing
result = call_llm(query)
# Enrich the span with metrics from within this function
tracer.enrich_span(
metrics={"processing_time": 0.5, "token_count": 150},
metadata={"model": "gpt-4", "temperature": 0.7}
)
return result
# The tracer is automatically provided by evaluate()
result = evaluate(
function=my_function,
dataset=dataset,
name="experiment-v1"
)
Important
The tracer parameter is optional - only add it if needed
The tracer is automatically injected by evaluate()
Use it to call enrich_session() or access the tracer instance
Each datapoint gets its own tracer instance (multi-instance architecture)
Without tracer parameter (simpler):
def simple_function(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Function without tracer access."""
inputs = datapoint.get("inputs", {})
return {"answer": process_query(inputs["query"])}
How do I trace third-party library calls in my evaluation?
Added in version 1.0: The evaluate() function now supports the instrumentors parameter.
Use the instrumentors Parameter for Automatic Tracing
If your evaluation function uses third-party libraries (OpenAI, Anthropic, Google ADK, LangChain, etc.), you can automatically trace their calls by passing instrumentor factory functions:
from typing import Any, Dict
import openai
from honeyhive.experiments import evaluate
from openinference.instrumentation.openai import OpenAIInstrumentor
def my_function(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Evaluation function using OpenAI."""
inputs = datapoint.get("inputs", {})
# OpenAI calls will be automatically traced
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": inputs["prompt"]}]
)
return {"answer": response.choices[0].message.content}
# Pass instrumentor factories - each datapoint gets its own instance
result = evaluate(
function=my_function,
dataset=dataset,
instrumentors=[lambda: OpenAIInstrumentor()], # Factory function
name="openai-traced-experiment"
)
Important
Why Factory Functions?
The instrumentors parameter accepts factory functions (callables that return instrumentor instances), not instrumentor instances directly. This ensures each datapoint gets its own isolated instrumentor instance, preventing trace routing issues in concurrent processing.
Correct: instrumentors=[lambda: OpenAIInstrumentor()]
Incorrect: instrumentors=[OpenAIInstrumentor()]
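To see why this matters: each call to the factory produces a distinct instrumentor object, so no state is shared across concurrent datapoints. A quick illustrative check (not SDK code):
factory = lambda: OpenAIInstrumentor()
a, b = factory(), factory()
assert a is not b  # two independent instances, one per datapoint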
Multiple Instrumentors:
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.langchain import LangChainInstrumentor
result = evaluate(
function=my_function,
dataset=dataset,
instrumentors=[
lambda: OpenAIInstrumentor(),
lambda: LangChainInstrumentor(),
],
name="multi-instrumented-experiment"
)
Google ADK Example:
from openinference.instrumentation.google_adk import GoogleADKInstrumentor
from google.adk.agents import Agent
from google.adk.runners import Runner
async def run_adk_agent(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Run Google ADK agent - calls are automatically traced."""
inputs = datapoint.get("inputs", {})
agent = Agent(name="my_agent", model="gemini-2.0-flash", ...)
runner = Runner(agent=agent, ...)
# ADK agent calls will be traced
response = await runner.run_async(...)
return {"response": response}
result = evaluate(
function=run_adk_agent,
dataset=dataset,
instrumentors=[lambda: GoogleADKInstrumentor()],
name="adk-agent-evaluation"
)
Note
How it works:
Each datapoint gets its own tracer instance (multi-instance architecture)
For each datapoint, the SDK creates fresh instrumentor instances from your factories
Instrumentors are configured with the datapoint’s tracer provider via instrumentor.instrument(tracer_provider=tracer.provider)
This ensures all traces from that datapoint are routed to the correct session (see the conceptual sketch after this note)
Supported Instrumentors:
Any OpenInference-compatible instrumentor works with this pattern:
openinference.instrumentation.openai.OpenAIInstrumentor
openinference.instrumentation.anthropic.AnthropicInstrumentor
openinference.instrumentation.google_adk.GoogleADKInstrumentor
openinference.instrumentation.langchain.LangChainInstrumentor
openinference.instrumentation.llama_index.LlamaIndexInstrumentor
And many more…
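Conceptually, the per-datapoint wiring described above looks roughly like the sketch below. This is not the SDK's actual code; make_tracer_for and run_user_function are hypothetical placeholders used only to show the order of operations:
for datapoint in dataset:
    tracer = make_tracer_for(datapoint)        # hypothetical helper: one tracer per datapoint
    for factory in instrumentor_factories:
        instrumentor = factory()               # fresh instance from your factory
        instrumentor.instrument(tracer_provider=tracer.provider)
    run_user_function(datapoint, tracer)       # hypothetical helper: invokes your evaluation function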
My experiments are too slow on large datasets
Use max_workers for Parallel Processing
# Slow: Sequential processing (default)
result = evaluate(
function=my_function,
dataset=large_dataset, # 1000 items
api_key="your-api-key",
project="your-project"
)
# Takes: ~1000 seconds if each item takes 1 second
# Fast: Parallel processing
result = evaluate(
function=my_function,
dataset=large_dataset, # 1000 items
max_workers=20, # Process 20 items simultaneously
api_key="your-api-key",
project="your-project"
)
# Takes: ~50 seconds (20x faster)
Choosing max_workers:
# Conservative (good for API rate limits)
max_workers=5
# Balanced (good for most cases)
max_workers=10
# Aggressive (fast but watch rate limits)
max_workers=20
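If you want a starting point rather than a guess, a rough sizing heuristic (our assumption, not an SDK rule) is to keep the steady-state request rate under your provider's rate limit:
rpm_limit = 600        # requests per minute allowed by your LLM provider
avg_latency_s = 2.0    # average seconds to process one datapoint
# With N workers each taking avg_latency_s, throughput ≈ 60 * N / avg_latency_s requests per minute
max_safe_workers = int(rpm_limit * avg_latency_s / 60)  # ≈ 20 here
max_workers = min(20, max_safe_workers)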
How do I avoid hardcoding credentials?
Use Environment Variables
import os
# Set environment variables
os.environ["HH_API_KEY"] = "your-api-key"
os.environ["HH_PROJECT"] = "your-project"
# Now you can omit api_key and project
result = evaluate(
function=my_function,
dataset=dataset,
name="Experiment v1"
)
Or use a .env file:
# .env file
HH_API_KEY=your-api-key
HH_PROJECT=your-project
HH_SOURCE=dev # Optional: environment identifier
from dotenv import load_dotenv
load_dotenv()
# Credentials loaded automatically
result = evaluate(
function=my_function,
dataset=dataset,
name="Experiment v1"
)
How should I name my experiments?
Use Descriptive, Versioned Names
# ❌ Bad: Generic names
name="test"
name="experiment"
name="run1"
# ✅ Good: Descriptive names
name="gpt-3.5-baseline-v1"
name="improved-prompt-v2"
name="rag-with-reranking-v1"
name="production-candidate-2024-01-15"
Naming Convention:
# Format: {change-description}-{version}
evaluate(
function=baseline_function,
dataset=dataset,
name="gpt-3.5-baseline-v1",
api_key="your-api-key",
project="your-project"
)
evaluate(
function=improved_function,
dataset=dataset,
name="gpt-4-improved-v1", # Easy to compare
api_key="your-api-key",
project="your-project"
)
How do I access experiment results in code?
Use the Returned EvaluationResult Object
result = evaluate(
function=my_function,
dataset=dataset,
api_key="your-api-key",
project="your-project"
)
# Access run information
print(f"Run ID: {result.run_id}")
print(f"Status: {result.status}")
print(f"Dataset ID: {result.dataset_id}")
# Access session IDs (one per datapoint)
print(f"Session IDs: {result.session_ids}")
# Access evaluation data
print(f"Results: {result.data}")
# Export to JSON
result.to_json() # Saves to {suite_name}.json
I want to see what’s happening during evaluation
Enable Verbose Output
result = evaluate(
function=my_function,
dataset=dataset,
verbose=True, # Show progress
api_key="your-api-key",
project="your-project"
)
# Output:
# Processing datapoint 1/10...
# Processing datapoint 2/10...
# ...
Show me a complete real-world example
Question Answering Pipeline (v1.0)
from typing import Any, Dict
from honeyhive.experiments import evaluate
import openai
import os
# Setup
os.environ["HH_API_KEY"] = "your-honeyhive-key"
os.environ["HH_PROJECT"] = "qa-system"
os.environ["OPENAI_API_KEY"] = "your-openai-key"  # picked up by openai.OpenAI()
# Define function to test
def qa_pipeline(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Answer questions using GPT-4.
Args:
datapoint: Contains 'inputs' and 'ground_truth'
Returns:
Dictionary with answer, model, and token count
"""
client = openai.OpenAI()
inputs = datapoint.get("inputs", {})
question = inputs["question"]
context = inputs.get("context", "")
prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0.0
)
return {
"answer": response.choices[0].message.content,
"model": "gpt-4",
"tokens": response.usage.total_tokens
}
# Create test dataset
dataset = [
{
"inputs": {
"question": "What is machine learning?",
"context": "ML is a subset of AI"
},
"ground_truth": {
"answer": "Machine learning is a subset of artificial intelligence..."
}
},
{
"inputs": {
"question": "What is deep learning?",
"context": "DL uses neural networks"
},
"ground_truth": {
"answer": "Deep learning uses neural networks..."
}
}
]
# Run experiment
result = evaluate(
function=qa_pipeline,
dataset=dataset,
name="qa-gpt4-baseline-v1",
max_workers=5,
verbose=True
)
print(f"✅ Experiment complete!")
print(f"📊 Run ID: {result.run_id}")
print(f"🔗 View in dashboard: https://app.honeyhive.ai/projects/qa-system")
See Also
Creating Evaluators - Add metrics to your experiments
Using Datasets in Experiments - Use datasets from HoneyHive UI
Comparing Experiments - Compare multiple experiment runs
Core Functions - Complete evaluate() API reference