Evaluators
Decorator-based system for defining custom quality checks and evaluators.
Overview
Evaluators assess the quality of LLM outputs. HoneyHive uses a modern decorator-based approach instead of class inheritance, making evaluators simpler and more flexible.
Key Features:
Simple function-based definitions
Support for both sync and async evaluators
Flexible return formats (dict, float, bool)
Automatic metric aggregation
Per-evaluator configuration
@evaluator Decorator
- @evaluator(func=None, *, name=None, settings=None)
Decorator to mark a function as a synchronous evaluator.
- Parameters:
func (Optional[Callable]) – Function to decorate (when used as @evaluator without parentheses).
name (Optional[str]) – Optional custom name for the evaluator. Defaults to the function name.
settings (Optional[EvaluatorSettings]) – Optional evaluator configuration.
- Returns:
Decorated evaluator function.
- Return type:
Callable
Signature Requirements:
Your evaluator function must accept these parameters:
def my_evaluator(outputs, inputs, ground_truth):
    """
    Args:
        outputs: Dict returned by your function
        inputs: Dict from datapoint["inputs"]
        ground_truth: Dict from datapoint["ground_truth"] (optional)

    Returns:
        Dict with "score" and optional metrics,
        or float (interpreted as score),
        or bool (1.0 if True, 0.0 if False)
    """
    return {"score": 0.9, "passed": True}
Basic Usage (No Arguments)
from honeyhive.experiments import evaluator

@evaluator
def accuracy_check(outputs, inputs, ground_truth):
    """Check if output matches expected result."""
    return {
        "score": 1.0 if outputs == ground_truth else 0.0,
        "passed": outputs == ground_truth
    }
With Custom Name
@evaluator(name="custom_accuracy_v2")
def accuracy_check(outputs, inputs, ground_truth):
    return {"score": calculate_score(outputs, ground_truth)}
With Settings
from honeyhive.experiments import evaluator, EvaluatorSettings

@evaluator(settings=EvaluatorSettings(
    threshold=0.8,
    weight=2.0,
    enabled=True
))
def weighted_accuracy(outputs, inputs, ground_truth):
    score = calculate_accuracy(outputs, ground_truth)
    return {"score": score}
Return Formats
Evaluators can return various formats:
# Dict with score and metadata (RECOMMENDED)
@evaluator
def detailed_evaluator(outputs, inputs, ground_truth):
    return {
        "score": 0.85,
        "passed": True,
        "confidence": 0.95,
        "reason": "Output matches expected pattern"
    }

# Simple float score
@evaluator
def simple_score(outputs, inputs, ground_truth):
    return 0.85  # Interpreted as {"score": 0.85}

# Boolean (1.0 if True, 0.0 if False)
@evaluator
def pass_fail(outputs, inputs, ground_truth):
    return outputs["answer"] == ground_truth["answer"]
@aevaluator Decorator
- @aevaluator(func=None, *, name=None, settings=None)
Decorator to mark a function as an asynchronous evaluator.
Same parameters and behavior as @evaluator, but for async functions.
- Parameters:
func (Optional[Callable]) – Async function to decorate.
name (Optional[str]) – Optional custom name for the evaluator.
settings (Optional[EvaluatorSettings]) – Optional evaluator configuration.
- Returns:
Decorated async evaluator function.
- Return type:
Callable
Basic Async Evaluator
from honeyhive.experiments import aevaluator
import httpx

@aevaluator
async def external_api_check(outputs, inputs, ground_truth):
    """Call external API to validate output."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.example.com/validate",
            json={"output": outputs, "expected": ground_truth}
        )
        data = response.json()
        return {
            "score": data["score"],
            "api_confidence": data["confidence"]
        }
Multiple Async Operations
import asyncio

@aevaluator
async def multi_source_validation(outputs, inputs, ground_truth):
    """Validate against multiple external sources."""
    async with httpx.AsyncClient() as client:
        # Run validations concurrently
        results = await asyncio.gather(
            client.post("https://api1.com/check", json=outputs),
            client.post("https://api2.com/check", json=outputs),
            client.post("https://api3.com/check", json=outputs),
        )
        scores = [r.json()["score"] for r in results]
        avg_score = sum(scores) / len(scores)
        return {
            "score": avg_score,
            "individual_scores": scores,
            "sources_checked": len(scores)
        }
Mixing Sync and Async
You can use both sync and async evaluators together:
@evaluator
def fast_local_check(outputs, inputs, ground_truth):
    """Quick local validation."""
    return {"score": local_validation(outputs)}

@aevaluator
async def slow_api_check(outputs, inputs, ground_truth):
    """Slower external validation."""
    result = await external_api.validate(outputs)
    return {"score": result.score}

# Use both in evaluate()
result = evaluate(
    function=my_function,
    dataset=test_data,
    evaluators=[fast_local_check, slow_api_check],  # Mixed!
    api_key="key",
    project="project"
)
EvaluatorSettings
- class EvaluatorSettings
Configuration for individual evaluators.
- Parameters:
threshold (Optional[float]) – Minimum score to consider as “passed”.
weight (Optional[float]) – Relative weight for aggregation (default: 1.0).
enabled (bool) – Whether this evaluator is active (default: True).
timeout (Optional[float]) – Maximum execution time in seconds.
retry_count (int) – Number of retries on failure.
Usage Example
from honeyhive.experiments import evaluator, EvaluatorSettings

@evaluator(settings=EvaluatorSettings(
    threshold=0.7,   # Pass if score >= 0.7
    weight=2.0,      # Double weight in aggregation
    enabled=True,    # Can disable without removing
    timeout=5.0,     # 5 second timeout
    retry_count=3    # Retry up to 3 times
))
def critical_evaluator(outputs, inputs, ground_truth):
    return {"score": validate(outputs)}
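For intuition, the standalone sketch below shows one way a runner could honor enabled, timeout, and retry_count. It is not HoneyHive's internal implementation; run_with_settings is a hypothetical helper and assumes an async evaluator function.

import asyncio

async def run_with_settings(evaluator_fn, outputs, inputs, ground_truth, settings):
    """Hypothetical runner: skip disabled evaluators, retry failures, enforce a timeout."""
    if not settings.enabled:
        return None  # Disabled evaluators are skipped entirely
    last_error = None
    for attempt in range(settings.retry_count + 1):
        try:
            return await asyncio.wait_for(
                evaluator_fn(outputs, inputs, ground_truth),
                timeout=settings.timeout,
            )
        except Exception as exc:  # A timeout or evaluator error triggers a retry
            last_error = exc
    return {"score": 0.0, "error": str(last_error)}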
EvalResult
- class EvalResult
Result object returned by evaluators (internal representation).
Note
This class is used internally. Your evaluator functions return dicts, which are automatically converted to EvalResult objects.
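For reference, the conversion described above can be pictured with a small helper. This is only an illustration of the documented dict/float/bool semantics, not the SDK's actual conversion code.

def normalize_result(raw):
    """Illustrative only: map supported return formats onto a score dict."""
    if isinstance(raw, dict):
        return raw                             # Already {"score": ..., ...}
    if isinstance(raw, bool):                  # Check bool before int/float (bool is an int subclass)
        return {"score": 1.0 if raw else 0.0}
    if isinstance(raw, (int, float)):
        return {"score": float(raw)}
    raise TypeError(f"Unsupported evaluator return type: {type(raw)!r}")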
Aggregation Functions
When multiple evaluators run on the same datapoint, their scores are aggregated.
- mean(scores)
Calculate arithmetic mean of scores.
>>> mean([0.8, 0.9, 0.7])
0.8
- median(scores)
Calculate median score.
>>> median([0.8, 0.9, 0.7, 0.6, 1.0])
0.8
- mode(scores)
Calculate mode (most common score).
>>> mode([0.8, 0.8, 0.9, 0.7, 0.8])
0.8
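To make aggregation concrete, the snippet below combines a few per-evaluator scores with Python's statistics module. The weighted mean at the end is only a sketch of how an EvaluatorSettings weight could factor in, not the library's own aggregation code.

import statistics

scores = {"accuracy_check": 0.9, "length_check": 1.0, "llm_judge": 0.7}
weights = {"accuracy_check": 2.0, "length_check": 1.0, "llm_judge": 1.0}

print(statistics.mean(scores.values()))    # 0.866...
print(statistics.median(scores.values()))  # 0.9

# Hypothetical weighted mean using per-evaluator weights
weighted = sum(scores[k] * weights[k] for k in scores) / sum(weights.values())
print(weighted)  # 0.875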
Evaluator Patterns
1. Exact Match
@evaluator
def exact_match(outputs, inputs, ground_truth):
    """Check for exact string match."""
    return {
        "score": 1.0 if outputs["answer"] == ground_truth["answer"] else 0.0,
        "matched": outputs["answer"] == ground_truth["answer"]
    }
2. Semantic Similarity
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

@evaluator
def semantic_similarity(outputs, inputs, ground_truth):
    """Calculate semantic similarity between output and expected."""
    output_embedding = model.encode([outputs["answer"]])
    expected_embedding = model.encode([ground_truth["answer"]])
    similarity = cosine_similarity(output_embedding, expected_embedding)[0][0]
    return {
        "score": float(similarity),
        "passed": similarity >= 0.8,
        "similarity": float(similarity)
    }
3. Length Validation
@evaluator
def length_check(outputs, inputs, ground_truth):
    """Validate output length is within acceptable range."""
    text = outputs.get("answer", "")
    word_count = len(text.split())
    min_words = inputs.get("min_words", 10)
    max_words = inputs.get("max_words", 100)
    in_range = min_words <= word_count <= max_words
    return {
        "score": 1.0 if in_range else 0.0,
        "word_count": word_count,
        "in_range": in_range,
        "min_words": min_words,
        "max_words": max_words
    }
4. Multi-Criteria Evaluation
@evaluator
def comprehensive_quality(outputs, inputs, ground_truth):
    """Evaluate multiple quality criteria."""
    answer = outputs.get("answer", "")

    # Individual criteria
    has_answer = len(answer) > 0
    correct_length = 50 <= len(answer) <= 200
    no_profanity = not contains_profanity(answer)
    factually_correct = check_facts(answer, ground_truth)

    # Weighted score
    criteria_scores = {
        "has_answer": 1.0 if has_answer else 0.0,
        "correct_length": 1.0 if correct_length else 0.5,
        "no_profanity": 1.0 if no_profanity else 0.0,
        "factually_correct": 1.0 if factually_correct else 0.0
    }

    # Average with weights
    weights = {"has_answer": 1, "correct_length": 1, "no_profanity": 2, "factually_correct": 3}
    total_weight = sum(weights.values())
    weighted_sum = sum(criteria_scores[k] * weights[k] for k in criteria_scores)
    final_score = weighted_sum / total_weight

    return {
        "score": final_score,
        "criteria_scores": criteria_scores,
        "all_passed": all(v == 1.0 for v in criteria_scores.values())
    }
5. LLM-as-Judge
import openai

@evaluator
def llm_judge(outputs, inputs, ground_truth):
    """Use an LLM to judge output quality."""
    client = openai.OpenAI()
    prompt = f"""
    Evaluate the following answer for accuracy and relevance.

    Question: {inputs['query']}
    Expected Answer: {ground_truth['answer']}
    Actual Answer: {outputs['answer']}

    Provide a score from 0.0 to 1.0 and explain your reasoning.
    """
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse response (assumes structured format)
    result = parse_llm_response(response.choices[0].message.content)
    return {
        "score": result["score"],
        "reasoning": result["explanation"]
    }
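The parse_llm_response helper above is a placeholder. One possible sketch, assuming the judge prompt is adjusted to request a JSON object with "score" and "explanation" fields, might look like this:

import json
import re

def parse_llm_response(text):
    """Hypothetical parser: pull a JSON object with "score" and "explanation" out of the reply."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return {"score": 0.0, "explanation": "Could not parse judge response"}
    data = json.loads(match.group(0))
    return {"score": float(data["score"]), "explanation": data.get("explanation", "")}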
Best Practices
1. Keep Evaluators Pure
Avoid side effects in evaluators:
# GOOD
@evaluator
def pure_evaluator(outputs, inputs, ground_truth):
    return {"score": calculate_score(outputs, ground_truth)}

# BAD - Has side effects
@evaluator
def impure_evaluator(outputs, inputs, ground_truth):
    database.save_result(outputs)  # Side effect!
    return {"score": 0.9}
2. Handle Missing Data Gracefully
@evaluator
def robust_evaluator(outputs, inputs, ground_truth):
    # Handle missing keys
    answer = outputs.get("answer", "")
    expected = ground_truth.get("answer", "") if ground_truth else ""

    if not answer:
        return {"score": 0.0, "error": "No answer provided"}
    if not expected:
        return {"score": 0.5, "warning": "No ground truth available"}

    return {"score": compare(answer, expected)}
3. Provide Detailed Metadata
@evaluator
def detailed_evaluator(outputs, inputs, ground_truth):
    score = calculate_score(outputs, ground_truth)
    return {
        "score": score,
        "passed": score >= 0.8,
        # Add debugging info
        "output_length": len(str(outputs)),
        "processing_method": "semantic_similarity",
        "confidence": calculate_confidence(score),
        "suggestions": generate_improvements(outputs, ground_truth) if score < 0.8 else None
    }
4. Use Timeouts for External Calls
import asyncio

@aevaluator
async def api_evaluator_with_timeout(outputs, inputs, ground_truth):
    try:
        # Set timeout (asyncio.timeout requires Python 3.11+)
        async with asyncio.timeout(5.0):
            result = await external_api.validate(outputs)
            return {"score": result.score}
    except asyncio.TimeoutError:
        return {"score": 0.0, "error": "API timeout"}
See Also
Core Functions - Run evaluators with evaluate()
Data Models - Result data models
Evaluation & Analysis Guides - Evaluator tutorial
Evaluation & Analysis Guides - Evaluator patterns