Evaluators Reference
Complete reference for all evaluation classes and functions in HoneyHive.
Base Classes
BaseEvaluator
Base class for all custom evaluators.
- class honeyhive.evaluation.evaluators.BaseEvaluator(name, **kwargs)[source]
Bases: object
Base class for custom evaluators.
Example
from honeyhive.evaluation import BaseEvaluator

class CustomEvaluator(BaseEvaluator):
    def __init__(self, threshold=0.5, **kwargs):
        super().__init__("custom_evaluator", **kwargs)
        self.threshold = threshold

    def evaluate(self, inputs, outputs, ground_truth=None, **kwargs):
        # Custom evaluation logic; _compute_score is a placeholder
        # for your own scoring function
        score = self._compute_score(outputs)
        return {
            "score": score,
            "passed": score >= self.threshold
        }
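A minimal usage sketch for the CustomEvaluator defined above; the inputs/outputs keys shown are illustrative, not required by the SDK:
# Instantiate the custom evaluator with a stricter threshold
evaluator = CustomEvaluator(threshold=0.7)

# Evaluate one input/output pair; key names are illustrative
result = evaluator.evaluate(
    inputs={"prompt": "What is AI?"},
    outputs={"response": "AI is the simulation of human intelligence."}
)
print(result["score"], result["passed"])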
Built-in Evaluators
ExactMatchEvaluator
Evaluates exact string matching between expected and actual outputs.
- class honeyhive.evaluation.evaluators.ExactMatchEvaluator(**kwargs)[source]
Bases: BaseEvaluator
Evaluator for exact string matching.
- Parameters:
kwargs (Any)
Description
The ExactMatchEvaluator checks if the actual output exactly matches the expected output. String comparisons are case-insensitive and whitespace is stripped.
Example
from honeyhive.evaluation import ExactMatchEvaluator

evaluator = ExactMatchEvaluator()

result = evaluator.evaluate(
    inputs={"expected": "The answer is 42"},
    outputs={"response": "The answer is 42"}
)
# Returns: {"exact_match": 1.0, "expected": "...", "actual": "..."}

# Case-insensitive matching
result = evaluator.evaluate(
    inputs={"expected": "hello"},
    outputs={"response": "HELLO"}
)
# Returns: {"exact_match": 1.0, ...}
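For reference, the comparison described above (strip whitespace, ignore case) can be sketched in a few lines; this is an illustration of the documented behavior, not the evaluator's internal implementation:
def _normalize(text: str) -> str:
    # Strip surrounding whitespace and lowercase before comparing
    return text.strip().lower()

def exact_match_score(expected: str, actual: str) -> float:
    return 1.0 if _normalize(expected) == _normalize(actual) else 0.0

print(exact_match_score("hello", "  HELLO "))  # 1.0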
F1ScoreEvaluator
Evaluates F1 score for text similarity.
- class honeyhive.evaluation.evaluators.F1ScoreEvaluator(**kwargs)[source]
Bases: BaseEvaluator
Evaluator for F1 score calculation.
- Parameters:
kwargs (Any)
Description
The F1ScoreEvaluator computes the F1 score between predicted and ground truth text based on word-level token overlap. It calculates precision and recall and combines them into an F1 score.
Formula
precision = |predicted_words ∩ ground_truth_words| / |predicted_words|
recall = |predicted_words ∩ ground_truth_words| / |ground_truth_words|
f1_score = 2 * (precision * recall) / (precision + recall)
Example
from honeyhive.evaluation import F1ScoreEvaluator

evaluator = F1ScoreEvaluator()

result = evaluator.evaluate(
    inputs={"expected": "the quick brown fox"},
    outputs={"response": "the fast brown fox"}
)
# Returns: {"f1_score": 0.75}  # 3 out of 4 words match
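To make the formula concrete, the sketch below reproduces the word-level F1 computation on the example strings; the evaluator's exact tokenization may differ:
def word_f1(expected: str, predicted: str) -> float:
    # Word-level token sets; set-based overlap is an assumption
    expected_words = set(expected.split())
    predicted_words = set(predicted.split())
    overlap = expected_words & predicted_words
    if not overlap:
        return 0.0
    precision = len(overlap) / len(predicted_words)
    recall = len(overlap) / len(expected_words)
    return 2 * precision * recall / (precision + recall)

print(word_f1("the quick brown fox", "the fast brown fox"))  # 0.75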
SemanticSimilarityEvaluator
Evaluates semantic similarity using embeddings.
- class honeyhive.evaluation.evaluators.SemanticSimilarityEvaluator(**kwargs)[source]
Bases: BaseEvaluator
Evaluator for semantic similarity using basic heuristics.
- Parameters:
kwargs (Any)
Description
The SemanticSimilarityEvaluator uses embeddings to compute semantic similarity between texts. This is more sophisticated than exact match or F1 score as it understands meaning rather than just token overlap.
Example
from honeyhive.evaluation import SemanticSimilarityEvaluator

evaluator = SemanticSimilarityEvaluator(
    embedding_model="text-embedding-ada-002",
    threshold=0.8
)

result = evaluator.evaluate(
    inputs={"expected": "The weather is nice today"},
    outputs={"response": "It's a beautiful day outside"}
)
# Returns: {"similarity": 0.85, "passed": True}
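Embedding-based similarity is typically measured as cosine similarity between the two embedding vectors; the standalone sketch below shows that computation on toy vectors and is not the evaluator's internal code:
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors here are toy values standing in for model embeddings
print(cosine_similarity([0.1, 0.3, 0.5], [0.2, 0.25, 0.55]))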
Evaluation Decorators
evaluator
Decorator for defining synchronous evaluators.
- honeyhive.evaluation.evaluators.evaluator(_name=None, _session_id=None, **_kwargs)[source]
Decorator for synchronous evaluation functions.
Description
The evaluator decorator converts a regular function into an evaluator that can be
used with the HoneyHive evaluation system.
Example
from honeyhive import evaluator

@evaluator
def length_check(inputs, outputs, ground_truth=None, min_length=10):
    """Check if output meets minimum length requirement."""
    text = outputs.get("response", "")
    length = len(text)
    return {
        "length": length,
        "meets_minimum": length >= min_length,
        "score": 1.0 if length >= min_length else 0.0
    }

# Use in evaluation
from honeyhive import evaluate

results = evaluate(
    data=[{"input": "test"}],
    task=lambda x: {"response": "short"},
    evaluators=[length_check]
)
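The signature above also lists optional _name and _session_id keyword arguments. Assuming the decorator can be applied with arguments, registering an evaluator under a custom name might look like the unverified sketch below:
@evaluator(_name="length_check_v2")  # assumed usage based on the signature shown above
def strict_length_check(inputs, outputs, ground_truth=None):
    # Same idea as length_check, with a stricter requirement,
    # registered under a custom evaluator name (assumed behavior)
    text = outputs.get("response", "")
    return {"score": 1.0 if len(text) >= 50 else 0.0}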
aevaluator
Decorator for defining asynchronous evaluators.
Description
The aevaluator decorator is used for async evaluators that need to make asynchronous calls (e.g., API calls for LLM-based evaluation).
Example
from honeyhive import aevaluator
import aiohttp

@aevaluator
async def llm_grader(inputs, outputs, ground_truth=None):
    """Use an LLM to grade the output."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.openai.com/v1/chat/completions",
            json={
                "model": "gpt-4",
                "messages": [{
                    "role": "user",
                    "content": f"Grade this output: {outputs['response']}"
                }]
            }
        ) as response:
            result = await response.json()
            # parse_grade is a placeholder for your own logic that
            # extracts a numeric grade from the API response
            grade = parse_grade(result)
            return {
                "grade": grade,
                "score": grade / 100.0
            }
EvaluatorMeta
Metaclass for evaluator type handling.
TerminalColors
Terminal color constants for formatted output.
- class honeyhive.experiments.evaluators.TerminalColors[source]
Bases: object
ANSI terminal color codes for output formatting.
- HEADER = '\x1b[95m'
- OKBLUE = '\x1b[94m'
- OKCYAN = '\x1b[96m'
- OKGREEN = '\x1b[92m'
- WARNING = '\x1b[93m'
- FAIL = '\x1b[91m'
- ENDC = '\x1b[0m'
- BOLD = '\x1b[1m'
- UNDERLINE = '\x1b[4m'
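These are standard ANSI escape codes; wrapping text in a color constant and resetting with ENDC colors terminal output, for example:
from honeyhive.experiments.evaluators import TerminalColors

# Color a pass/fail summary line and reset formatting with ENDC
print(f"{TerminalColors.OKGREEN}PASS{TerminalColors.ENDC} exact_match")
print(f"{TerminalColors.FAIL}FAIL{TerminalColors.ENDC} f1_score below threshold")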
Data Models
EvaluationResult
Result model for evaluation outputs.
- class honeyhive.evaluation.evaluators.EvaluationResult(score, metrics, feedback=None, metadata=None, evaluation_id=<factory>, timestamp=None)[source]
Bases: object
Result of an evaluation.
Fields
score (float): Numeric score from evaluation
metrics (Dict[str, Any]): Additional metrics
feedback (Optional[str]): Text feedback
metadata (Optional[Dict[str, Any]]): Additional metadata
evaluation_id (str): Unique ID for this evaluation
timestamp (Optional[str]): Timestamp of evaluation
Example
from honeyhive.evaluation import EvaluationResult

result = EvaluationResult(
    score=0.85,
    metrics={"accuracy": 0.9, "latency": 250},
    feedback="Good response, minor improvements possible",
    metadata={"model": "gpt-4", "version": "1.0"}
)
EvaluationContext
Context information for evaluation runs.
- class honeyhive.evaluation.evaluators.EvaluationContext(project, source, session_id=None, metadata=None)[source]
Bases: object
Context for evaluation runs.
Fields
project (str): Project name
source (str): Source of evaluation
session_id (Optional[str]): Session identifier
metadata (Optional[Dict[str, Any]]): Additional context
Example
from honeyhive.evaluation import EvaluationContext

context = EvaluationContext(
    project="my-llm-app",
    source="production",
    session_id="session-123",
    metadata={"user_id": "user-456"}
)
Evaluation Functions
evaluate
Main function for running evaluations.
- honeyhive.evaluation.evaluators.evaluate(prediction, ground_truth, metrics=None, **kwargs)[source]
Evaluate a prediction against ground truth.
Description
The evaluate function runs a set of evaluators on your task outputs,
collecting metrics and results for analysis.
Parameters
data (List[Dict]): Input data for evaluation
task (Callable): Function that produces outputs
evaluators (List): List of evaluator functions or objects
project (str, optional): Project name
run_name (str, optional): Name for this evaluation run
metadata (Dict, optional): Additional metadata
Returns
Dict containing:
- results: List of evaluation results
- metrics: Aggregated metrics
- summary: Summary statistics
Example
from honeyhive import evaluate, evaluator

@evaluator
def check_length(inputs, outputs, min_words=5):
    words = len(outputs["response"].split())
    return {
        "word_count": words,
        "meets_minimum": words >= min_words,
        "score": 1.0 if words >= min_words else 0.0
    }

# Define your task
def my_task(input_data):
    # Your LLM logic here
    return {"response": "Generated response"}

# Run evaluation
results = evaluate(
    data=[
        {"prompt": "What is AI?"},
        {"prompt": "Explain ML"},
    ],
    task=my_task,
    evaluators=[check_length],
    project="my-project",
    run_name="baseline-eval"
)

print(f"Average score: {results['metrics']['average_score']}")
print(f"Pass rate: {results['metrics']['pass_rate']}")
See Also
Experiments Module - Experiments API
Tutorial 5: Run Your First Experiment - Evaluation tutorial
Creating Evaluators - Creating custom evaluators
Best Practices - Evaluation best practices