Deprecation Notice
Warning
The honeyhive.evaluation module is deprecated and will be removed in version 2.0.0.
Please migrate to the experiments module for new features, better architecture,
and continued support.
Overview
The honeyhive.evaluation module has been superseded by honeyhive.experiments, which provides:
Improved Architecture: Decorator-based evaluators instead of class inheritance
Backend Aggregation: Server-side metric aggregation for better performance
Enhanced Tracer Integration: Seamless integration with the multi-instance tracer
Better Type Safety: Pydantic v2 models with full validation
Cleaner API: Simpler, more intuitive function signatures
Deprecation Timeline
| Version | Status |
|---|---|
| 0.2.x (Current) | |
| 1.x | |
| 2.0.0 (Future) | |
Migration Guide
Quick Migration Checklist
Update imports: honeyhive.evaluation → honeyhive.experiments
Replace class-based evaluators with the @evaluator decorator
Update the evaluate() function signature
Update result handling to use the new models
Detailed Migration Steps
Step 1: Update Imports
# OLD
from honeyhive.evaluation import evaluate, BaseEvaluator, EvaluationResult
# NEW
from honeyhive.experiments import evaluate, evaluator, ExperimentResultSummary
Step 2: Convert Class-Based Evaluators to Decorators
# OLD - Class inheritance
from honeyhive.evaluation import BaseEvaluator
class AccuracyEvaluator(BaseEvaluator):
    def __init__(self, threshold=0.8):
        super().__init__("accuracy")
        self.threshold = threshold

    def evaluate(self, inputs, outputs, ground_truth):
        score = calculate_accuracy(outputs, ground_truth)
        return {
            "score": score,
            "passed": score >= self.threshold
        }
# NEW - Decorator-based
from honeyhive.experiments import evaluator
@evaluator
def accuracy_evaluator(outputs, inputs, ground_truth):
    """Note: outputs is the first parameter in the new signature."""
    score = calculate_accuracy(outputs, ground_truth)
    threshold = 0.8  # Can use closures or default args
    return {
        "score": score,
        "passed": score >= threshold
    }
Step 3: Update evaluate() Function Calls
# OLD
from honeyhive.evaluation import evaluate
result = evaluate(
    inputs=test_inputs,
    outputs=test_outputs,
    evaluators=[AccuracyEvaluator(), F1Evaluator()],
    ground_truth=expected_outputs
)
# NEW
from honeyhive.experiments import evaluate
result = evaluate(
    function=my_llm_function,  # Your function to test
    dataset=[
        {"inputs": {...}, "ground_truth": {...}},
        {"inputs": {...}, "ground_truth": {...}},
    ],
    evaluators=[accuracy_evaluator, f1_evaluator],  # Function refs
    api_key="your-key",
    project="your-project",
    name="experiment-v1"
)
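The old call took inputs and ground truth as parallel lists, while the new dataset pairs them per datapoint. A minimal sketch of that conversion follows. The helper name build_dataset is hypothetical and not part of the SDK, and it assumes the two old lists have the same length and order.

```python
from typing import Any, Dict, List, Sequence


def build_dataset(
    inputs: Sequence[Any], ground_truths: Sequence[Any]
) -> List[Dict[str, Any]]:
    """Pair parallel lists into per-datapoint dicts.

    Hypothetical helper for illustration; not part of honeyhive.
    """
    if len(inputs) != len(ground_truths):
        raise ValueError(
            f"length mismatch: {len(inputs)} inputs vs "
            f"{len(ground_truths)} ground truths"
        )
    return [
        {"inputs": inp, "ground_truth": gt}
        for inp, gt in zip(inputs, ground_truths)
    ]


# Example data (made up for illustration)
test_inputs = [{"question": "2 + 2?"}, {"question": "Capital of France?"}]
expected_outputs = [{"answer": "4"}, {"answer": "Paris"}]
dataset = build_dataset(test_inputs, expected_outputs)
```

The resulting list can then be passed as the dataset argument. The explicit length check is there because zip() would otherwise silently drop unmatched items.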
Step 4: Update Result Handling
# OLD
from honeyhive.evaluation import EvaluationResult
result = evaluate(...)
# Access results (old structure)
overall_score = result.score
metrics = result.metrics
# NEW
from honeyhive.experiments import ExperimentResultSummary
result = evaluate(...)
# Access results (new structure)
print(f"Run ID: {result.run_id}")
print(f"Status: {result.status}")
print(f"Success: {result.success}")
print(f"Passed: {len(result.passed)}")
print(f"Failed: {len(result.failed)}")
# Aggregated metrics
accuracy = result.metrics.get_metric("accuracy_evaluator")
all_metrics = result.metrics.get_all_metrics()
Step 5: Update Async Evaluators
# OLD - Async class method
class AsyncEvaluator(BaseEvaluator):
    async def evaluate(self, inputs, outputs, ground_truth):
        result = await external_api_call(outputs)
        return {"score": result.score}
# NEW - @aevaluator decorator
from honeyhive.experiments import aevaluator
@aevaluator
async def async_evaluator(outputs, inputs, ground_truth):
    result = await external_api_call(outputs)
    return {"score": result.score}
Common Patterns
Pattern 1: Built-in Evaluators
# OLD
from honeyhive.evaluation.evaluators import (
    ExactMatchEvaluator,
    LengthEvaluator,
    FactualAccuracyEvaluator
)
evaluators = [
    ExactMatchEvaluator(),
    LengthEvaluator(min_length=10, max_length=100),
    FactualAccuracyEvaluator()
]
# NEW - Implement as decorator-based evaluators
from honeyhive.experiments import aevaluator, evaluator

@evaluator
def exact_match(outputs, inputs, ground_truth):
    return {"score": 1.0 if outputs == ground_truth else 0.0}

@evaluator
def length_check(outputs, inputs, ground_truth):
    length = len(str(outputs))
    in_range = 10 <= length <= 100
    return {"score": 1.0 if in_range else 0.0}

# Use external APIs for factual accuracy
@aevaluator
async def factual_accuracy(outputs, inputs, ground_truth):
    result = await fact_check_api(outputs, ground_truth)
    return {"score": result.accuracy}

evaluators = [exact_match, length_check, factual_accuracy]
Pattern 2: Evaluator with State
# OLD
class StatefulEvaluator(BaseEvaluator):
    def __init__(self, model):
        super().__init__("stateful")
        self.model = model  # Store state

    def evaluate(self, inputs, outputs, ground_truth):
        score = self.model.predict(outputs)
        return {"score": score}
# NEW - Use closures or class methods with decorator
from honeyhive.experiments import evaluator
# Option 1: Closure
def create_stateful_evaluator(model):
    @evaluator
    def stateful_evaluator(outputs, inputs, ground_truth):
        score = model.predict(outputs)
        return {"score": score}
    return stateful_evaluator

model = load_model()
my_evaluator = create_stateful_evaluator(model)

# Option 2: Class with __call__
class StatefulEvaluator:
    def __init__(self, model):
        self.model = model

    @evaluator
    def __call__(self, outputs, inputs, ground_truth):
        score = self.model.predict(outputs)
        return {"score": score}

my_evaluator = StatefulEvaluator(load_model())
Pattern 3: Batch Evaluation
# OLD
from honeyhive.evaluation import evaluate_batch
results = evaluate_batch(
    inputs_list=batch_inputs,
    outputs_list=batch_outputs,
    evaluators=[evaluator1, evaluator2],
    max_workers=4
)
# NEW - Use evaluate() with dataset
from honeyhive.experiments import evaluate
result = evaluate(
    function=my_function,
    dataset=test_dataset,
    evaluators=[evaluator1, evaluator2],
    max_workers=4,
    api_key="key",
    project="project"
)
Backward Compatibility Layer
The old evaluation module still works through a compatibility layer:
# This still works but shows deprecation warnings
from honeyhive.evaluation import evaluate, evaluator
# Internally redirects to honeyhive.experiments
result = evaluate(...)
Deprecation Warnings:
When you use the old module, you’ll see warnings like:
DeprecationWarning: honeyhive.evaluation.evaluate is deprecated.
Please use honeyhive.experiments.evaluate instead.
The evaluation module will be removed in version 2.0.0.
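By default Python only displays a DeprecationWarning when it is triggered by code running in __main__, so calls buried in library or test code can go unnoticed. The sketch below uses only the standard warnings module to collect such warnings, or to escalate them to errors so a test suite fails on remaining legacy calls. legacy_evaluate is a stand-in that emits the message shown above; it is not the real honeyhive function.

```python
import warnings


def legacy_evaluate():
    """Stand-in for a deprecated call; NOT the real honeyhive function."""
    warnings.warn(
        "honeyhive.evaluation.evaluate is deprecated. "
        "Please use honeyhive.experiments.evaluate instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return "ok"


def find_deprecations(func):
    """Call func and return the DeprecationWarning messages it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # keep default filters from hiding anything
        func()
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]


messages = find_deprecations(legacy_evaluate)

# To fail fast instead, escalate the warnings to exceptions:
with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)
    try:
        legacy_evaluate()
        escalated = False
    except DeprecationWarning:
        escalated = True
```

The same escalation is available without code changes by running the interpreter with -W error::DeprecationWarning.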
Breaking Changes
Parameter Order Change
Evaluator function signature changed:
# OLD
def evaluator(inputs, outputs, ground_truth):
    pass
# NEW - outputs comes first
def evaluator(outputs, inputs, ground_truth):
    pass
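For a large set of existing evaluator functions, a thin adapter can reorder the arguments instead of rewriting each body. This is a sketch only: to_new_order is a hypothetical helper, not an SDK function, and whether the adapted function can be passed straight to @evaluator is an assumption to check against the experiments module.

```python
def to_new_order(old_style_fn):
    """Adapt fn(inputs, outputs, ground_truth) to (outputs, inputs, ground_truth).

    Hypothetical helper for illustration; not part of honeyhive.
    """
    def wrapper(outputs, inputs, ground_truth):
        return old_style_fn(inputs, outputs, ground_truth)

    # Copy the name and docstring by hand. functools.wraps would also set
    # __wrapped__, which makes inspect.signature report the OLD order.
    wrapper.__name__ = old_style_fn.__name__
    wrapper.__doc__ = old_style_fn.__doc__
    return wrapper


def old_exact_match(inputs, outputs, ground_truth):
    """Old-order evaluator body, left unchanged."""
    return {"score": 1.0 if outputs == ground_truth else 0.0}


new_exact_match = to_new_order(old_exact_match)
result = new_exact_match("Paris", {"question": "Capital of France?"}, "Paris")
```

Passing arguments by keyword at every call site is another way to make the order irrelevant, but the adapter keeps the old function bodies untouched during the transition.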
evaluate() Signature Change
The main evaluate function has a completely new signature:
# OLD
evaluate(inputs, outputs, evaluators, ground_truth=None)
# NEW
evaluate(
    function,    # NEW: function to test
    dataset,     # NEW: combined inputs + ground_truth
    evaluators,
    api_key,     # NEW: required
    project,     # NEW: required
    name=None,
    max_workers=1,
    aggregate_function="average",
    verbose=False
)
Return Type Change
# OLD
result: EvaluationResult = evaluate(...)
result.score # Overall score
result.metrics # Dict of metrics
result.passed # Bool
# NEW
result: ExperimentResultSummary = evaluate(...)
result.run_id # Unique run ID
result.status # ExperimentRunStatus enum
result.success # Bool
result.passed # List[str] of passed datapoint IDs
result.failed # List[str] of failed datapoint IDs
result.metrics # AggregatedMetrics object
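One consequence is easy to miss: result.passed changed from a bool to a list, so an old check such as "if result.passed:" still runs but now means "at least one datapoint passed". The sketch below shows two hypothetical helpers built on the passed and failed lists. fake_result is a stand-in object carrying only those two fields, not a real ExperimentResultSummary.

```python
from types import SimpleNamespace


def pass_rate(result):
    """Fraction of datapoints that passed, or None if there were none."""
    total = len(result.passed) + len(result.failed)
    return len(result.passed) / total if total else None


def all_passed(result):
    """True only when at least one datapoint ran and none failed."""
    return bool(result.passed) and not result.failed


# Stand-in with the two list fields described above; NOT a real
# ExperimentResultSummary. The IDs are made up.
fake_result = SimpleNamespace(
    passed=["dp-1", "dp-2", "dp-3"],
    failed=["dp-4"],
)

rate = pass_rate(fake_result)
ok = all_passed(fake_result)
legacy_check = bool(fake_result.passed)  # old-style truthiness test
```

Here legacy_check is true even though one datapoint failed, which is why old boolean checks on result.passed should be reviewed during migration.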
Support & Help
Documentation:
Experiments Module - Experiments module overview
Evaluation & Analysis Guides - Updated tutorial
Migration Guide: v0.1.0+ Architecture - Complete migration guide
Common Issues:
Import Error: Make sure you've updated imports to honeyhive.experiments
Parameter Order: Remember that outputs comes first in new evaluators
Missing api_key/project: These are now required for evaluate()
Result Structure: Use the new ExperimentResultSummary structure
Getting Help:
GitHub Issues: https://github.com/honeyhive/python-sdk/issues
Documentation: https://docs.honeyhive.ai
Community: https://discord.gg/honeyhive
See Also
Experiments Module - New experiments module
Evaluators - Decorator-based evaluators
Migration Guide: v0.1.0+ Architecture - Migration guide