Evaluation Data Models
Note
Technical specification for HoneyHive evaluation data structures
This document defines the exact data models and formats used for evaluations in the HoneyHive SDK.
Evaluations assess the quality, accuracy, and performance of LLM outputs using various metrics and criteria.
Core Evaluation Model
- class Evaluation
The primary evaluation data structure.
- evaluation_id: str
Unique identifier for the evaluation.
Format: UUID v4 string (prefixed with "eval_")
Example: "eval_01234567-89ab-cdef-0123-456789abcdef"
Required: Auto-generated by SDK
- target_event_id: str
ID of the event being evaluated.
Format: UUID v4 string (prefixed with "evt_")
Example: "evt_01234567-89ab-cdef-0123-456789abcdef"
Required: Yes
- evaluator_name: str
Name of the evaluator used.
Examples: "factual_accuracy", "relevance", "toxicity", "coherence"
Required: Yes
- evaluator_version: str | None
Version of the evaluator.
Format: Semantic version string
Example: "v1.2.0"
Required: No
- score: float | int | bool | str
The evaluation score.
Types:
- float: Continuous scores (0.0-1.0, 0-100, etc.)
- int: Discrete scores (1-5, 0-10, etc.)
- bool: Binary pass/fail evaluations
- str: Categorical scores ("good", "bad", "excellent")
Examples: 0.85, 4, True, "excellent"
Required: No (some evaluators may only provide explanations)
- explanation: str | None
Human-readable explanation of the evaluation.
Example: "The response accurately answers the question with relevant supporting evidence and maintains appropriate tone."
Required: No
- confidence: float | None
Confidence level in the evaluation (0.0-1.0).
Range: 0.0 (no confidence) to 1.0 (full confidence)
Example: 0.92
Required: No
- criteria: Dict[str, Any] | None
Evaluation criteria and parameters used.
Structure: Evaluator-specific criteria
Example:
{
  "accuracy_threshold": 0.8,
  "check_citations": true,
  "reference_sources": ["wikipedia", "academic_papers"],
  "language": "en",
  "domain": "science"
}
Required: No
- metadata: Dict[str, Any] | None
Additional evaluation metadata.
Structure: Key-value pairs of contextual information
Example:
{
  "evaluator_model": "gpt-4",
  "evaluation_prompt_version": "v2.1",
  "reference_data_version": "2024-01-15",
  "human_annotator_id": "annotator_123",
  "evaluation_environment": "production"
}
Required: No
- timestamp: datetime
When the evaluation was performed.
Format: ISO 8601 timestamp
Example: "2024-01-15T10:30:45.123456Z"
Required: Auto-generated by SDK
- duration_ms: float | None
Time taken to perform the evaluation.
Unit: Milliseconds
Example: 1250.5
Required: No
- cost_usd: float | None
Cost to perform the evaluation (if applicable).
Unit: US Dollars
Example: 0.001
Required: No
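Putting the fields together, a complete evaluation record can be sketched as a plain dict. This is a minimal sketch mirroring the field names above; it does not call a specific SDK constructor, whose signature may differ by version.

# Minimal sketch of a full evaluation record as a plain dict.
# Field names mirror the Evaluation model above; this does not call
# a specific SDK constructor.
import uuid
from datetime import datetime, timezone

evaluation = {
    "evaluation_id": f"eval_{uuid.uuid4()}",  # auto-generated by the SDK in practice
    "target_event_id": "evt_01234567-89ab-cdef-0123-456789abcdef",
    "evaluator_name": "factual_accuracy",
    "evaluator_version": "v1.2.0",
    "score": 0.85,
    "explanation": "Accurate answer with relevant supporting evidence.",
    "confidence": 0.92,
    "criteria": {"accuracy_threshold": 0.8, "language": "en"},
    "metadata": {"evaluator_model": "gpt-4"},
    "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
    "duration_ms": 1250.5,
    "cost_usd": 0.001,
}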
Built-in Evaluator Models
Quality Evaluator:
- class QualityEvaluation
Evaluates overall response quality.
Inherits: All fields from Evaluation
Specific Fields:
- dimensions: Dict[str, float]
Individual quality dimensions.
Structure:
{
  "relevance": 0.85,
  "coherence": 0.92,
  "clarity": 0.78,
  "completeness": 0.88,
  "accuracy": 0.95
}
Example:
{
  "evaluation_id": "eval_quality_001",
  "evaluator_name": "quality",
  "score": 0.876,
  "explanation": "High quality response with good relevance and accuracy",
  "dimensions": {
    "relevance": 0.85,
    "coherence": 0.92,
    "clarity": 0.78,
    "completeness": 0.88,
    "accuracy": 0.95
  }
}
Factual Accuracy Evaluator:
- class FactualAccuracyEvaluation
Evaluates factual correctness of responses.
Specific Fields:
- factual_claims: List[Dict[str, Any]]
Individual factual claims and their verification.
Structure:
[
  {
    "claim": "Paris is the capital of France",
    "verified": true,
    "confidence": 0.99,
    "sources": ["wikipedia.org/Paris"]
  },
  {
    "claim": "The population is 12 million",
    "verified": false,
    "confidence": 0.85,
    "correct_value": "2.16 million",
    "sources": ["insee.fr"]
  }
]
Example:
{
  "evaluation_id": "eval_fact_001",
  "evaluator_name": "factual_accuracy",
  "score": 0.75,
  "explanation": "Most facts are correct but population figure is outdated",
  "factual_claims": [
    {
      "claim": "Paris is the capital of France",
      "verified": true,
      "confidence": 0.99
    }
  ],
  "citation_accuracy": 0.8,
  "hallucination_detected": false
}
Toxicity Evaluator:
- class ToxicityEvaluation
Evaluates content toxicity and safety.
Specific Fields:
- toxicity_categories: Dict[str, float]
Scores for different toxicity categories.
Structure:
{
  "hate_speech": 0.02,
  "harassment": 0.01,
  "violence": 0.03,
  "sexual_content": 0.00,
  "profanity": 0.05,
  "identity_attack": 0.01
}
Example:
{
  "evaluation_id": "eval_toxic_001",
  "evaluator_name": "toxicity",
  "score": 0.05,
  "explanation": "Content is safe with minimal profanity",
  "toxicity_categories": {
    "hate_speech": 0.02,
    "harassment": 0.01,
    "violence": 0.03,
    "profanity": 0.05
  },
  "overall_toxicity": 0.05,
  "content_warnings": []
}
Relevance Evaluator:
- class RelevanceEvaluation
Evaluates response relevance to the query.
Specific Fields: query_intent_match, topic_alignment, and information_completeness (illustrated in the example below).
Example:
{
  "evaluation_id": "eval_rel_001",
  "evaluator_name": "relevance",
  "score": 0.88,
  "explanation": "Response directly addresses the question with comprehensive information",
  "query_intent_match": 0.92,
  "topic_alignment": 0.85,
  "information_completeness": 0.87
}
Length Evaluator:
- class LengthEvaluation
Evaluates response length appropriateness.
Specific Fields:
- expected_length_range: Dict[str, int]
Expected length range.
Structure:
{
  "min_words": 50,
  "max_words": 200,
  "optimal_words": 125
}
- length_appropriateness: str
Assessment of length appropriateness.
Values: "too_short", "appropriate", "too_long"
Example:
{
  "evaluation_id": "eval_len_001",
  "evaluator_name": "length",
  "score": 0.9,
  "explanation": "Response length is appropriate for the question complexity",
  "character_count": 543,
  "word_count": 87,
  "sentence_count": 6,
  "expected_length_range": {
    "min_words": 50,
    "max_words": 150
  },
  "length_appropriateness": "appropriate"
}
Custom Evaluator Model
- class CustomEvaluation
Template for custom evaluators.
Inherits: All fields from Evaluation
Custom Fields:
Custom evaluators can add domain-specific fields:
# Example: Medical accuracy evaluator
{
    "evaluation_id": "eval_med_001",
    "evaluator_name": "medical_accuracy",
    "score": 0.85,
    "explanation": "Medically sound advice with appropriate caveats",
    "medical_fields": {
        "diagnosis_accuracy": 0.9,
        "treatment_appropriateness": 0.8,
        "contraindication_awareness": 0.95,
        "disclaimer_present": True
    },
    "risk_level": "low",
    "regulatory_compliance": True,
    "citations_medical_literature": 3
}
Multi-Evaluator Results
- class MultiEvaluationResult
Results from running multiple evaluators on the same target.
- evaluations: List[Evaluation]
List of individual evaluation results.
- summary_method: str | None
Method used for score aggregation.
Values: "weighted_average", "simple_average", "minimum", "custom"
- weights: Dict[str, float] | None
Weights used for aggregation.
Example:
{
  "factual_accuracy": 0.4,
  "relevance": 0.3,
  "quality": 0.2,
  "toxicity": 0.1
}
Example:
{
  "target_event_id": "evt_12345",
  "evaluations": [
    { "evaluator_name": "factual_accuracy", "score": 0.92 },
    { "evaluator_name": "relevance", "score": 0.88 },
    { "evaluator_name": "toxicity", "score": 0.05 }
  ],
  "summary_score": 0.89,
  "summary_method": "weighted_average",
  "weights": {
    "factual_accuracy": 0.5,
    "relevance": 0.3,
    "toxicity": 0.2
  }
}
Evaluation Batch Model
- class EvaluationBatch
Represents a batch of evaluations for efficient processing.
- evaluations: List[Evaluation]
List of evaluations in the batch.
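A minimal sketch of splitting evaluations into fixed-size batches for submission; the batch size of 100 is an arbitrary illustration:

# Chunk a list of evaluation dicts into fixed-size batches.
from typing import Any, Dict, Iterator, List

def batched(evaluations: List[Dict[str, Any]], size: int = 100) -> Iterator[List[Dict[str, Any]]]:
    for i in range(0, len(evaluations), size):
        yield evaluations[i:i + size]

for batch in batched([{"evaluation_id": f"eval_{i}"} for i in range(250)]):
    print(len(batch))  # 100, 100, 50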
Evaluation Configuration
- class EvaluationConfig
Configuration for evaluation runs.
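This document does not enumerate EvaluationConfig's fields. The dataclass below is purely hypothetical, sketching the kinds of options such a run configuration commonly carries (evaluator list, aggregation method, weights, timeout):

# Purely hypothetical sketch -- these field names are NOT documented
# EvaluationConfig fields, just plausible options for an evaluation run.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class EvaluationConfigSketch:
    evaluators: List[str] = field(default_factory=lambda: ["quality", "toxicity"])
    summary_method: str = "weighted_average"
    weights: Optional[Dict[str, float]] = None
    timeout_ms: float = 30_000.0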
Evaluation Metrics
Performance Metrics:
{
"evaluation_performance": {
"total_evaluations": 1000,
"successful_evaluations": 985,
"failed_evaluations": 15,
"success_rate": 0.985,
"average_duration_ms": 1250.5,
"p95_duration_ms": 3200.0,
"total_cost_usd": 0.15
}
}
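A sketch of deriving the success rate and p95 latency from raw per-evaluation results using the standard library; the durations below are made-up sample data, and the figures above are illustrative:

# Derive success rate and p95 latency from per-evaluation results.
import statistics

durations_ms = [900.0, 1100.0, 1250.5, 1500.0, 3200.0]  # sample data
total, failed = 1000, 15

success_rate = (total - failed) / total             # 0.985
p95 = statistics.quantiles(durations_ms, n=20)[-1]  # ~95th percentile
print(success_rate, p95)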
Quality Metrics:
{
"evaluation_quality": {
"evaluator_agreement": 0.82,
"inter_evaluator_correlation": {
"quality_vs_relevance": 0.75,
"factual_accuracy_vs_quality": 0.68
},
"confidence_distribution": {
"high_confidence": 0.65,
"medium_confidence": 0.25,
"low_confidence": 0.10
}
}
}
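Inter-evaluator correlation can be computed directly from paired scores. A sketch with statistics.correlation (Python 3.10+), using made-up score pairs:

# Pearson correlation between two evaluators' scores on the same targets.
# Score lists are made-up sample data; requires Python 3.10+.
from statistics import correlation

quality_scores = [0.80, 0.85, 0.60, 0.90, 0.75]
relevance_scores = [0.78, 0.88, 0.65, 0.85, 0.70]
print(round(correlation(quality_scores, relevance_scores), 2))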
Evaluation Serialization
JSON Format:
import json
from datetime import datetime, timezone

evaluation = {
    "evaluation_id": "eval_123",
    "target_event_id": "evt_456",
    "evaluator_name": "quality",
    "score": 0.85,
    # timezone-aware UTC timestamp rendered with a trailing "Z"
    # (datetime.utcnow() is deprecated in Python 3.12+)
    "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
    # ... other fields
}
json_data = json.dumps(evaluation, ensure_ascii=False, indent=2)
Pydantic Models:
from pydantic import BaseModel, Field
from typing import Optional, Dict, Any, Union
from datetime import datetime, timezone

class EvaluationModel(BaseModel):
    evaluation_id: str = Field(..., description="Unique evaluation identifier")
    target_event_id: str = Field(..., description="ID of evaluated event")
    evaluator_name: str = Field(..., description="Name of evaluator")
    score: Optional[Union[float, int, bool, str]] = Field(None, description="Evaluation score")
    explanation: Optional[str] = Field(None, description="Score explanation")
    confidence: Optional[float] = Field(None, ge=0.0, le=1.0, description="Confidence level")
    timestamp: datetime = Field(default_factory=lambda: datetime.now(timezone.utc), description="Evaluation timestamp")

    class Config:  # Pydantic v1-style configuration
        json_encoders = {
            datetime: lambda v: v.isoformat().replace("+00:00", "Z")
        }
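Usage of the model above (Pydantic v1 API, matching the Config-based encoder shown; Pydantic v2 would use model_dump_json instead):

# Instantiate and serialize the model defined above.
model = EvaluationModel(
    evaluation_id="eval_123",
    target_event_id="evt_456",
    evaluator_name="quality",
    score=0.85,
    confidence=0.92,
)
print(model.json(indent=2))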
Common Evaluation Patterns
A/B Testing Evaluations:
{
"evaluation_id": "eval_ab_001",
"evaluator_name": "ab_test_comparison",
"score": 0.65,
"explanation": "Variant A preferred in 65% of comparisons",
"comparison_data": {
"variant_a_event_id": "evt_variant_a",
"variant_b_event_id": "evt_variant_b",
"preference_score": 0.65,
"confidence_interval": [0.58, 0.72],
"sample_size": 1000,
"statistical_significance": true
}
}
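A sketch of the normal-approximation 95% confidence interval for the preference proportion; the interval in the example above is illustrative, not the output of this exact formula:

# 95% normal-approximation CI for a preference proportion.
# For p=0.65, n=1000 this gives roughly [0.62, 0.68].
import math

p, n, z = 0.65, 1000, 1.96
se = math.sqrt(p * (1 - p) / n)
ci = (round(p - z * se, 3), round(p + z * se, 3))
significant = ci[0] > 0.5  # does the interval exclude "no preference"?
print(ci, significant)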
Human vs AI Evaluations:
{
"evaluation_id": "eval_human_ai_001",
"evaluator_name": "human_ai_comparison",
"human_evaluation": {
"annotator_id": "human_123",
"score": 0.8,
"explanation": "Good response but could be more concise"
},
"ai_evaluation": {
"model": "gpt-4",
"score": 0.85,
"explanation": "High quality response with good structure"
},
"agreement_score": 0.92,
"discrepancy_analysis": "Minor difference in verbosity assessment"
}
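One simple convention for the agreement score is 1 minus the absolute score difference on a 0-1 scale. A sketch under that assumption; the example's 0.92 suggests the SDK may use a different or more nuanced formula:

# Agreement as 1 - |human - ai| -- one simple convention, not
# necessarily the formula behind the example's agreement_score.
human_score, ai_score = 0.8, 0.85
agreement = round(1 - abs(human_score - ai_score), 2)
print(agreement)  # 0.95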
Temporal Evaluations:
{
"evaluation_id": "eval_temporal_001",
"evaluator_name": "temporal_quality",
"current_score": 0.85,
"historical_scores": [0.78, 0.82, 0.85],
"trend": "improving",
"change_rate": 0.02,
"stability_score": 0.91
}
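A sketch deriving the trend and change rate from the historical series; here change rate is taken as the mean step-over-step delta, while the example's 0.02 is illustrative of a different window or formula:

# Trend and change rate from a score history.
historical = [0.78, 0.82, 0.85]
deltas = [b - a for a, b in zip(historical, historical[1:])]
change_rate = sum(deltas) / len(deltas)
trend = "improving" if change_rate > 0 else "declining" if change_rate < 0 else "stable"
print(trend, round(change_rate, 3))  # improving 0.035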
Best Practices
Evaluation Design Guidelines:
Clear Metrics: Define clear, measurable evaluation criteria
Consistent Scoring: Use consistent scoring scales across evaluators
Rich Context: Provide explanations and confidence scores
Error Handling: Gracefully handle evaluation failures
Performance: Optimize for evaluation speed and cost
Validation: Validate evaluator performance with ground truth data
Data Quality:
Score Normalization: Ensure scores are on consistent scales
Missing Data: Handle cases where evaluations cannot be performed
Confidence Reporting: Always report confidence when available
Metadata Capture: Include relevant evaluation context
Performance Optimization:
Batch Processing: Use batched evaluations for efficiency
Caching: Cache evaluation results for repeated queries
Parallel Execution: Run independent evaluators in parallel (see the sketch after this list)
Timeout Handling: Set appropriate timeouts for evaluators
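A sketch of running independent evaluators in parallel with the standard library; run_quality and run_toxicity are hypothetical placeholders, not SDK functions:

# Run independent evaluators concurrently with a thread pool.
# run_quality/run_toxicity are hypothetical placeholder evaluators.
from concurrent.futures import ThreadPoolExecutor

def run_quality(event_id: str) -> dict:
    return {"evaluator_name": "quality", "score": 0.85}

def run_toxicity(event_id: str) -> dict:
    return {"evaluator_name": "toxicity", "score": 0.05}

evaluators = [run_quality, run_toxicity]
with ThreadPoolExecutor(max_workers=len(evaluators)) as pool:
    futures = [pool.submit(ev, "evt_456") for ev in evaluators]
    results = [f.result() for f in futures]
print(results)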
See Also
Event Data Models - Event data models being evaluated
Span Data Models - Span data models and evaluation context
Evaluation Framework API Reference - Built-in and custom evaluators
Decorators API Reference - @evaluate decorator for automatic evaluation