Evaluation Data Models
======================

.. note::

   **Technical specification for HoneyHive evaluation data structures**

This document defines the exact data models and formats used for evaluations in the HoneyHive SDK. Evaluations assess the quality, accuracy, and performance of LLM outputs using various metrics and criteria.

Core Evaluation Model
---------------------

.. py:class:: Evaluation

   The primary evaluation data structure.

   .. py:attribute:: evaluation_id
      :type: str

      Unique identifier for the evaluation.

      **Format**: UUID v4 string

      **Example**: ``"eval_01234567-89ab-cdef-0123-456789abcdef"``

      **Required**: Auto-generated by SDK

   .. py:attribute:: target_event_id
      :type: str

      ID of the event being evaluated.

      **Format**: UUID v4 string

      **Example**: ``"evt_01234567-89ab-cdef-0123-456789abcdef"``

      **Required**: Yes

   .. py:attribute:: evaluator_name
      :type: str

      Name of the evaluator used.

      **Examples**: ``"factual_accuracy"``, ``"relevance"``, ``"toxicity"``, ``"coherence"``

      **Required**: Yes

   .. py:attribute:: evaluator_version
      :type: Optional[str]

      Version of the evaluator.

      **Format**: Semantic version string

      **Example**: ``"v1.2.0"``

      **Required**: No

   .. py:attribute:: score
      :type: Union[float, int, bool, str]

      The evaluation score.

      **Types**:

      - ``float``: Continuous scores (0.0-1.0, 0-100, etc.)
      - ``int``: Discrete scores (1-5, 0-10, etc.)
      - ``bool``: Binary pass/fail evaluations
      - ``str``: Categorical scores ("good", "bad", "excellent")

      **Examples**: ``0.85``, ``4``, ``True``, ``"excellent"``

      **Required**: No (some evaluators may only provide explanations)

   .. py:attribute:: explanation
      :type: Optional[str]

      Human-readable explanation of the evaluation.

      **Example**: ``"The response accurately answers the question with relevant supporting evidence and maintains appropriate tone."``

      **Required**: No

   .. py:attribute:: confidence
      :type: Optional[float]

      Confidence level in the evaluation (0.0-1.0).
      **Range**: 0.0 (no confidence) to 1.0 (full confidence)

      **Example**: ``0.92``

      **Required**: No

   .. py:attribute:: criteria
      :type: Optional[Dict[str, Any]]

      Evaluation criteria and parameters used.

      **Structure**: Evaluator-specific criteria

      **Example**:

      .. code-block:: json

         {
            "accuracy_threshold": 0.8,
            "check_citations": true,
            "reference_sources": ["wikipedia", "academic_papers"],
            "language": "en",
            "domain": "science"
         }

      **Required**: No

   .. py:attribute:: metadata
      :type: Optional[Dict[str, Any]]

      Additional evaluation metadata.

      **Structure**: Key-value pairs of contextual information

      **Example**:

      .. code-block:: json

         {
            "evaluator_model": "gpt-4",
            "evaluation_prompt_version": "v2.1",
            "reference_data_version": "2024-01-15",
            "human_annotator_id": "annotator_123",
            "evaluation_environment": "production"
         }

      **Required**: No

   .. py:attribute:: timestamp
      :type: datetime

      When the evaluation was performed.

      **Format**: ISO 8601 timestamp

      **Example**: ``"2024-01-15T10:30:45.123456Z"``

      **Required**: Auto-generated by SDK

   .. py:attribute:: duration_ms
      :type: Optional[float]

      Time taken to perform the evaluation.

      **Unit**: Milliseconds

      **Example**: ``1250.5``

      **Required**: No

   .. py:attribute:: cost_usd
      :type: Optional[float]

      Cost to perform the evaluation (if applicable).

      **Unit**: US dollars

      **Example**: ``0.001``

      **Required**: No

   .. py:attribute:: status
      :type: str

      Evaluation completion status.

      **Values**:

      - ``"completed"`` - Evaluation finished successfully
      - ``"failed"`` - Evaluation failed due to an error
      - ``"skipped"`` - Evaluation was skipped
      - ``"pending"`` - Evaluation is in progress

      **Required**: Auto-determined by SDK

   .. py:attribute:: error
      :type: Optional[Dict[str, Any]]

      Error information if the evaluation failed.
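Downstream code typically branches on ``status`` before trusting ``score`` or reading ``error``. A minimal sketch of that pattern, working with plain dictionaries shaped like the fields above (the ``summarize_evaluation`` helper is hypothetical, not part of the SDK):

.. code-block:: python

   # Hypothetical helper illustrating how a consumer might branch on
   # evaluation status before trusting the score or reading the error.
   def summarize_evaluation(evaluation: dict) -> str:
       """Return a one-line summary of an evaluation result dict."""
       status = evaluation.get("status", "pending")
       if status == "completed":
           return f"{evaluation['evaluator_name']}: score={evaluation.get('score')}"
       if status == "failed":
           err = evaluation.get("error") or {}
           return f"{evaluation['evaluator_name']}: failed ({err.get('code', 'unknown')})"
       return f"{evaluation['evaluator_name']}: {status}"

   ok = {"evaluator_name": "quality", "status": "completed", "score": 0.85}
   bad = {
       "evaluator_name": "factual_accuracy",
       "status": "failed",
       "error": {"type": "EvaluationError", "code": "missing_reference"},
   }
   print(summarize_evaluation(ok))   # quality: score=0.85
   print(summarize_evaluation(bad))  # factual_accuracy: failed (missing_reference)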
      **Structure**:

      .. code-block:: json

         {
            "type": "EvaluationError",
            "message": "Reference data not found",
            "code": "missing_reference",
            "context": {
               "evaluator": "factual_accuracy",
               "reference_id": "ref_123"
            }
         }

      **Required**: No (only for failed evaluations)

Built-in Evaluator Models
-------------------------

**Quality Evaluator**:

.. py:class:: QualityEvaluation

   Evaluates overall response quality.

   **Inherits**: All fields from :py:class:`Evaluation`

   **Specific Fields**:

   .. py:attribute:: dimensions
      :type: Dict[str, float]

      Individual quality dimensions.

      **Structure**:

      .. code-block:: json

         {
            "relevance": 0.85,
            "coherence": 0.92,
            "clarity": 0.78,
            "completeness": 0.88,
            "accuracy": 0.95
         }

   **Example**:

   .. code-block:: json

      {
         "evaluation_id": "eval_quality_001",
         "evaluator_name": "quality",
         "score": 0.876,
         "explanation": "High quality response with good relevance and accuracy",
         "dimensions": {
            "relevance": 0.85,
            "coherence": 0.92,
            "clarity": 0.78,
            "completeness": 0.88,
            "accuracy": 0.95
         }
      }

**Factual Accuracy Evaluator**:

.. py:class:: FactualAccuracyEvaluation

   Evaluates factual correctness of responses.

   **Specific Fields**:

   .. py:attribute:: factual_claims
      :type: List[Dict[str, Any]]

      Individual factual claims and their verification.

      **Structure**:

      .. code-block:: json

         [
            {
               "claim": "Paris is the capital of France",
               "verified": true,
               "confidence": 0.99,
               "sources": ["wikipedia.org/Paris"]
            },
            {
               "claim": "The population is 12 million",
               "verified": false,
               "confidence": 0.85,
               "correct_value": "2.16 million",
               "sources": ["insee.fr"]
            }
         ]

   .. py:attribute:: citation_accuracy
      :type: Optional[float]

      Accuracy of citations and references (0.0-1.0).

   .. py:attribute:: hallucination_detected
      :type: bool

      Whether hallucinations were detected.
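An overall factual-accuracy score is typically derived from the individual claim verdicts. One simple derivation, the fraction of verified claims, is sketched below; this rule is an illustrative assumption, not the SDK's documented formula:

.. code-block:: python

   # Illustrative aggregation: score a factual-accuracy evaluation as the
   # fraction of claims that were verified. The built-in evaluator may
   # weight claims differently; this is a demonstration only.
   def aggregate_claims(factual_claims: list) -> dict:
       verified = sum(1 for c in factual_claims if c.get("verified"))
       total = len(factual_claims)
       return {
           "score": verified / total if total else None,
           "hallucination_detected": any(not c.get("verified") for c in factual_claims),
       }

   claims = [
       {"claim": "Paris is the capital of France", "verified": True},
       {"claim": "The population is 12 million", "verified": False},
   ]
   result = aggregate_claims(claims)
   print(result)  # {'score': 0.5, 'hallucination_detected': True}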
   **Example**:

   .. code-block:: json

      {
         "evaluation_id": "eval_fact_001",
         "evaluator_name": "factual_accuracy",
         "score": 0.75,
         "explanation": "Most facts are correct but population figure is outdated",
         "factual_claims": [
            {
               "claim": "Paris is the capital of France",
               "verified": true,
               "confidence": 0.99
            }
         ],
         "citation_accuracy": 0.8,
         "hallucination_detected": false
      }

**Toxicity Evaluator**:

.. py:class:: ToxicityEvaluation

   Evaluates content toxicity and safety.

   **Specific Fields**:

   .. py:attribute:: toxicity_categories
      :type: Dict[str, float]

      Scores for different toxicity categories.

      **Structure**:

      .. code-block:: json

         {
            "hate_speech": 0.02,
            "harassment": 0.01,
            "violence": 0.03,
            "sexual_content": 0.00,
            "profanity": 0.05,
            "identity_attack": 0.01
         }

   .. py:attribute:: overall_toxicity
      :type: float

      Overall toxicity score (0.0-1.0).

   .. py:attribute:: content_warnings
      :type: List[str]

      Specific content warnings if applicable.

   **Example**:

   .. code-block:: json

      {
         "evaluation_id": "eval_toxic_001",
         "evaluator_name": "toxicity",
         "score": 0.05,
         "explanation": "Content is safe with minimal profanity",
         "toxicity_categories": {
            "hate_speech": 0.02,
            "harassment": 0.01,
            "violence": 0.03,
            "profanity": 0.05
         },
         "overall_toxicity": 0.05,
         "content_warnings": []
      }

**Relevance Evaluator**:

.. py:class:: RelevanceEvaluation

   Evaluates response relevance to the query.

   **Specific Fields**:

   .. py:attribute:: query_intent_match
      :type: float

      How well the response matches the query intent (0.0-1.0).

   .. py:attribute:: topic_alignment
      :type: float

      Alignment with the expected topic (0.0-1.0).

   .. py:attribute:: information_completeness
      :type: float

      Completeness of information provided (0.0-1.0).

   **Example**:

   .. code-block:: json

      {
         "evaluation_id": "eval_rel_001",
         "evaluator_name": "relevance",
         "score": 0.88,
         "explanation": "Response directly addresses the question with comprehensive information",
         "query_intent_match": 0.92,
         "topic_alignment": 0.85,
         "information_completeness": 0.87
      }
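Consumers of toxicity evaluations often gate content on the per-category scores rather than on ``overall_toxicity`` alone. A small sketch of that pattern (the 0.5 threshold is an arbitrary illustration, not an SDK default):

.. code-block:: python

   # Flag any toxicity category whose score exceeds a threshold.
   # The threshold value here is an arbitrary illustration.
   def flag_categories(toxicity_categories: dict, threshold: float = 0.5) -> list:
       return sorted(
           name for name, score in toxicity_categories.items() if score > threshold
       )

   categories = {"hate_speech": 0.02, "profanity": 0.55, "violence": 0.03}
   print(flag_categories(categories))  # ['profanity']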
**Length Evaluator**:

.. py:class:: LengthEvaluation

   Evaluates response length appropriateness.

   **Specific Fields**:

   .. py:attribute:: character_count
      :type: int

      Number of characters in the response.

   .. py:attribute:: word_count
      :type: int

      Number of words in the response.

   .. py:attribute:: sentence_count
      :type: int

      Number of sentences in the response.

   .. py:attribute:: expected_length_range
      :type: Dict[str, int]

      Expected length range.

      **Structure**:

      .. code-block:: json

         {
            "min_words": 50,
            "max_words": 200,
            "optimal_words": 125
         }

   .. py:attribute:: length_appropriateness
      :type: str

      Assessment of length appropriateness.

      **Values**: ``"too_short"``, ``"appropriate"``, ``"too_long"``

   **Example**:

   .. code-block:: json

      {
         "evaluation_id": "eval_len_001",
         "evaluator_name": "length",
         "score": 0.9,
         "explanation": "Response length is appropriate for the question complexity",
         "character_count": 543,
         "word_count": 87,
         "sentence_count": 6,
         "expected_length_range": {
            "min_words": 50,
            "max_words": 150
         },
         "length_appropriateness": "appropriate"
      }

Custom Evaluator Model
----------------------

.. py:class:: CustomEvaluation

   Template for custom evaluators.

   **Inherits**: All fields from :py:class:`Evaluation`

   **Custom Fields**: Custom evaluators can add domain-specific fields:

   .. code-block:: python

      # Example: Medical accuracy evaluator
      {
          "evaluation_id": "eval_med_001",
          "evaluator_name": "medical_accuracy",
          "score": 0.85,
          "explanation": "Medically sound advice with appropriate caveats",
          "medical_fields": {
              "diagnosis_accuracy": 0.9,
              "treatment_appropriateness": 0.8,
              "contraindication_awareness": 0.95,
              "disclaimer_present": True
          },
          "risk_level": "low",
          "regulatory_compliance": True,
          "citations_medical_literature": 3
      }

Multi-Evaluator Results
-----------------------

.. py:class:: MultiEvaluationResult

   Results from running multiple evaluators on the same target.

   .. py:attribute:: target_event_id
      :type: str

      ID of the evaluated event.
   .. py:attribute:: evaluations
      :type: List[Evaluation]

      List of individual evaluation results.

   .. py:attribute:: summary_score
      :type: Optional[float]

      Aggregated score across all evaluations.

   .. py:attribute:: summary_method
      :type: Optional[str]

      Method used for score aggregation.

      **Values**: ``"weighted_average"``, ``"simple_average"``, ``"minimum"``, ``"custom"``

   .. py:attribute:: weights
      :type: Optional[Dict[str, float]]

      Weights used for aggregation.

      **Example**:

      .. code-block:: json

         {
            "factual_accuracy": 0.4,
            "relevance": 0.3,
            "quality": 0.2,
            "toxicity": 0.1
         }

   **Example**:

   .. code-block:: json

      {
         "target_event_id": "evt_12345",
         "evaluations": [
            {"evaluator_name": "factual_accuracy", "score": 0.92},
            {"evaluator_name": "relevance", "score": 0.88},
            {"evaluator_name": "toxicity", "score": 0.05}
         ],
         "summary_score": 0.89,
         "summary_method": "weighted_average",
         "weights": {
            "factual_accuracy": 0.5,
            "relevance": 0.3,
            "toxicity": 0.2
         }
      }

Evaluation Batch Model
----------------------

.. py:class:: EvaluationBatch

   Represents a batch of evaluations for efficient processing.

   .. py:attribute:: batch_id
      :type: str

      Unique identifier for the batch.

   .. py:attribute:: evaluations
      :type: List[Evaluation]

      List of evaluations in the batch.

   .. py:attribute:: batch_metadata
      :type: Dict[str, Any]

      Batch-level metadata.

      **Example**:

      .. code-block:: json

         {
            "batch_size": 100,
            "created_at": "2024-01-15T10:30:45Z",
            "evaluator_versions": {
               "quality": "v1.2.0",
               "factual_accuracy": "v2.1.0"
            },
            "processing_time_ms": 5432.1
         }

   .. py:attribute:: status
      :type: str

      Batch processing status.

      **Values**: ``"pending"``, ``"processing"``, ``"completed"``, ``"failed"``

Evaluation Configuration
------------------------

.. py:class:: EvaluationConfig

   Configuration for evaluation runs.

   .. py:attribute:: evaluators
      :type: List[str]

      List of evaluator names to run.

   .. py:attribute:: evaluator_configs
      :type: Dict[str, Dict[str, Any]]

      Per-evaluator configuration.
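A common pitfall is a configuration key that does not match any entry in ``evaluators``, so its settings are silently ignored. A minimal consistency check (the ``check_evaluator_configs`` helper is hypothetical, not part of the SDK):

.. code-block:: python

   # Hypothetical validation helper: every key in evaluator_configs should
   # correspond to an entry in the evaluators list.
   def check_evaluator_configs(evaluators: list, evaluator_configs: dict) -> list:
       """Return config keys that do not match any configured evaluator."""
       return sorted(set(evaluator_configs) - set(evaluators))

   evaluators = ["factual_accuracy", "toxicity"]
   configs = {"factual_accuracy": {"confidence_threshold": 0.8}, "quality": {}}
   print(check_evaluator_configs(evaluators, configs))  # ['quality']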
      **Example**:

      .. code-block:: json

         {
            "factual_accuracy": {
               "reference_sources": ["wikipedia", "scholarly"],
               "confidence_threshold": 0.8
            },
            "toxicity": {
               "categories": ["hate_speech", "harassment"],
               "threshold": 0.1
            }
         }

   .. py:attribute:: parallel_execution
      :type: bool

      Whether to run evaluators in parallel.

   .. py:attribute:: timeout_ms
      :type: Optional[int]

      Timeout for evaluation execution.

   .. py:attribute:: retry_config
      :type: Optional[Dict[str, Any]]

      Retry configuration for failed evaluations.

      **Example**:

      .. code-block:: json

         {
            "max_retries": 3,
            "backoff_multiplier": 2.0,
            "initial_delay_ms": 1000
         }

Evaluation Metrics
------------------

**Performance Metrics**:

.. code-block:: json

   {
      "evaluation_performance": {
         "total_evaluations": 1000,
         "successful_evaluations": 985,
         "failed_evaluations": 15,
         "success_rate": 0.985,
         "average_duration_ms": 1250.5,
         "p95_duration_ms": 3200.0,
         "total_cost_usd": 0.15
      }
   }

**Quality Metrics**:

.. code-block:: json

   {
      "evaluation_quality": {
         "evaluator_agreement": 0.82,
         "inter_evaluator_correlation": {
            "quality_vs_relevance": 0.75,
            "factual_accuracy_vs_quality": 0.68
         },
         "confidence_distribution": {
            "high_confidence": 0.65,
            "medium_confidence": 0.25,
            "low_confidence": 0.10
         }
      }
   }

Evaluation Serialization
------------------------

**JSON Format**:

.. code-block:: python

   import json
   from datetime import datetime

   evaluation = {
       "evaluation_id": "eval_123",
       "target_event_id": "evt_456",
       "evaluator_name": "quality",
       "score": 0.85,
       "timestamp": datetime.utcnow().isoformat() + "Z",
       # ... other fields
   }

   json_data = json.dumps(evaluation, ensure_ascii=False, indent=2)
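Deserializing a payload like this restores plain JSON types only: the timestamp comes back as a string and must be parsed explicitly. A short round-trip sketch:

.. code-block:: python

   import json
   from datetime import datetime

   evaluation = {
       "evaluation_id": "eval_123",
       "score": 0.85,
       "timestamp": "2024-01-15T10:30:45.123456Z",
   }

   # Round-trip through JSON: the structure survives, but the timestamp
   # stays a string rather than becoming a datetime.
   restored = json.loads(json.dumps(evaluation))
   print(restored["timestamp"])  # 2024-01-15T10:30:45.123456Z

   # Parse the ISO 8601 string into a timezone-aware datetime.
   # fromisoformat() accepts a trailing "Z" only on Python 3.11+,
   # so it is replaced with "+00:00" here for portability.
   ts = datetime.fromisoformat(restored["timestamp"].replace("Z", "+00:00"))
   print(ts.tzinfo)  # UTC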
**Pydantic Models**:

.. code-block:: python

   from pydantic import BaseModel, Field
   from typing import Optional, Dict, Any, Union
   from datetime import datetime


   class EvaluationModel(BaseModel):
       evaluation_id: str = Field(..., description="Unique evaluation identifier")
       target_event_id: str = Field(..., description="ID of evaluated event")
       evaluator_name: str = Field(..., description="Name of evaluator")
       score: Optional[Union[float, int, bool, str]] = Field(None, description="Evaluation score")
       explanation: Optional[str] = Field(None, description="Score explanation")
       confidence: Optional[float] = Field(None, ge=0.0, le=1.0, description="Confidence level")
       timestamp: datetime = Field(default_factory=datetime.utcnow, description="Evaluation timestamp")

       class Config:
           json_encoders = {
               datetime: lambda v: v.isoformat() + "Z"
           }

Common Evaluation Patterns
--------------------------

**A/B Testing Evaluations**:

.. code-block:: json

   {
      "evaluation_id": "eval_ab_001",
      "evaluator_name": "ab_test_comparison",
      "score": 0.65,
      "explanation": "Variant A preferred in 65% of comparisons",
      "comparison_data": {
         "variant_a_event_id": "evt_variant_a",
         "variant_b_event_id": "evt_variant_b",
         "preference_score": 0.65,
         "confidence_interval": [0.58, 0.72],
         "sample_size": 1000,
         "statistical_significance": true
      }
   }

**Human vs AI Evaluations**:

.. code-block:: json

   {
      "evaluation_id": "eval_human_ai_001",
      "evaluator_name": "human_ai_comparison",
      "human_evaluation": {
         "annotator_id": "human_123",
         "score": 0.8,
         "explanation": "Good response but could be more concise"
      },
      "ai_evaluation": {
         "model": "gpt-4",
         "score": 0.85,
         "explanation": "High quality response with good structure"
      },
      "agreement_score": 0.92,
      "discrepancy_analysis": "Minor difference in verbosity assessment"
   }
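An ``agreement_score`` like the one in the human-vs-AI pattern can be computed in several ways. One simple convention, one minus the absolute score difference, is sketched below; this is an illustrative assumption, not the SDK's documented definition:

.. code-block:: python

   # Illustrative agreement metric: 1 - |human - ai|, clamped to [0, 1].
   # This is one common convention, not the SDK's documented formula.
   def agreement_score(human_score: float, ai_score: float) -> float:
       return round(max(0.0, 1.0 - abs(human_score - ai_score)), 4)

   print(agreement_score(0.8, 0.85))  # 0.95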
**Temporal Evaluations**:

.. code-block:: json

   {
      "evaluation_id": "eval_temporal_001",
      "evaluator_name": "temporal_quality",
      "current_score": 0.85,
      "historical_scores": [0.78, 0.82, 0.85],
      "trend": "improving",
      "change_rate": 0.02,
      "stability_score": 0.91
   }

Best Practices
--------------

**Evaluation Design Guidelines**:

1. **Clear Metrics**: Define clear, measurable evaluation criteria
2. **Consistent Scoring**: Use consistent scoring scales across evaluators
3. **Rich Context**: Provide explanations and confidence scores
4. **Error Handling**: Gracefully handle evaluation failures
5. **Performance**: Optimize for evaluation speed and cost
6. **Validation**: Validate evaluator performance with ground truth data

**Data Quality**:

1. **Score Normalization**: Ensure scores are on consistent scales
2. **Missing Data**: Handle cases where evaluations cannot be performed
3. **Confidence Reporting**: Always report confidence when available
4. **Metadata Capture**: Include relevant evaluation context

**Performance Optimization**:

1. **Batch Processing**: Use batched evaluations for efficiency
2. **Caching**: Cache evaluation results for repeated queries
3. **Parallel Execution**: Run independent evaluators in parallel
4. **Timeout Handling**: Set appropriate timeouts for evaluators

See Also
--------

- :doc:`events` - Event data models being evaluated
- :doc:`spans` - Span data models and evaluation context
- :doc:`../evaluation/evaluators` - Built-in and custom evaluators
- :doc:`../api/decorators` - ``@evaluate`` decorator for automatic evaluation