Evaluation Data Models

Note

Technical specification for HoneyHive evaluation data structures

This document defines the exact data models and formats used for evaluations in the HoneyHive SDK.

Evaluations assess the quality, accuracy, and performance of LLM outputs using various metrics and criteria.

Core Evaluation Model

class Evaluation

The primary evaluation data structure.

evaluation_id: str

Unique identifier for the evaluation.

Format: UUID v4 string with "eval_" prefix
Example: "eval_01234567-89ab-cdef-0123-456789abcdef"
Required: Auto-generated by SDK

target_event_id: str

ID of the event being evaluated.

Format: UUID v4 string with "evt_" prefix
Example: "evt_01234567-89ab-cdef-0123-456789abcdef"
Required: Yes

evaluator_name: str

Name of the evaluator used.

Examples: "factual_accuracy", "relevance", "toxicity", "coherence"
Required: Yes

evaluator_version: str | None

Version of the evaluator.

Format: Semantic version string
Example: "v1.2.0"
Required: No

score: float | int | bool | str

The evaluation score.

Types:

- float: Continuous scores (0.0-1.0, 0-100, etc.)
- int: Discrete scores (1-5, 0-10, etc.)
- bool: Binary pass/fail evaluations
- str: Categorical scores ("good", "bad", "excellent")

Examples: 0.85, 4, True, "excellent"
Required: No (some evaluators may only provide explanations)
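Because `score` is a union type, code that aggregates or compares scores usually coerces them to floats first. A minimal sketch; the categorical mapping below is a hypothetical example and should be defined per evaluator, not taken from the SDK:

```python
def normalize_score(score, categorical_map=None):
    """Coerce a float/int/bool/str score into a float for aggregation."""
    if isinstance(score, bool):  # check bool before int: bool is an int subclass
        return 1.0 if score else 0.0
    if isinstance(score, (int, float)):
        return float(score)
    if isinstance(score, str):
        # Hypothetical mapping for categorical scores; define per evaluator.
        mapping = categorical_map or {"bad": 0.0, "good": 0.5, "excellent": 1.0}
        return mapping[score]
    raise TypeError(f"Unsupported score type: {type(score).__name__}")
```

Note the `bool` check must precede the numeric check, since `True` would otherwise be converted as the integer `1`.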

explanation: str | None

Human-readable explanation of the evaluation.

Example: "The response accurately answers the question with relevant supporting evidence and maintains appropriate tone."
Required: No

confidence: float | None

Confidence level in the evaluation (0.0-1.0).

Range: 0.0 (no confidence) to 1.0 (full confidence)
Example: 0.92
Required: No

criteria: Dict[str, Any] | None

Evaluation criteria and parameters used.

Structure: Evaluator-specific criteria
Example:

{
  "accuracy_threshold": 0.8,
  "check_citations": true,
  "reference_sources": ["wikipedia", "academic_papers"],
  "language": "en",
  "domain": "science"
}

Required: No

metadata: Dict[str, Any] | None

Additional evaluation metadata.

Structure: Key-value pairs of contextual information
Example:

{
  "evaluator_model": "gpt-4",
  "evaluation_prompt_version": "v2.1",
  "reference_data_version": "2024-01-15",
  "human_annotator_id": "annotator_123",
  "evaluation_environment": "production"
}

Required: No

timestamp: datetime

When the evaluation was performed.

Format: ISO 8601 timestamp
Example: "2024-01-15T10:30:45.123456Z"
Required: Auto-generated by SDK

duration_ms: float | None

Time taken to perform the evaluation.

Unit: Milliseconds
Example: 1250.5
Required: No

cost_usd: float | None

Cost to perform the evaluation (if applicable).

Unit: US Dollars
Example: 0.001
Required: No

status: str

Evaluation completion status.

Values:

- "completed" - Evaluation finished successfully
- "failed" - Evaluation failed due to error
- "skipped" - Evaluation was skipped
- "pending" - Evaluation is in progress

Required: Auto-determined by SDK

error: Dict[str, Any] | None

Error information if evaluation failed.

Structure:

{
  "type": "EvaluationError",
  "message": "Reference data not found",
  "code": "missing_reference",
  "context": {
    "evaluator": "factual_accuracy",
    "reference_id": "ref_123"
  }
}

Required: No (only for failed evaluations)
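Put together, a completed evaluation record built from the fields above might look like the following. This is a sketch: the auto-generated values are illustrative, and the `uuid`-based ID construction is an assumption rather than the SDK's documented behavior:

```python
import uuid
from datetime import datetime, timezone

evaluation = {
    "evaluation_id": f"eval_{uuid.uuid4()}",  # auto-generated by the SDK
    "target_event_id": "evt_01234567-89ab-cdef-0123-456789abcdef",
    "evaluator_name": "factual_accuracy",
    "evaluator_version": "v1.2.0",
    "score": 0.85,
    "explanation": "The response accurately answers the question.",
    "confidence": 0.92,
    # ISO 8601 UTC timestamp with trailing "Z", as in the format above.
    "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
    "duration_ms": 1250.5,
    "status": "completed",
}
```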

Built-in Evaluator Models

Quality Evaluator:

class QualityEvaluation

Evaluates overall response quality.

Inherits: All fields from Evaluation

Specific Fields:

dimensions: Dict[str, float]

Individual quality dimensions.

Structure:

{
  "relevance": 0.85,
  "coherence": 0.92,
  "clarity": 0.78,
  "completeness": 0.88,
  "accuracy": 0.95
}

Example:

{
  "evaluation_id": "eval_quality_001",
  "evaluator_name": "quality",
  "score": 0.876,
  "explanation": "High quality response with good relevance and accuracy",
  "dimensions": {
    "relevance": 0.85,
    "coherence": 0.92,
    "clarity": 0.78,
    "completeness": 0.88,
    "accuracy": 0.95
  }
}
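In the example above, the overall `score` of 0.876 is exactly the simple average of the five dimension scores. A sketch assuming an unweighted mean; the SDK's actual aggregation may weight dimensions differently:

```python
dimensions = {
    "relevance": 0.85,
    "coherence": 0.92,
    "clarity": 0.78,
    "completeness": 0.88,
    "accuracy": 0.95,
}

# Unweighted mean of the dimension scores, rounded to 3 decimal places.
score = round(sum(dimensions.values()) / len(dimensions), 3)
# score == 0.876, matching the example above
```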

Factual Accuracy Evaluator:

class FactualAccuracyEvaluation

Evaluates factual correctness of responses.

Specific Fields:

factual_claims: List[Dict[str, Any]]

Individual factual claims and their verification.

Structure:

[
  {
    "claim": "Paris is the capital of France",
    "verified": true,
    "confidence": 0.99,
    "sources": ["wikipedia.org/Paris"]
  },
  {
    "claim": "The population is 12 million",
    "verified": false,
    "confidence": 0.85,
    "correct_value": "2.16 million",
    "sources": ["insee.fr"]
  }
]

citation_accuracy: float | None

Accuracy of citations and references (0.0-1.0).

hallucination_detected: bool

Whether hallucinations were detected.

Example:

{
  "evaluation_id": "eval_fact_001",
  "evaluator_name": "factual_accuracy",
  "score": 0.75,
  "explanation": "Most facts are correct but population figure is outdated",
  "factual_claims": [
    {
      "claim": "Paris is the capital of France",
      "verified": true,
      "confidence": 0.99
    }
  ],
  "citation_accuracy": 0.8,
  "hallucination_detected": false
}
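One common way to derive claim-level fields like these is the fraction of verified claims, with hallucinations flagged when a claim fails verification at high confidence. A sketch, not the SDK's built-in aggregation:

```python
claims = [
    {"claim": "Paris is the capital of France", "verified": True, "confidence": 0.99},
    {"claim": "The population is 12 million", "verified": False, "confidence": 0.85},
]

# Fraction of claims that passed verification.
verified_fraction = sum(c["verified"] for c in claims) / len(claims)

# Flag a hallucination when a claim is confidently unverified
# (the 0.8 threshold here is an illustrative assumption).
hallucination_detected = any(
    not c["verified"] and c["confidence"] >= 0.8 for c in claims
)
# verified_fraction == 0.5, hallucination_detected == True
```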

Toxicity Evaluator:

class ToxicityEvaluation

Evaluates content toxicity and safety.

Specific Fields:

toxicity_categories: Dict[str, float]

Scores for different toxicity categories.

Structure:

{
  "hate_speech": 0.02,
  "harassment": 0.01,
  "violence": 0.03,
  "sexual_content": 0.00,
  "profanity": 0.05,
  "identity_attack": 0.01
}

overall_toxicity: float

Overall toxicity score (0.0-1.0).

content_warnings: List[str]

Specific content warnings if applicable.

Example:

{
  "evaluation_id": "eval_toxic_001",
  "evaluator_name": "toxicity",
  "score": 0.05,
  "explanation": "Content is safe with minimal profanity",
  "toxicity_categories": {
    "hate_speech": 0.02,
    "harassment": 0.01,
    "violence": 0.03,
    "profanity": 0.05
  },
  "overall_toxicity": 0.05,
  "content_warnings": []
}
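In this example the overall toxicity equals the highest category score (profanity at 0.05). Taking the max over categories and flagging those above a threshold is one plausible derivation; the SDK does not guarantee max aggregation, so treat this as a sketch:

```python
toxicity_categories = {
    "hate_speech": 0.02,
    "harassment": 0.01,
    "violence": 0.03,
    "profanity": 0.05,
}

# Overall toxicity as the worst (highest) category score.
overall_toxicity = max(toxicity_categories.values())

# Warn on categories exceeding a configured threshold (0.1 here,
# matching the EvaluationConfig example later in this document).
content_warnings = [k for k, v in toxicity_categories.items() if v > 0.1]
# overall_toxicity == 0.05, content_warnings == []
```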

Relevance Evaluator:

class RelevanceEvaluation

Evaluates response relevance to the query.

Specific Fields:

query_intent_match: float

How well the response matches the query intent (0.0-1.0).

topic_alignment: float

Alignment with the expected topic (0.0-1.0).

information_completeness: float

Completeness of information provided (0.0-1.0).

Example:

{
  "evaluation_id": "eval_rel_001",
  "evaluator_name": "relevance",
  "score": 0.88,
  "explanation": "Response directly addresses the question with comprehensive information",
  "query_intent_match": 0.92,
  "topic_alignment": 0.85,
  "information_completeness": 0.87
}

Length Evaluator:

class LengthEvaluation

Evaluates response length appropriateness.

Specific Fields:

character_count: int

Number of characters in the response.

word_count: int

Number of words in the response.

sentence_count: int

Number of sentences in the response.

expected_length_range: Dict[str, int]

Expected length range.

Structure:

{
  "min_words": 50,
  "max_words": 200,
  "optimal_words": 125
}

length_appropriateness: str

Assessment of length appropriateness.

Values: "too_short", "appropriate", "too_long"

Example:

{
  "evaluation_id": "eval_len_001",
  "evaluator_name": "length",
  "score": 0.9,
  "explanation": "Response length is appropriate for the question complexity",
  "character_count": 543,
  "word_count": 87,
  "sentence_count": 6,
  "expected_length_range": {
    "min_words": 50,
    "max_words": 150
  },
  "length_appropriateness": "appropriate"
}
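The count fields can be derived directly from the response text. A rough sketch using whitespace word splitting and naive punctuation-based sentence segmentation; a real evaluator would use proper tokenization:

```python
import re

def length_fields(text, min_words=50, max_words=150):
    """Derive length-evaluation fields from raw response text."""
    words = text.split()
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(words) < min_words:
        appropriateness = "too_short"
    elif len(words) > max_words:
        appropriateness = "too_long"
    else:
        appropriateness = "appropriate"
    return {
        "character_count": len(text),
        "word_count": len(words),
        "sentence_count": len(sentences),
        "length_appropriateness": appropriateness,
    }
```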

Custom Evaluator Model

class CustomEvaluation

Template for custom evaluators.

Inherits: All fields from Evaluation

Custom Fields:

Custom evaluators can add domain-specific fields:

# Example: Medical accuracy evaluator
{
  "evaluation_id": "eval_med_001",
  "evaluator_name": "medical_accuracy",
  "score": 0.85,
  "explanation": "Medically sound advice with appropriate caveats",
  "medical_fields": {
    "diagnosis_accuracy": 0.9,
    "treatment_appropriateness": 0.8,
    "contraindication_awareness": 0.95,
    "disclaimer_present": True
  },
  "risk_level": "low",
  "regulatory_compliance": True,
  "citations_medical_literature": 3
}

Multi-Evaluator Results

class MultiEvaluationResult

Results from running multiple evaluators on the same target.

target_event_id: str

ID of the evaluated event.

evaluations: List[Evaluation]

List of individual evaluation results.

summary_score: float | None

Aggregated score across all evaluations.

summary_method: str | None

Method used for score aggregation.

Values: "weighted_average", "simple_average", "minimum", "custom"

weights: Dict[str, float] | None

Weights used for aggregation.

Example:

{
  "factual_accuracy": 0.4,
  "relevance": 0.3,
  "quality": 0.2,
  "toxicity": 0.1
}

Example:

{
  "target_event_id": "evt_12345",
  "evaluations": [
    {
      "evaluator_name": "factual_accuracy",
      "score": 0.92
    },
    {
      "evaluator_name": "relevance",
      "score": 0.88
    },
    {
      "evaluator_name": "toxicity",
      "score": 0.05
    }
  ],
  "summary_score": 0.89,
  "summary_method": "weighted_average",
  "weights": {
    "factual_accuracy": 0.5,
    "relevance": 0.3,
    "toxicity": 0.2
  }
}
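Weighted aggregation can be sketched as below. Note that lower-is-better scores such as toxicity generally need inverting (e.g. `1 - score`) before being averaged alongside quality-style scores, which is why a naive weighted mean over the raw values in the example above will not reproduce its `summary_score`:

```python
def weighted_summary(evaluations, weights):
    """Weighted average of evaluation scores, normalized by total weight."""
    total = sum(weights.get(e["evaluator_name"], 0.0) for e in evaluations)
    if total == 0:
        return None
    return sum(
        weights.get(e["evaluator_name"], 0.0) * e["score"] for e in evaluations
    ) / total
```

Evaluators missing from `weights` contribute zero, and the result is normalized so partial weight sets still average correctly.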

Evaluation Batch Model

class EvaluationBatch

Represents a batch of evaluations for efficient processing.

batch_id: str

Unique identifier for the batch.

evaluations: List[Evaluation]

List of evaluations in the batch.

batch_metadata: Dict[str, Any]

Batch-level metadata.

Example:

{
  "batch_size": 100,
  "created_at": "2024-01-15T10:30:45Z",
  "evaluator_versions": {
    "quality": "v1.2.0",
    "factual_accuracy": "v2.1.0"
  },
  "processing_time_ms": 5432.1
}

status: str

Batch processing status.

Values: "pending", "processing", "completed", "failed"

Evaluation Configuration

class EvaluationConfig

Configuration for evaluation runs.

evaluators: List[str]

List of evaluator names to run.

evaluator_configs: Dict[str, Dict[str, Any]]

Per-evaluator configuration.

Example:

{
  "factual_accuracy": {
    "reference_sources": ["wikipedia", "scholarly"],
    "confidence_threshold": 0.8
  },
  "toxicity": {
    "categories": ["hate_speech", "harassment"],
    "threshold": 0.1
  }
}

parallel_execution: bool

Whether to run evaluators in parallel.

timeout_ms: int | None

Timeout for evaluation execution.

retry_config: Dict[str, Any] | None

Retry configuration for failed evaluations.

Example:

{
  "max_retries": 3,
  "backoff_multiplier": 2.0,
  "initial_delay_ms": 1000
}
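With this retry configuration, the delay before retry *n* follows `initial_delay_ms * backoff_multiplier ** (n - 1)`. A sketch of the resulting schedule, assuming plain exponential backoff without jitter:

```python
retry_config = {"max_retries": 3, "backoff_multiplier": 2.0, "initial_delay_ms": 1000}

# Delay before each retry attempt, in milliseconds.
delays = [
    retry_config["initial_delay_ms"] * retry_config["backoff_multiplier"] ** attempt
    for attempt in range(retry_config["max_retries"])
]
# delays == [1000.0, 2000.0, 4000.0]
```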

Evaluation Metrics

Performance Metrics:

{
  "evaluation_performance": {
    "total_evaluations": 1000,
    "successful_evaluations": 985,
    "failed_evaluations": 15,
    "success_rate": 0.985,
    "average_duration_ms": 1250.5,
    "p95_duration_ms": 3200.0,
    "total_cost_usd": 0.15
  }
}
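These aggregates can be computed from a list of evaluation records. A sketch; the percentile below uses the nearest-rank method, and actual SDK reporting may differ:

```python
import math

def performance_metrics(evaluations):
    """Aggregate batch-level performance metrics from evaluation records."""
    durations = sorted(
        e["duration_ms"] for e in evaluations if e.get("duration_ms") is not None
    )
    successful = sum(1 for e in evaluations if e["status"] == "completed")
    failed = sum(1 for e in evaluations if e["status"] == "failed")
    # Nearest-rank p95: smallest duration >= 95% of the sample.
    p95 = durations[math.ceil(len(durations) * 0.95) - 1] if durations else None
    return {
        "total_evaluations": len(evaluations),
        "successful_evaluations": successful,
        "failed_evaluations": failed,
        "success_rate": successful / len(evaluations),
        "average_duration_ms": sum(durations) / len(durations) if durations else None,
        "p95_duration_ms": p95,
    }
```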

Quality Metrics:

{
  "evaluation_quality": {
    "evaluator_agreement": 0.82,
    "inter_evaluator_correlation": {
      "quality_vs_relevance": 0.75,
      "factual_accuracy_vs_quality": 0.68
    },
    "confidence_distribution": {
      "high_confidence": 0.65,
      "medium_confidence": 0.25,
      "low_confidence": 0.10
    }
  }
}

Evaluation Serialization

JSON Format:

import json
from datetime import datetime, timezone

evaluation = {
    "evaluation_id": "eval_123",
    "target_event_id": "evt_456",
    "evaluator_name": "quality",
    "score": 0.85,
    # datetime.utcnow() is deprecated; use an aware UTC timestamp instead.
    "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
    # ... other fields
}

json_data = json.dumps(evaluation, ensure_ascii=False, indent=2)

Pydantic Models:

from pydantic import BaseModel, Field
from typing import Optional, Union
from datetime import datetime, timezone

class EvaluationModel(BaseModel):
    evaluation_id: str = Field(..., description="Unique evaluation identifier")
    target_event_id: str = Field(..., description="ID of evaluated event")
    evaluator_name: str = Field(..., description="Name of evaluator")
    score: Optional[Union[float, int, bool, str]] = Field(None, description="Evaluation score")
    explanation: Optional[str] = Field(None, description="Score explanation")
    confidence: Optional[float] = Field(None, ge=0.0, le=1.0, description="Confidence level")
    timestamp: datetime = Field(
        default_factory=lambda: datetime.now(timezone.utc),
        description="Evaluation timestamp",
    )

    class Config:  # Pydantic v1 style; v2 uses model_config and field serializers
        json_encoders = {
            datetime: lambda v: v.isoformat().replace("+00:00", "Z")
        }

Common Evaluation Patterns

A/B Testing Evaluations:

{
  "evaluation_id": "eval_ab_001",
  "evaluator_name": "ab_test_comparison",
  "score": 0.65,
  "explanation": "Variant A preferred in 65% of comparisons",
  "comparison_data": {
    "variant_a_event_id": "evt_variant_a",
    "variant_b_event_id": "evt_variant_b",
    "preference_score": 0.65,
    "confidence_interval": [0.58, 0.72],
    "sample_size": 1000,
    "statistical_significance": true
  }
}

Human vs AI Evaluations:

{
  "evaluation_id": "eval_human_ai_001",
  "evaluator_name": "human_ai_comparison",
  "human_evaluation": {
    "annotator_id": "human_123",
    "score": 0.8,
    "explanation": "Good response but could be more concise"
  },
  "ai_evaluation": {
    "model": "gpt-4",
    "score": 0.85,
    "explanation": "High quality response with good structure"
  },
  "agreement_score": 0.92,
  "discrepancy_analysis": "Minor difference in verbosity assessment"
}

Temporal Evaluations:

{
  "evaluation_id": "eval_temporal_001",
  "evaluator_name": "temporal_quality",
  "current_score": 0.85,
  "historical_scores": [0.78, 0.82, 0.85],
  "trend": "improving",
  "change_rate": 0.02,
  "stability_score": 0.91
}
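The trend fields can be derived from the score history. A sketch assuming the trend is judged from the mean per-step change with a small stability band; the SDK's actual derivation (including how `change_rate` is defined) is not specified here:

```python
def temporal_fields(historical_scores, epsilon=0.005):
    """Derive trend fields from a chronological list of scores."""
    deltas = [b - a for a, b in zip(historical_scores, historical_scores[1:])]
    change_rate = sum(deltas) / len(deltas)  # mean per-step change
    if change_rate > epsilon:
        trend = "improving"
    elif change_rate < -epsilon:
        trend = "declining"
    else:
        trend = "stable"
    return {
        "current_score": historical_scores[-1],
        "trend": trend,
        "change_rate": round(change_rate, 3),
    }
```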

Best Practices

Evaluation Design Guidelines:

  1. Clear Metrics: Define clear, measurable evaluation criteria

  2. Consistent Scoring: Use consistent scoring scales across evaluators

  3. Rich Context: Provide explanations and confidence scores

  4. Error Handling: Gracefully handle evaluation failures

  5. Performance: Optimize for evaluation speed and cost

  6. Validation: Validate evaluator performance with ground truth data

Data Quality:

  1. Score Normalization: Ensure scores are on consistent scales

  2. Missing Data: Handle cases where evaluations cannot be performed

  3. Confidence Reporting: Always report confidence when available

  4. Metadata Capture: Include relevant evaluation context

Performance Optimization:

  1. Batch Processing: Use batched evaluations for efficiency

  2. Caching: Cache evaluation results for repeated queries

  3. Parallel Execution: Run independent evaluators in parallel

  4. Timeout Handling: Set appropriate timeouts for evaluators
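Independent evaluators can be fanned out with a thread pool under a shared timeout. A sketch of the parallel-execution pattern; the evaluator callables here are hypothetical stand-ins for real evaluator implementations:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_evaluators(evaluators, target, timeout_s=30.0):
    """Run independent evaluator callables in parallel, collecting per-evaluator results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(evaluators))) as pool:
        futures = {pool.submit(fn, target): name for name, fn in evaluators.items()}
        for future in as_completed(futures, timeout=timeout_s):
            name = futures[future]
            try:
                results[name] = {"status": "completed", "score": future.result()}
            except Exception as exc:
                # A failed evaluator becomes a "failed" record instead of
                # aborting the whole run.
                results[name] = {"status": "failed", "error": str(exc)}
    return results
```

This mirrors the error-handling guidance above: one evaluator failing produces a `"failed"` result rather than losing the others.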

See Also