Evaluation Data Models

Note

Technical specification for HoneyHive evaluation data structures

This document defines the exact data models and formats used for evaluations in the HoneyHive SDK.

Evaluations assess the quality, accuracy, and performance of LLM outputs using various metrics and criteria.

Core Evaluation Model

class Evaluation

The primary evaluation data structure.

evaluation_id: str

Unique identifier for the evaluation.

Format: UUID v4 string with "eval_" prefix
Example: "eval_01234567-89ab-cdef-0123-456789abcdef"
Required: Auto-generated by SDK

target_event_id: str

ID of the event being evaluated.

Format: UUID v4 string with "evt_" prefix
Example: "evt_01234567-89ab-cdef-0123-456789abcdef"
Required: Yes

evaluator_name: str

Name of the evaluator used.

Examples: "factual_accuracy", "relevance", "toxicity", "coherence"
Required: Yes

evaluator_version: str | None

Version of the evaluator.

Format: Semantic version string
Example: "v1.2.0"
Required: No

score: float | int | bool | str

The evaluation score.

Types:

- float: Continuous scores (0.0-1.0, 0-100, etc.)
- int: Discrete scores (1-5, 0-10, etc.)
- bool: Binary pass/fail evaluations
- str: Categorical scores ("good", "bad", "excellent")

Examples: 0.85, 4, True, "excellent"
Required: No (some evaluators may only provide explanations)
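Because `score` is a union type, code that aggregates or compares scores usually coerces them to floats first. A minimal sketch; the categorical mapping below is a hypothetical example and should be defined per evaluator, not taken from the SDK:

```python
def normalize_score(score, categorical_map=None):
    """Coerce a float/int/bool/str score into a float for aggregation."""
    if isinstance(score, bool):  # check bool before int: bool is an int subclass
        return 1.0 if score else 0.0
    if isinstance(score, (int, float)):
        return float(score)
    if isinstance(score, str):
        # Hypothetical mapping for categorical scores; define per evaluator.
        mapping = categorical_map or {"bad": 0.0, "good": 0.5, "excellent": 1.0}
        return mapping[score]
    raise TypeError(f"Unsupported score type: {type(score).__name__}")
```

Note the `bool` check must precede the numeric check, since `True` would otherwise be converted as the integer `1`.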

explanation: str | None

Human-readable explanation of the evaluation.

Example: "The response accurately answers the question with relevant supporting evidence and maintains appropriate tone."
Required: No

confidence: float | None

Confidence level in the evaluation (0.0-1.0).

Range: 0.0 (no confidence) to 1.0 (full confidence)
Example: 0.92
Required: No

criteria: Dict[str, Any] | None

Evaluation criteria and parameters used.

Structure: Evaluator-specific criteria
Example:

{
  "accuracy_threshold": 0.8,
  "check_citations": true,
  "reference_sources": ["wikipedia", "academic_papers"],
  "language": "en",
  "domain": "science"
}

Required: No

metadata: Dict[str, Any] | None

Additional evaluation metadata.

Structure: Key-value pairs of contextual information
Example:

{
  "evaluator_model": "gpt-4",
  "evaluation_prompt_version": "v2.1",
  "reference_data_version": "2024-01-15",
  "human_annotator_id": "annotator_123",
  "evaluation_environment": "production"
}

Required: No

timestamp: datetime

When the evaluation was performed.

Format: ISO 8601 timestamp
Example: "2024-01-15T10:30:45.123456Z"
Required: Auto-generated by SDK

duration_ms: float | None

Time taken to perform the evaluation.

Unit: Milliseconds
Example: 1250.5
Required: No

cost_usd: float | None

Cost to perform the evaluation (if applicable).

Unit: US Dollars
Example: 0.001
Required: No

status: str

Evaluation completion status.

Values:

- "completed" - Evaluation finished successfully
- "failed" - Evaluation failed due to error
- "skipped" - Evaluation was skipped
- "pending" - Evaluation is in progress

Required: Auto-determined by SDK

error: Dict[str, Any] | None

Error information if evaluation failed.

Structure:

{
  "type": "EvaluationError",
  "message": "Reference data not found",
  "code": "missing_reference",
  "context": {
    "evaluator": "factual_accuracy",
    "reference_id": "ref_123"
  }
}

Required: No (only for failed evaluations)
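Put together, a completed evaluation record built from the fields above might look like the following. This is a sketch: the auto-generated values are illustrative, and the `uuid`-based ID construction is an assumption rather than the SDK's documented behavior:

```python
import uuid
from datetime import datetime, timezone

evaluation = {
    "evaluation_id": f"eval_{uuid.uuid4()}",  # auto-generated by the SDK
    "target_event_id": "evt_01234567-89ab-cdef-0123-456789abcdef",
    "evaluator_name": "factual_accuracy",
    "evaluator_version": "v1.2.0",
    "score": 0.85,
    "explanation": "The response accurately answers the question.",
    "confidence": 0.92,
    # ISO 8601 UTC timestamp with trailing "Z", as in the format above.
    "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
    "duration_ms": 1250.5,
    "status": "completed",
}
```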

Built-in Evaluator Models

Quality Evaluator:

class QualityEvaluation

Evaluates overall response quality.

Inherits: All fields from Evaluation

Specific Fields:

dimensions: Dict[str, float]

Individual quality dimensions.

Structure:

{
  "relevance": 0.85,
  "coherence": 0.92,
  "clarity": 0.78,
  "completeness": 0.88,
  "accuracy": 0.95
}

Example:

{
  "evaluation_id": "eval_quality_001",
  "evaluator_name": "quality",
  "score": 0.876,
  "explanation": "High quality response with good relevance and accuracy",
  "dimensions": {
    "relevance": 0.85,
    "coherence": 0.92,
    "clarity": 0.78,
    "completeness": 0.88,
    "accuracy": 0.95
  }
}
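In the example above, the overall `score` of 0.876 is exactly the simple average of the five dimension scores. A sketch assuming an unweighted mean; the SDK's actual aggregation may weight dimensions differently:

```python
dimensions = {
    "relevance": 0.85,
    "coherence": 0.92,
    "clarity": 0.78,
    "completeness": 0.88,
    "accuracy": 0.95,
}

# Unweighted mean of the dimension scores, rounded to 3 decimal places.
score = round(sum(dimensions.values()) / len(dimensions), 3)
# score == 0.876, matching the example above
```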

Factual Accuracy Evaluator:

class FactualAccuracyEvaluation

Evaluates factual correctness of responses.

Specific Fields:

factual_claims: List[Dict[str, Any]]

Individual factual claims and their verification.

Structure:

[
  {
    "claim": "Paris is the capital of France",
    "verified": true,
    "confidence": 0.99,
    "sources": ["wikipedia.org/Paris"]
  },
  {
    "claim": "The population is 12 million",
    "verified": false,
    "confidence": 0.85,
    "correct_value": "2.16 million",
    "sources": ["insee.fr"]
  }
]

citation_accuracy: float | None

Accuracy of citations and references (0.0-1.0).

hallucination_detected: bool

Whether hallucinations were detected.

Example:

{
  "evaluation_id": "eval_fact_001",
  "evaluator_name": "factual_accuracy",
  "score": 0.75,
  "explanation": "Most facts are correct but population figure is outdated",
  "factual_claims": [
    {
      "claim": "Paris is the capital of France",
      "verified": true,
      "confidence": 0.99
    }
  ],
  "citation_accuracy": 0.8,
  "hallucination_detected": false
}
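One common way to derive claim-level fields like these is the fraction of verified claims, with hallucinations flagged when a claim fails verification at high confidence. A sketch, not the SDK's built-in aggregation:

```python
claims = [
    {"claim": "Paris is the capital of France", "verified": True, "confidence": 0.99},
    {"claim": "The population is 12 million", "verified": False, "confidence": 0.85},
]

# Fraction of claims that passed verification.
verified_fraction = sum(c["verified"] for c in claims) / len(claims)

# Flag a hallucination when a claim is confidently unverified
# (the 0.8 threshold here is an illustrative assumption).
hallucination_detected = any(
    not c["verified"] and c["confidence"] >= 0.8 for c in claims
)
# verified_fraction == 0.5, hallucination_detected == True
```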

Toxicity Evaluator:

class ToxicityEvaluation

Evaluates content toxicity and safety.

Specific Fields:

toxicity_categories: Dict[str, float]

Scores for different toxicity categories.

Structure:

{
  "hate_speech": 0.02,
  "harassment": 0.01,
  "violence": 0.03,
  "sexual_content": 0.00,
  "profanity": 0.05,
  "identity_attack": 0.01
}

overall_toxicity: float

Overall toxicity score (0.0-1.0).

content_warnings: List[str]

Specific content warnings if applicable.

Example:

{
  "evaluation_id": "eval_toxic_001",
  "evaluator_name": "toxicity",
  "score": 0.05,
  "explanation": "Content is safe with minimal profanity",
  "toxicity_categories": {
    "hate_speech": 0.02,
    "harassment": 0.01,
    "violence": 0.03,
    "profanity": 0.05
  },
  "overall_toxicity": 0.05,
  "content_warnings": []
}
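In this example the overall toxicity equals the highest category score (profanity at 0.05). Taking the max over categories and flagging those above a threshold is one plausible derivation; the SDK does not guarantee max aggregation, so treat this as a sketch:

```python
toxicity_categories = {
    "hate_speech": 0.02,
    "harassment": 0.01,
    "violence": 0.03,
    "profanity": 0.05,
}

# Overall toxicity as the worst (highest) category score.
overall_toxicity = max(toxicity_categories.values())

# Warn on categories exceeding a configured threshold (0.1 here,
# matching the EvaluationConfig example later in this document).
content_warnings = [k for k, v in toxicity_categories.items() if v > 0.1]
# overall_toxicity == 0.05, content_warnings == []
```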

Relevance Evaluator:

class RelevanceEvaluation

Evaluates response relevance to the query.

Specific Fields:

query_intent_match: float

How well the response matches the query intent (0.0-1.0).

topic_alignment: float

Alignment with the expected topic (0.0-1.0).

information_completeness: float

Completeness of information provided (0.0-1.0).

Example:

{
  "evaluation_id": "eval_rel_001",
  "evaluator_name": "relevance",
  "score": 0.88,
  "explanation": "Response directly addresses the question with comprehensive information",
  "query_intent_match": 0.92,
  "topic_alignment": 0.85,
  "information_completeness": 0.87
}

Length Evaluator:

class LengthEvaluation

Evaluates response length appropriateness.

Specific Fields:

character_count: int

Number of characters in the response.

word_count: int

Number of words in the response.

sentence_count: int

Number of sentences in the response.

expected_length_range: Dict[str, int]

Expected length range.

Structure:

{
  "min_words": 50,
  "max_words": 200,
  "optimal_words": 125
}

length_appropriateness: str

Assessment of length appropriateness.

Values: "too_short", "appropriate", "too_long"

Example:

{
  "evaluation_id": "eval_len_001",
  "evaluator_name": "length",
  "score": 0.9,
  "explanation": "Response length is appropriate for the question complexity",
  "character_count": 543,
  "word_count": 87,
  "sentence_count": 6,
  "expected_length_range": {
    "min_words": 50,
    "max_words": 150
  },
  "length_appropriateness": "appropriate"
}
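The count fields can be derived directly from the response text. A rough sketch using whitespace word splitting and naive punctuation-based sentence segmentation; a real evaluator would use proper tokenization:

```python
import re

def length_fields(text, min_words=50, max_words=150):
    """Derive length-evaluation fields from raw response text."""
    words = text.split()
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(words) < min_words:
        appropriateness = "too_short"
    elif len(words) > max_words:
        appropriateness = "too_long"
    else:
        appropriateness = "appropriate"
    return {
        "character_count": len(text),
        "word_count": len(words),
        "sentence_count": len(sentences),
        "length_appropriateness": appropriateness,
    }
```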

Custom Evaluator Model

class CustomEvaluation

Template for custom evaluators.

Inherits: All fields from Evaluation

Custom Fields:

Custom evaluators can add domain-specific fields:

# Example: Medical accuracy evaluator
{
  "evaluation_id": "eval_med_001",
  "evaluator_name": "medical_accuracy",
  "score": 0.85,
  "explanation": "Medically sound advice with appropriate caveats",
  "medical_fields": {
    "diagnosis_accuracy": 0.9,
    "treatment_appropriateness": 0.8,
    "contraindication_awareness": 0.95,
    "disclaimer_present": True
  },
  "risk_level": "low",
  "regulatory_compliance": True,
  "citations_medical_literature": 3
}

Multi-Evaluator Results

class MultiEvaluationResult

Results from running multiple evaluators on the same target.

target_event_id: str

ID of the evaluated event.

evaluations: List[Evaluation]

List of individual evaluation results.

summary_score: float | None

Aggregated score across all evaluations.

summary_method: str | None

Method used for score aggregation.

Values: "weighted_average", "simple_average", "minimum", "custom"

weights: Dict[str, float] | None

Weights used for aggregation.

Example:

{
  "factual_accuracy": 0.4,
  "relevance": 0.3,
  "quality": 0.2,
  "toxicity": 0.1
}

Example:

{
  "target_event_id": "evt_12345",
  "evaluations": [
    {
      "evaluator_name": "factual_accuracy",
      "score": 0.92
    },
    {
      "evaluator_name": "relevance",
      "score": 0.88
    },
    {
      "evaluator_name": "toxicity",
      "score": 0.05
    }
  ],
  "summary_score": 0.89,
  "summary_method": "weighted_average",
  "weights": {
    "factual_accuracy": 0.5,
    "relevance": 0.3,
    "toxicity": 0.2
  }
}
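Weighted aggregation can be sketched as below. Note that lower-is-better scores such as toxicity generally need inverting (e.g. `1 - score`) before being averaged alongside quality-style scores, which is why a naive weighted mean over the raw values in the example above will not reproduce its `summary_score`:

```python
def weighted_summary(evaluations, weights):
    """Weighted average of evaluation scores, normalized by total weight."""
    total = sum(weights.get(e["evaluator_name"], 0.0) for e in evaluations)
    if total == 0:
        return None
    return sum(
        weights.get(e["evaluator_name"], 0.0) * e["score"] for e in evaluations
    ) / total
```

Evaluators missing from `weights` contribute zero, and the result is normalized so partial weight sets still average correctly.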

Evaluation Batch Model

class EvaluationBatch

Represents a batch of evaluations for efficient processing.

batch_id: str

Unique identifier for the batch.

evaluations: List[Evaluation]

List of evaluations in the batch.

batch_metadata: Dict[str, Any]

Batch-level metadata.

Example:

{
  "batch_size": 100,
  "created_at": "2024-01-15T10:30:45Z",
  "evaluator_versions": {
    "quality": "v1.2.0",
    "factual_accuracy": "v2.1.0"
  },
  "processing_time_ms": 5432.1
}

status: str

Batch processing status.

Values: "pending", "processing", "completed", "failed"

Evaluation Configuration

class EvaluationConfig

Configuration for evaluation runs.

evaluators: List[str]

List of evaluator names to run.

evaluator_configs: Dict[str, Dict[str, Any]]

Per-evaluator configuration.

Example:

{
  "factual_accuracy": {
    "reference_sources": ["wikipedia", "scholarly"],
    "confidence_threshold": 0.8
  },
  "toxicity": {
    "categories": ["hate_speech", "harassment"],
    "threshold": 0.1
  }
}

parallel_execution: bool

Whether to run evaluators in parallel.

timeout_ms: int | None

Timeout for evaluation execution.

retry_config: Dict[str, Any] | None

Retry configuration for failed evaluations.

Example:

{
  "max_retries": 3,
  "backoff_multiplier": 2.0,
  "initial_delay_ms": 1000
}
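With this retry configuration, the delay before retry *n* follows `initial_delay_ms * backoff_multiplier ** (n - 1)`. A sketch of the resulting schedule, assuming plain exponential backoff without jitter:

```python
retry_config = {"max_retries": 3, "backoff_multiplier": 2.0, "initial_delay_ms": 1000}

# Delay before each retry attempt, in milliseconds.
delays = [
    retry_config["initial_delay_ms"] * retry_config["backoff_multiplier"] ** attempt
    for attempt in range(retry_config["max_retries"])
]
# delays == [1000.0, 2000.0, 4000.0]
```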

Evaluation Metrics

Performance Metrics:

{
  "evaluation_performance": {
    "total_evaluations": 1000,
    "successful_evaluations": 985,
    "failed_evaluations": 15,
    "success_rate": 0.985,
    "average_duration_ms": 1250.5,
    "p95_duration_ms": 3200.0,
    "total_cost_usd": 0.15
  }
}
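These aggregates can be computed from a list of evaluation records. A sketch; the percentile below uses the nearest-rank method, and actual SDK reporting may differ:

```python
import math

def performance_metrics(evaluations):
    """Aggregate batch-level performance metrics from evaluation records."""
    durations = sorted(
        e["duration_ms"] for e in evaluations if e.get("duration_ms") is not None
    )
    successful = sum(1 for e in evaluations if e["status"] == "completed")
    failed = sum(1 for e in evaluations if e["status"] == "failed")
    # Nearest-rank p95: smallest duration >= 95% of the sample.
    p95 = durations[math.ceil(len(durations) * 0.95) - 1] if durations else None
    return {
        "total_evaluations": len(evaluations),
        "successful_evaluations": successful,
        "failed_evaluations": failed,
        "success_rate": successful / len(evaluations),
        "average_duration_ms": sum(durations) / len(durations) if durations else None,
        "p95_duration_ms": p95,
    }
```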

Quality Metrics:

{
  "evaluation_quality": {
    "evaluator_agreement": 0.82,
    "inter_evaluator_correlation": {
      "quality_vs_relevance": 0.75,
      "factual_accuracy_vs_quality": 0.68
    },
    "confidence_distribution": {
      "high_confidence": 0.65,
      "medium_confidence": 0.25,
      "low_confidence": 0.10
    }
  }
}

Evaluation Serialization

JSON Format:

import json
from datetime import datetime, timezone

evaluation = {
    "evaluation_id": "eval_123",
    "target_event_id": "evt_456",
    "evaluator_name": "quality",
    "score": 0.85,
    # datetime.utcnow() is deprecated; use an aware UTC timestamp instead.
    "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
    # ... other fields
}

json_data = json.dumps(evaluation, ensure_ascii=False, indent=2)

Pydantic Models:

from pydantic import BaseModel, Field
from typing import Optional, Union
from datetime import datetime, timezone

class EvaluationModel(BaseModel):
    evaluation_id: str = Field(..., description="Unique evaluation identifier")
    target_event_id: str = Field(..., description="ID of evaluated event")
    evaluator_name: str = Field(..., description="Name of evaluator")
    score: Optional[Union[float, int, bool, str]] = Field(None, description="Evaluation score")
    explanation: Optional[str] = Field(None, description="Score explanation")
    confidence: Optional[float] = Field(None, ge=0.0, le=1.0, description="Confidence level")
    timestamp: datetime = Field(
        default_factory=lambda: datetime.now(timezone.utc),
        description="Evaluation timestamp",
    )

    class Config:  # Pydantic v1 style; v2 uses model_config and field serializers
        json_encoders = {
            datetime: lambda v: v.isoformat().replace("+00:00", "Z")
        }

Common Evaluation Patterns

A/B Testing Evaluations:

{
  "evaluation_id": "eval_ab_001",
  "evaluator_name": "ab_test_comparison",
  "score": 0.65,
  "explanation": "Variant A preferred in 65% of comparisons",
  "comparison_data": {
    "variant_a_event_id": "evt_variant_a",
    "variant_b_event_id": "evt_variant_b",
    "preference_score": 0.65,
    "confidence_interval": [0.58, 0.72],
    "sample_size": 1000,
    "statistical_significance": true
  }
}

Human vs AI Evaluations:

{
  "evaluation_id": "eval_human_ai_001",
  "evaluator_name": "human_ai_comparison",
  "human_evaluation": {
    "annotator_id": "human_123",
    "score": 0.8,
    "explanation": "Good response but could be more concise"
  },
  "ai_evaluation": {
    "model": "gpt-4",
    "score": 0.85,
    "explanation": "High quality response with good structure"
  },
  "agreement_score": 0.92,
  "discrepancy_analysis": "Minor difference in verbosity assessment"
}

Temporal Evaluations:

{
  "evaluation_id": "eval_temporal_001",
  "evaluator_name": "temporal_quality",
  "current_score": 0.85,
  "historical_scores": [0.78, 0.82, 0.85],
  "trend": "improving",
  "change_rate": 0.02,
  "stability_score": 0.91
}
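The trend fields can be derived from the score history. A sketch assuming the trend is judged from the mean per-step change with a small stability band; the SDK's actual derivation (including how `change_rate` is defined) is not specified here:

```python
def temporal_fields(historical_scores, epsilon=0.005):
    """Derive trend fields from a chronological list of scores."""
    deltas = [b - a for a, b in zip(historical_scores, historical_scores[1:])]
    change_rate = sum(deltas) / len(deltas)  # mean per-step change
    if change_rate > epsilon:
        trend = "improving"
    elif change_rate < -epsilon:
        trend = "declining"
    else:
        trend = "stable"
    return {
        "current_score": historical_scores[-1],
        "trend": trend,
        "change_rate": round(change_rate, 3),
    }
```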

Best Practices

Evaluation Design Guidelines:

  1. Clear Metrics: Define clear, measurable evaluation criteria

  2. Consistent Scoring: Use consistent scoring scales across evaluators

  3. Rich Context: Provide explanations and confidence scores

  4. Error Handling: Gracefully handle evaluation failures

  5. Performance: Optimize for evaluation speed and cost

  6. Validation: Validate evaluator performance with ground truth data

Data Quality:

  1. Score Normalization: Ensure scores are on consistent scales

  2. Missing Data: Handle cases where evaluations cannot be performed

  3. Confidence Reporting: Always report confidence when available

  4. Metadata Capture: Include relevant evaluation context

Performance Optimization:

  1. Batch Processing: Use batched evaluations for efficiency

  2. Caching: Cache evaluation results for repeated queries

  3. Parallel Execution: Run independent evaluators in parallel

  4. Timeout Handling: Set appropriate timeouts for evaluators
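Independent evaluators can be fanned out with a thread pool under a shared timeout. A sketch of the parallel-execution pattern; the evaluator callables here are hypothetical stand-ins for real evaluator implementations:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_evaluators(evaluators, target, timeout_s=30.0):
    """Run independent evaluator callables in parallel, collecting per-evaluator results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(evaluators))) as pool:
        futures = {pool.submit(fn, target): name for name, fn in evaluators.items()}
        for future in as_completed(futures, timeout=timeout_s):
            name = futures[future]
            try:
                results[name] = {"status": "completed", "score": future.result()}
            except Exception as exc:
                # A failed evaluator becomes a "failed" record instead of
                # aborting the whole run.
                results[name] = {"status": "failed", "error": str(exc)}
    return results
```

This mirrors the error-handling guidance above: one evaluator failing produces a `"failed"` result rather than losing the others.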

See Also