Evaluators

Decorator-based system for defining custom quality checks and evaluators.

Overview

Evaluators assess the quality of LLM outputs. HoneyHive uses a decorator-based approach rather than class inheritance, so evaluators are plain functions that are simpler to write and to combine.

Key Features:

  • Simple function-based definitions

  • Support for both sync and async evaluators

  • Flexible return formats (dict, float, bool)

  • Automatic metric aggregation

  • Per-evaluator configuration
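
For example, a minimal end-to-end run looks like this. The dataset literal below is illustrative (see the dataset documentation for the exact format); evaluate() and its parameters are covered later on this page.

from honeyhive.experiments import evaluator, evaluate

@evaluator
def matches_expected(outputs, inputs, ground_truth):
    # outputs is the dict returned by your function; ground_truth comes from the datapoint.
    return {"score": 1.0 if outputs == ground_truth else 0.0}

def my_function(inputs):
    # Placeholder pipeline; replace with your LLM call.
    return {"answer": inputs["query"]}

result = evaluate(
    function=my_function,
    dataset=[{"inputs": {"query": "hi"}, "ground_truth": {"answer": "hi"}}],  # illustrative shape
    evaluators=[matches_expected],
    api_key="key",
    project="project"
)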

@evaluator Decorator

@evaluator(func=None, *, name=None, settings=None)

Decorator to mark a function as a synchronous evaluator.

Parameters:
  • func (Optional[Callable]) – Function to decorate (when used as @evaluator without parentheses).

  • name (Optional[str]) – Optional custom name for the evaluator. Defaults to function name.

  • settings (Optional[EvaluatorSettings]) – Optional evaluator configuration.

Returns:

Decorated evaluator function.

Return type:

Callable

Signature Requirements:

Your evaluator function must accept these parameters:

def my_evaluator(outputs, inputs, ground_truth):
    """
    Args:
        outputs: Dict returned by your function
        inputs: Dict from datapoint["inputs"]
        ground_truth: Dict from datapoint["ground_truth"] (optional)

    Returns:
        Dict with "score" and optional metrics,
        or a float (interpreted as the score),
        or a bool (True -> 1.0, False -> 0.0)
    """
    return {"score": 0.9, "passed": True}

Basic Usage (No Arguments)

from honeyhive.experiments import evaluator

@evaluator
def accuracy_check(outputs, inputs, ground_truth):
    """Check if output matches expected result."""
    return {
        "score": 1.0 if outputs == ground_truth else 0.0,
        "passed": outputs == ground_truth
    }

With Custom Name

@evaluator(name="custom_accuracy_v2")
def accuracy_check(outputs, inputs, ground_truth):
    return {"score": calculate_score(outputs, ground_truth)}

With Settings

from honeyhive.experiments import evaluator, EvaluatorSettings

@evaluator(settings=EvaluatorSettings(
    threshold=0.8,
    weight=2.0,
    enabled=True
))
def weighted_accuracy(outputs, inputs, ground_truth):
    score = calculate_accuracy(outputs, ground_truth)
    return {"score": score}

Return Formats

Evaluators can return various formats:

# Dict with score and metadata (RECOMMENDED)
@evaluator
def detailed_evaluator(outputs, inputs, ground_truth):
    return {
        "score": 0.85,
        "passed": True,
        "confidence": 0.95,
        "reason": "Output matches expected pattern"
    }

# Simple float score
@evaluator
def simple_score(outputs, inputs, ground_truth):
    return 0.85  # Interpreted as {"score": 0.85}

# Boolean (1.0 if True, 0.0 if False)
@evaluator
def pass_fail(outputs, inputs, ground_truth):
    return outputs["answer"] == ground_truth["answer"]

@aevaluator Decorator

@aevaluator(func=None, *, name=None, settings=None)

Decorator to mark a function as an asynchronous evaluator.

Same parameters and behavior as @evaluator, but for async functions.

Parameters:
  • func (Optional[Callable]) – Async function to decorate.

  • name (Optional[str]) – Optional custom name for the evaluator.

  • settings (Optional[EvaluatorSettings]) – Optional evaluator configuration.

Returns:

Decorated async evaluator function.

Return type:

Callable

Basic Async Evaluator

from honeyhive.experiments import aevaluator
import httpx

@aevaluator
async def external_api_check(outputs, inputs, ground_truth):
    """Call external API to validate output."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.example.com/validate",
            json={"output": outputs, "expected": ground_truth}
        )
        data = response.json()

        return {
            "score": data["score"],
            "api_confidence": data["confidence"]
        }

Multiple Async Operations

import asyncio

@aevaluator
async def multi_source_validation(outputs, inputs, ground_truth):
    """Validate against multiple external sources."""
    async with httpx.AsyncClient() as client:
        # Run validations concurrently
        results = await asyncio.gather(
            client.post("https://api1.com/check", json=outputs),
            client.post("https://api2.com/check", json=outputs),
            client.post("https://api3.com/check", json=outputs),
        )

        scores = [r.json()["score"] for r in results]
        avg_score = sum(scores) / len(scores)

        return {
            "score": avg_score,
            "individual_scores": scores,
            "sources_checked": len(scores)
        }

Mixing Sync and Async

You can use both sync and async evaluators together:

@evaluator
def fast_local_check(outputs, inputs, ground_truth):
    """Quick local validation."""
    return {"score": local_validation(outputs)}

@aevaluator
async def slow_api_check(outputs, inputs, ground_truth):
    """Slower external validation."""
    result = await external_api.validate(outputs)
    return {"score": result.score}

# Use both in evaluate()
result = evaluate(
    function=my_function,
    dataset=test_data,
    evaluators=[fast_local_check, slow_api_check],  # Mixed!
    api_key="key",
    project="project"
)

EvaluatorSettings

class EvaluatorSettings

Configuration for individual evaluators.

Parameters:
  • threshold (Optional[float]) – Minimum score to consider as “passed”.

  • weight (Optional[float]) – Relative weight for aggregation (default: 1.0).

  • enabled (bool) – Whether this evaluator is active (default: True).

  • timeout (Optional[float]) – Maximum execution time in seconds.

  • retry_count (int) – Number of retries on failure.

Usage Example

from honeyhive.experiments import evaluator, EvaluatorSettings

@evaluator(settings=EvaluatorSettings(
    threshold=0.7,      # Pass if score >= 0.7
    weight=2.0,         # Double weight in aggregation
    enabled=True,       # Can disable without removing
    timeout=5.0,        # 5 second timeout
    retry_count=3       # Retry up to 3 times
))
def critical_evaluator(outputs, inputs, ground_truth):
    return {"score": validate(outputs)}

EvalResult

class EvalResult

Result object returned by evaluators (internal representation).

Parameters:
  • score (float) – Numerical score (typically 0.0 to 1.0).

  • passed (bool) – Whether evaluation passed threshold.

  • metrics (Dict[str, Any]) – Additional metrics and metadata.

Note

This class is used internally. Your evaluator functions return dicts, which are automatically converted to EvalResult objects.
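
For intuition, the mapping between a returned dict and the EvalResult fields looks roughly like this (the constructor and conversion logic are internal; the mapping below is inferred from the parameters documented above):

# Your evaluator returns a plain dict:
returned = {"score": 0.85, "passed": True, "confidence": 0.95}

# Conceptually, the framework converts it to an EvalResult with:
#   score   -> 0.85
#   passed  -> True
#   metrics -> {"confidence": 0.95}   # remaining keys become metrics/metadata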

Aggregation Functions

When multiple evaluators run on the same datapoint, their scores are aggregated.

mean(scores)

Calculate arithmetic mean of scores.

>>> mean([0.8, 0.9, 0.7])
0.8

median(scores)

Calculate median score.

>>> median([0.8, 0.9, 0.7, 0.6, 1.0])
0.8

mode(scores)

Calculate mode (most common score).

>>> mode([0.8, 0.8, 0.9, 0.7, 0.8])
0.8
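
If you want to reproduce a weighted aggregate outside the framework (for example, to sanity-check a run), a plain-Python sketch using the weight field from EvaluatorSettings could look like this; it is not the library's internal implementation:

# Scores keyed by evaluator name; weights taken from each evaluator's
# EvaluatorSettings.weight (default 1.0).
scores = {"accuracy_check": 0.9, "weighted_accuracy": 0.7}
weights = {"accuracy_check": 1.0, "weighted_accuracy": 2.0}

weighted_mean = sum(scores[k] * weights[k] for k in scores) / sum(weights.values())
print(round(weighted_mean, 3))  # (0.9 * 1.0 + 0.7 * 2.0) / 3.0 = 0.767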

Evaluator Patterns

1. Exact Match

@evaluator
def exact_match(outputs, inputs, ground_truth):
    """Check for exact string match."""
    return {
        "score": 1.0 if outputs["answer"] == ground_truth["answer"] else 0.0,
        "matched": outputs["answer"] == ground_truth["answer"]
    }

2. Semantic Similarity

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

@evaluator
def semantic_similarity(outputs, inputs, ground_truth):
    """Calculate semantic similarity between output and expected."""
    output_embedding = model.encode([outputs["answer"]])
    expected_embedding = model.encode([ground_truth["answer"]])

    similarity = cosine_similarity(output_embedding, expected_embedding)[0][0]

    return {
        "score": float(similarity),
        "passed": similarity >= 0.8,
        "similarity": float(similarity)
    }

3. Length Validation

@evaluator
def length_check(outputs, inputs, ground_truth):
    """Validate output length is within acceptable range."""
    text = outputs.get("answer", "")
    word_count = len(text.split())

    min_words = inputs.get("min_words", 10)
    max_words = inputs.get("max_words", 100)

    in_range = min_words <= word_count <= max_words

    return {
        "score": 1.0 if in_range else 0.0,
        "word_count": word_count,
        "in_range": in_range,
        "min_words": min_words,
        "max_words": max_words
    }

4. Multi-Criteria Evaluation

@evaluator
def comprehensive_quality(outputs, inputs, ground_truth):
    """Evaluate multiple quality criteria."""
    answer = outputs.get("answer", "")

    # Individual criteria
    has_answer = len(answer) > 0
    correct_length = 50 <= len(answer) <= 200
    no_profanity = not contains_profanity(answer)
    factually_correct = check_facts(answer, ground_truth)

    # Weighted score
    criteria_scores = {
        "has_answer": 1.0 if has_answer else 0.0,
        "correct_length": 1.0 if correct_length else 0.5,
        "no_profanity": 1.0 if no_profanity else 0.0,
        "factually_correct": 1.0 if factually_correct else 0.0
    }

    # Average with weights
    weights = {"has_answer": 1, "correct_length": 1, "no_profanity": 2, "factually_correct": 3}
    total_weight = sum(weights.values())
    weighted_sum = sum(criteria_scores[k] * weights[k] for k in criteria_scores)
    final_score = weighted_sum / total_weight

    return {
        "score": final_score,
        "criteria_scores": criteria_scores,
        "all_passed": all(criteria_scores.values())
    }

5. LLM-as-Judge

import openai

@evaluator
def llm_judge(outputs, inputs, ground_truth):
    """Use an LLM to judge output quality."""
    client = openai.OpenAI()

    prompt = f"""
    Evaluate the following answer for accuracy and relevance.

    Question: {inputs['query']}
    Expected Answer: {ground_truth['answer']}
    Actual Answer: {outputs['answer']}

    Provide a score from 0.0 to 1.0 and explain your reasoning.
    """

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse response (assumes structured format)
    result = parse_llm_response(response.choices[0].message.content)

    return {
        "score": result["score"],
        "reasoning": result["explanation"]
    }

Best Practices

1. Keep Evaluators Pure

Avoid side effects in evaluators:

# GOOD
@evaluator
def pure_evaluator(outputs, inputs, ground_truth):
    return {"score": calculate_score(outputs, ground_truth)}

# BAD - Has side effects
@evaluator
def impure_evaluator(outputs, inputs, ground_truth):
    database.save_result(outputs)  # Side effect!
    return {"score": 0.9}

2. Handle Missing Data Gracefully

@evaluator
def robust_evaluator(outputs, inputs, ground_truth):
    # Handle missing keys
    answer = outputs.get("answer", "")
    expected = ground_truth.get("answer", "") if ground_truth else ""

    if not answer:
        return {"score": 0.0, "error": "No answer provided"}

    if not expected:
        return {"score": 0.5, "warning": "No ground truth available"}

    return {"score": compare(answer, expected)}

3. Provide Detailed Metadata

@evaluator
def detailed_evaluator(outputs, inputs, ground_truth):
    score = calculate_score(outputs, ground_truth)

    return {
        "score": score,
        "passed": score >= 0.8,
        # Add debugging info
        "output_length": len(str(outputs)),
        "processing_method": "semantic_similarity",
        "confidence": calculate_confidence(score),
        "suggestions": generate_improvements(outputs, ground_truth) if score < 0.8 else None
    }

4. Use Timeouts for External Calls

import asyncio

@aevaluator
async def api_evaluator_with_timeout(outputs, inputs, ground_truth):
    try:
        # Enforce a 5-second deadline (asyncio.timeout requires Python 3.11+)
        async with asyncio.timeout(5.0):
            result = await external_api.validate(outputs)
            return {"score": result.score}
    except asyncio.TimeoutError:
        return {"score": 0.0, "error": "API timeout"}
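
As an alternative to wrapping the call manually, the timeout and retry_count fields documented on EvaluatorSettings can express the same limit declaratively; exact enforcement (cancellation, error reporting) is up to the framework, so treat this as a sketch:

from honeyhive.experiments import aevaluator, EvaluatorSettings

@aevaluator(settings=EvaluatorSettings(timeout=5.0, retry_count=2))
async def api_evaluator(outputs, inputs, ground_truth):
    # Rely on the evaluator-level timeout/retries instead of asyncio.timeout here.
    result = await external_api.validate(outputs)
    return {"score": result.score}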

See Also