Evaluators Reference

Complete reference for all evaluation classes and functions in HoneyHive.

Base Classes

BaseEvaluator

Base class for all custom evaluators.

class honeyhive.evaluation.evaluators.BaseEvaluator(name, **kwargs)[source]

Bases: object

Base class for custom evaluators.

Parameters:
  • name (str) – Name of the evaluator

  • kwargs (Any) – Additional configuration options

__init__(name, **kwargs)[source]

Initialize the evaluator.

Parameters:
  • name (str) – Name of the evaluator

  • kwargs (Any) – Additional configuration options

Return type:

None

evaluate(inputs, outputs, ground_truth=None, **kwargs)[source]

Evaluate the given inputs and outputs.

Parameters:
  • inputs – Inputs provided to the task under evaluation

  • outputs – Outputs produced by the task

  • ground_truth – Optional ground truth to compare against

  • kwargs (Any) – Additional evaluation parameters

Return type:

Dict[str, Any]

__call__(inputs, outputs, ground_truth=None, **kwargs)[source]

Make the evaluator callable.

Parameters:
  • inputs – Inputs provided to the task under evaluation

  • outputs – Outputs produced by the task

  • ground_truth – Optional ground truth to compare against

  • kwargs (Any) – Additional evaluation parameters

Return type:

Dict[str, Any]

Example

from honeyhive.evaluation import BaseEvaluator

class CustomEvaluator(BaseEvaluator):
    def __init__(self, threshold=0.5, **kwargs):
        super().__init__("custom_evaluator", **kwargs)
        self.threshold = threshold

    def evaluate(self, inputs, outputs, ground_truth=None, **kwargs):
        # Custom evaluation logic
        score = self._compute_score(outputs)
        return {
            "score": score,
            "passed": score >= self.threshold
        }

    def _compute_score(self, outputs):
        # Placeholder scoring logic; replace with your own metric
        response = outputs.get("response", "")
        return min(len(response) / 100.0, 1.0)
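
Because __call__ shares the evaluate() signature, a configured evaluator instance can also be invoked directly. A usage sketch based on the class above:

evaluator = CustomEvaluator(threshold=0.7)
result = evaluator(
    inputs={"prompt": "What is AI?"},
    outputs={"response": "AI is the study of building systems that perform tasks requiring intelligence."}
)
print(result["passed"])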

Built-in Evaluators

ExactMatchEvaluator

Evaluates exact string matching between expected and actual outputs.

class honeyhive.evaluation.evaluators.ExactMatchEvaluator(**kwargs)[source]

Bases: BaseEvaluator

Evaluator for exact string matching.

Parameters:

kwargs (Any)

__init__(**kwargs)[source]

Initialize the exact match evaluator.

Parameters:

kwargs (Any)

Return type:

None

evaluate(inputs, outputs, ground_truth=None, **kwargs)[source]

Evaluate exact match between expected and actual outputs.

Parameters:
  • inputs – Inputs containing the expected value

  • outputs – Outputs containing the actual value

  • ground_truth – Optional ground truth to compare against

  • kwargs (Any) – Additional evaluation parameters

Return type:

Dict[str, Any]

Description

The ExactMatchEvaluator checks if the actual output exactly matches the expected output. String comparisons are case-insensitive and whitespace is stripped.
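
The comparison described above can be approximated as follows (a sketch assuming surrounding whitespace is stripped; not the library's exact implementation):

def exact_match(expected: str, actual: str) -> float:
    # Case-insensitive comparison after stripping surrounding whitespace
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0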

Example

from honeyhive.evaluation import ExactMatchEvaluator

evaluator = ExactMatchEvaluator()

result = evaluator.evaluate(
    inputs={"expected": "The answer is 42"},
    outputs={"response": "The answer is 42"}
)
# Returns: {"exact_match": 1.0, "expected": "...", "actual": "..."}

# Case-insensitive matching
result = evaluator.evaluate(
    inputs={"expected": "hello"},
    outputs={"response": "HELLO"}
)
# Returns: {"exact_match": 1.0, ...}

F1ScoreEvaluator

Evaluates F1 score for text similarity.

class honeyhive.evaluation.evaluators.F1ScoreEvaluator(**kwargs)[source]

Bases: BaseEvaluator

Evaluator for F1 score calculation.

Parameters:

kwargs (Any)

__init__(**kwargs)[source]

Initialize the F1 score evaluator.

Parameters:

kwargs (Any)

Return type:

None

evaluate(inputs, outputs, ground_truth=None, **kwargs)[source]

Evaluate F1 score between expected and actual outputs.

Parameters:
  • inputs – Inputs containing the expected value

  • outputs – Outputs containing the actual value

  • ground_truth – Optional ground truth to compare against

  • kwargs (Any) – Additional evaluation parameters

Return type:

Dict[str, Any]

Description

The F1ScoreEvaluator computes the F1 score between predicted and ground truth text based on word-level token overlap. It calculates precision and recall and combines them into an F1 score.

Formula

precision = |predicted_words ∩ ground_truth_words| / |predicted_words|
recall = |predicted_words ∩ ground_truth_words| / |ground_truth_words|
f1_score = 2 * (precision * recall) / (precision + recall)
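
A minimal sketch of the word-level calculation above (tokenization is assumed to be simple whitespace splitting; the library's internal tokenization may differ):

def word_f1(prediction: str, ground_truth: str) -> float:
    # Compare the sets of words in the prediction and the ground truth
    pred_words = set(prediction.lower().split())
    truth_words = set(ground_truth.lower().split())
    overlap = len(pred_words & truth_words)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(truth_words)
    return 2 * precision * recall / (precision + recall)

# word_f1("the fast brown fox", "the quick brown fox") -> 0.75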

Example

from honeyhive.evaluation import F1ScoreEvaluator

evaluator = F1ScoreEvaluator()

result = evaluator.evaluate(
    inputs={"expected": "the quick brown fox"},
    outputs={"response": "the fast brown fox"}
)
# Returns: {"f1_score": 0.75}  # 3 out of 4 words match

SemanticSimilarityEvaluator

Evaluates semantic similarity between expected and actual outputs.

class honeyhive.evaluation.evaluators.SemanticSimilarityEvaluator(**kwargs)[source]

Bases: BaseEvaluator

Evaluator for semantic similarity using basic heuristics.

Parameters:

kwargs (Any)

__init__(**kwargs)[source]

Initialize the semantic similarity evaluator.

Parameters:

kwargs (Any)

Return type:

None

evaluate(inputs, outputs, ground_truth=None, **kwargs)[source]

Evaluate semantic similarity between expected and actual outputs.

Parameters:
  • inputs – Inputs containing the expected value

  • outputs – Outputs containing the actual value

  • ground_truth – Optional ground truth to compare against

  • kwargs (Any) – Additional evaluation parameters

Return type:

Dict[str, Any]

Description

The SemanticSimilarityEvaluator scores how semantically close the actual output is to the expected output. It is more forgiving than exact match or F1 score because it is intended to capture meaning rather than only token overlap.
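
One common way to produce such a score is cosine similarity over text embeddings. The sketch below illustrates the idea with a hypothetical get_embedding helper; it is not the evaluator's internal implementation (the class docstring notes it uses basic heuristics):

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_score(expected: str, actual: str, threshold: float = 0.8):
    # get_embedding is a hypothetical helper backed by any embedding model
    similarity = cosine_similarity(get_embedding(expected), get_embedding(actual))
    return {"similarity": similarity, "passed": similarity >= threshold}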

Example

from honeyhive.evaluation import SemanticSimilarityEvaluator

evaluator = SemanticSimilarityEvaluator(
    embedding_model="text-embedding-ada-002",
    threshold=0.8
)

result = evaluator.evaluate(
    inputs={"expected": "The weather is nice today"},
    outputs={"response": "It's a beautiful day outside"}
)
# Returns: {"similarity": 0.85, "passed": True}

Evaluation Decorators

evaluator

Decorator for defining synchronous evaluators.

honeyhive.evaluation.evaluators.evaluator(_name=None, _session_id=None, **_kwargs)[source]

Decorator for synchronous evaluation functions.

Parameters:
  • name – Evaluation name

  • session_id – Session ID for tracing

  • **kwargs – Additional evaluation parameters

  • _name (str | None)

  • _session_id (str | None)

  • _kwargs (Any)

Return type:

Callable[[Callable], Callable]

Description

The evaluator decorator converts a regular function into an evaluator that can be used with the HoneyHive evaluation system.

Example

from honeyhive import evaluator

@evaluator
def length_check(inputs, outputs, ground_truth=None, min_length=10):
    """Check if output meets minimum length requirement."""
    text = outputs.get("response", "")
    length = len(text)

    return {
        "length": length,
        "meets_minimum": length >= min_length,
        "score": 1.0 if length >= min_length else 0.0
    }

# Use in evaluation
from honeyhive import evaluate

results = evaluate(
    data=[{"input": "test"}],
    task=lambda x: {"response": "short"},
    evaluators=[length_check]
)

aevaluator

Decorator for defining asynchronous evaluators.

honeyhive.evaluation.evaluators.aevaluator(_name=None, _session_id=None, **_kwargs)[source]

Decorator for asynchronous evaluation functions.

Parameters:
  • name – Evaluation name

  • session_id – Session ID for tracing

  • **kwargs – Additional evaluation parameters

  • _name (str | None)

  • _session_id (str | None)

  • _kwargs (Any)

Return type:

Callable[[Callable], Callable]

Description

The aevaluator decorator is used for async evaluators that need to make asynchronous calls (e.g., API calls for LLM-based evaluation).

Example

from honeyhive import aevaluator
import aiohttp
import os

@aevaluator
async def llm_grader(inputs, outputs, ground_truth=None):
    """Use an LLM to grade the output."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json={
                "model": "gpt-4",
                "messages": [{
                    "role": "user",
                    "content": f"Grade this output: {outputs['response']}"
                }]
            }
        ) as response:
            result = await response.json()
            # parse_grade is a user-defined helper that extracts a numeric
            # grade (0-100) from the chat completion response
            grade = parse_grade(result)

            return {
                "grade": grade,
                "score": grade / 100.0
            }

Utility Classes

EvaluatorMeta

Metaclass for evaluator type handling.

class honeyhive.experiments.evaluators.EvaluatorMeta[source]

Bases: type

Metaclass for evaluator accessor pattern.

TerminalColors

Terminal color constants for formatted output.

class honeyhive.experiments.evaluators.TerminalColors[source]

Bases: object

ANSI terminal color codes for output formatting.

HEADER = '\x1b[95m'
OKBLUE = '\x1b[94m'
OKCYAN = '\x1b[96m'
OKGREEN = '\x1b[92m'
WARNING = '\x1b[93m'
FAIL = '\x1b[91m'
ENDC = '\x1b[0m'
BOLD = '\x1b[1m'
UNDERLINE = '\x1b[4m'
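
Example

A minimal usage sketch; these ANSI codes are written directly to the terminal and reset with ENDC:

from honeyhive.experiments.evaluators import TerminalColors

print(f"{TerminalColors.OKGREEN}All evaluations passed{TerminalColors.ENDC}")
print(f"{TerminalColors.WARNING}Low score on 2 test cases{TerminalColors.ENDC}")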

Data Models

EvaluationResult

Result model for evaluation outputs.

class honeyhive.evaluation.evaluators.EvaluationResult(score, metrics, feedback=None, metadata=None, evaluation_id=<factory>, timestamp=None)[source]

Bases: object

Result of an evaluation.

Parameters:
  • score (float)

  • metrics (Dict[str, Any])

  • feedback (str | None) – defaults to None

  • metadata (Dict[str, Any] | None) – defaults to None

  • evaluation_id (str) – defaults to a generated ID

  • timestamp (str | None) – defaults to None

Fields

  • score (float): Numeric score from evaluation

  • metrics (Dict[str, Any]): Additional metrics

  • feedback (Optional[str]): Text feedback

  • metadata (Optional[Dict[str, Any]]): Additional metadata

  • evaluation_id (str): Unique ID for this evaluation

  • timestamp (Optional[str]): Timestamp of evaluation

Example

from honeyhive.evaluation import EvaluationResult

result = EvaluationResult(
    score=0.85,
    metrics={"accuracy": 0.9, "latency": 250},
    feedback="Good response, minor improvements possible",
    metadata={"model": "gpt-4", "version": "1.0"}
)

EvaluationContext

Context information for evaluation runs.

class honeyhive.evaluation.evaluators.EvaluationContext(project, source, session_id=None, metadata=None)[source]

Bases: object

Context for evaluation runs.

Parameters:
  • project (str)

  • source (str)

  • session_id (str | None) – defaults to None

  • metadata (Dict[str, Any] | None) – defaults to None

Fields

  • project (str): Project name

  • source (str): Source of evaluation

  • session_id (Optional[str]): Session identifier

  • metadata (Optional[Dict[str, Any]]): Additional context

Example

from honeyhive.evaluation import EvaluationContext

context = EvaluationContext(
    project="my-llm-app",
    source="production",
    session_id="session-123",
    metadata={"user_id": "user-456"}
)

Evaluation Functions

evaluate

Main function for running evaluations.

honeyhive.evaluation.evaluators.evaluate(prediction, ground_truth, metrics=None, **kwargs)[source]

Evaluate a prediction against ground truth.

Parameters:
  • prediction (str) – Model prediction

  • ground_truth (str) – Ground truth value

  • metrics (List[str] | None) – List of metrics to compute

  • **kwargs (Any) – Additional evaluation parameters

Returns:

Evaluation result

Return type:

EvaluationResult

Description

The evaluate function runs a set of evaluators on your task outputs, collecting metrics and results for analysis. Note that the signature documented above takes a single prediction and ground-truth string and returns an EvaluationResult, while the parameters, return value, and example below describe the batch entry point (imported as honeyhive.evaluate in the example), which runs evaluators over a dataset of task outputs.
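
For the single-prediction form documented above, a minimal call might look like the following sketch (the metrics parameter is omitted because the accepted metric names are not listed in this reference):

from honeyhive.evaluation.evaluators import evaluate

result = evaluate(
    prediction="The answer is 42",
    ground_truth="The answer is 42",
)
print(result.score)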

Parameters

  • data (List[Dict]): Input data for evaluation

  • task (Callable): Function that produces outputs

  • evaluators (List): List of evaluator functions or objects

  • project (str, optional): Project name

  • run_name (str, optional): Name for this evaluation run

  • metadata (Dict, optional): Additional metadata

Returns

Dict containing:

  • results: List of evaluation results

  • metrics: Aggregated metrics

  • summary: Summary statistics

Example

from honeyhive import evaluate, evaluator

@evaluator
def check_length(inputs, outputs, min_words=5):
    words = len(outputs["response"].split())
    return {
        "word_count": words,
        "meets_minimum": words >= min_words,
        "score": 1.0 if words >= min_words else 0.0
    }

# Define your task
def my_task(input_data):
    # Your LLM logic here
    return {"response": "Generated response"}

# Run evaluation
results = evaluate(
    data=[
        {"prompt": "What is AI?"},
        {"prompt": "Explain ML"},
    ],
    task=my_task,
    evaluators=[check_length],
    project="my-project",
    run_name="baseline-eval"
)

print(f"Average score: {results['metrics']['average_score']}")
print(f"Pass rate: {results['metrics']['pass_rate']}")

See Also