Evaluators Reference

Complete reference for all evaluation classes and functions in HoneyHive.

Base Classes

BaseEvaluator

Base class for all custom evaluators.

class honeyhive.evaluation.evaluators.BaseEvaluator(name, **kwargs)[source]

Bases: object

Base class for custom evaluators.

Parameters:
  • name (str) – Name of the evaluator

  • kwargs (Any) – Additional configuration options

__init__(name, **kwargs)[source]

Initialize the evaluator.

Parameters:
  • name (str) – Name of the evaluator

  • kwargs (Any) – Additional configuration options

Return type:

None

evaluate(inputs, outputs, ground_truth=None, **kwargs)[source]

Evaluate the given inputs and outputs.

Parameters:
  • inputs – Inputs provided to the task under evaluation

  • outputs – Outputs produced by the task

  • ground_truth – Optional ground truth to compare against

  • kwargs (Any) – Additional evaluation parameters

Return type:

Dict[str, Any]

__call__(inputs, outputs, ground_truth=None, **kwargs)[source]

Make the evaluator callable.

Parameters:
  • inputs – Inputs provided to the task under evaluation

  • outputs – Outputs produced by the task

  • ground_truth – Optional ground truth to compare against

  • kwargs (Any) – Additional evaluation parameters

Return type:

Dict[str, Any]

Example

from honeyhive.evaluation import BaseEvaluator

class CustomEvaluator(BaseEvaluator):
    def __init__(self, threshold=0.5, **kwargs):
        super().__init__("custom_evaluator", **kwargs)
        self.threshold = threshold

    def evaluate(self, inputs, outputs, ground_truth=None, **kwargs):
        # Custom evaluation logic
        score = self._compute_score(outputs)
        return {
            "score": score,
            "passed": score >= self.threshold
        }

    def _compute_score(self, outputs):
        # Placeholder scoring logic; replace with your own metric
        response = outputs.get("response", "")
        return min(len(response) / 100.0, 1.0)
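
Because __call__ shares the evaluate() signature, a configured evaluator instance can also be invoked directly. A usage sketch based on the class above:

evaluator = CustomEvaluator(threshold=0.7)
result = evaluator(
    inputs={"prompt": "What is AI?"},
    outputs={"response": "AI is the study of building systems that perform tasks requiring intelligence."}
)
print(result["passed"])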

Built-in Evaluators

ExactMatchEvaluator

Evaluates exact string matching between expected and actual outputs.

class honeyhive.evaluation.evaluators.ExactMatchEvaluator(**kwargs)[source]

Bases: BaseEvaluator

Evaluator for exact string matching.

Parameters:

kwargs (Any)

__init__(**kwargs)[source]

Initialize the exact match evaluator.

Parameters:

kwargs (Any)

Return type:

None

evaluate(inputs, outputs, ground_truth=None, **kwargs)[source]

Evaluate exact match between expected and actual outputs.

Parameters:
  • inputs – Inputs containing the expected value

  • outputs – Outputs containing the actual value

  • ground_truth – Optional ground truth to compare against

  • kwargs (Any) – Additional evaluation parameters

Return type:

Dict[str, Any]

Description

The ExactMatchEvaluator checks if the actual output exactly matches the expected output. String comparisons are case-insensitive and whitespace is stripped.
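
The comparison described above can be approximated as follows (a sketch assuming surrounding whitespace is stripped; not the library's exact implementation):

def exact_match(expected: str, actual: str) -> float:
    # Case-insensitive comparison after stripping surrounding whitespace
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0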

Example

from honeyhive.evaluation import ExactMatchEvaluator

evaluator = ExactMatchEvaluator()

result = evaluator.evaluate(
    inputs={"expected": "The answer is 42"},
    outputs={"response": "The answer is 42"}
)
# Returns: {"exact_match": 1.0, "expected": "...", "actual": "..."}

# Case-insensitive matching
result = evaluator.evaluate(
    inputs={"expected": "hello"},
    outputs={"response": "HELLO"}
)
# Returns: {"exact_match": 1.0, ...}

F1ScoreEvaluator

Evaluates F1 score for text similarity.

class honeyhive.evaluation.evaluators.F1ScoreEvaluator(**kwargs)[source]

Bases: BaseEvaluator

Evaluator for F1 score calculation.

Parameters:

kwargs (Any)

__init__(**kwargs)[source]

Initialize the F1 score evaluator.

Parameters:

kwargs (Any)

Return type:

None

evaluate(inputs, outputs, ground_truth=None, **kwargs)[source]

Evaluate F1 score between expected and actual outputs.

Parameters:
  • inputs – Inputs containing the expected value

  • outputs – Outputs containing the actual value

  • ground_truth – Optional ground truth to compare against

  • kwargs (Any) – Additional evaluation parameters

Return type:

Dict[str, Any]

Description

The F1ScoreEvaluator computes the F1 score between predicted and ground truth text based on word-level token overlap. It calculates precision and recall and combines them into an F1 score.

Formula

precision = |predicted_words ∩ ground_truth_words| / |predicted_words|
recall = |predicted_words ∩ ground_truth_words| / |ground_truth_words|
f1_score = 2 * (precision * recall) / (precision + recall)
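
A minimal sketch of the word-level calculation above (tokenization is assumed to be simple whitespace splitting; the library's internal tokenization may differ):

def word_f1(prediction: str, ground_truth: str) -> float:
    # Compare the sets of words in the prediction and the ground truth
    pred_words = set(prediction.lower().split())
    truth_words = set(ground_truth.lower().split())
    overlap = len(pred_words & truth_words)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(truth_words)
    return 2 * precision * recall / (precision + recall)

# word_f1("the fast brown fox", "the quick brown fox") -> 0.75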

Example

from honeyhive.evaluation import F1ScoreEvaluator

evaluator = F1ScoreEvaluator()

result = evaluator.evaluate(
    inputs={"expected": "the quick brown fox"},
    outputs={"response": "the fast brown fox"}
)
# Returns: {"f1_score": 0.75}  # 3 out of 4 words match

SemanticSimilarityEvaluator

Evaluates semantic similarity between expected and actual outputs.

class honeyhive.evaluation.evaluators.SemanticSimilarityEvaluator(**kwargs)[source]

Bases: BaseEvaluator

Evaluator for semantic similarity using basic heuristics.

Parameters:

kwargs (Any)

__init__(**kwargs)[source]

Initialize the semantic similarity evaluator.

Parameters:

kwargs (Any)

Return type:

None

evaluate(inputs, outputs, ground_truth=None, **kwargs)[source]

Evaluate semantic similarity between expected and actual outputs.

Parameters:
  • inputs – Inputs containing the expected value

  • outputs – Outputs containing the actual value

  • ground_truth – Optional ground truth to compare against

  • kwargs (Any) – Additional evaluation parameters

Return type:

Dict[str, Any]

Description

The SemanticSimilarityEvaluator scores how semantically close the actual output is to the expected output. It is more forgiving than exact match or F1 score because it is intended to capture meaning rather than only token overlap.
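
One common way to produce such a score is cosine similarity over text embeddings. The sketch below illustrates the idea with a hypothetical get_embedding helper; it is not the evaluator's internal implementation (the class docstring notes it uses basic heuristics):

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_score(expected: str, actual: str, threshold: float = 0.8):
    # get_embedding is a hypothetical helper backed by any embedding model
    similarity = cosine_similarity(get_embedding(expected), get_embedding(actual))
    return {"similarity": similarity, "passed": similarity >= threshold}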

Example

from honeyhive.evaluation import SemanticSimilarityEvaluator

evaluator = SemanticSimilarityEvaluator(
    embedding_model="text-embedding-ada-002",
    threshold=0.8
)

result = evaluator.evaluate(
    inputs={"expected": "The weather is nice today"},
    outputs={"response": "It's a beautiful day outside"}
)
# Returns: {"similarity": 0.85, "passed": True}

Evaluation Decorators

evaluator

Decorator for defining synchronous evaluators.

honeyhive.evaluation.evaluators.evaluator(_name=None, _session_id=None, **_kwargs)[source]

Decorator for synchronous evaluation functions.

Parameters:
  • name – Evaluation name

  • session_id – Session ID for tracing

  • **kwargs – Additional evaluation parameters

  • _name (str | None)

  • _session_id (str | None)

  • _kwargs (Any)

Return type:

Callable[[Callable], Callable]

Description

The evaluator decorator converts a regular function into an evaluator that can be used with the HoneyHive evaluation system.

Example

from honeyhive import evaluator

@evaluator
def length_check(inputs, outputs, ground_truth=None, min_length=10):
    """Check if output meets minimum length requirement."""
    text = outputs.get("response", "")
    length = len(text)

    return {
        "length": length,
        "meets_minimum": length >= min_length,
        "score": 1.0 if length >= min_length else 0.0
    }

# Use in evaluation
from honeyhive import evaluate

results = evaluate(
    data=[{"input": "test"}],
    task=lambda x: {"response": "short"},
    evaluators=[length_check]
)

aevaluator

Decorator for defining asynchronous evaluators.

honeyhive.evaluation.evaluators.aevaluator(_name=None, _session_id=None, **_kwargs)[source]

Decorator for asynchronous evaluation functions.

Parameters:
  • name – Evaluation name

  • session_id – Session ID for tracing

  • **kwargs – Additional evaluation parameters

  • _name (str | None)

  • _session_id (str | None)

  • _kwargs (Any)

Return type:

Callable[[Callable], Callable]

Description

The aevaluator decorator is used for async evaluators that need to make asynchronous calls (e.g., API calls for LLM-based evaluation).

Example

from honeyhive import aevaluator
import aiohttp
import os

@aevaluator
async def llm_grader(inputs, outputs, ground_truth=None):
    """Use an LLM to grade the output."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json={
                "model": "gpt-4",
                "messages": [{
                    "role": "user",
                    "content": f"Grade this output: {outputs['response']}"
                }]
            }
        ) as response:
            result = await response.json()
            # parse_grade is a user-defined helper that extracts a numeric
            # grade (0-100) from the chat completion response
            grade = parse_grade(result)

            return {
                "grade": grade,
                "score": grade / 100.0
            }

Utility Classes

EvaluatorMeta

Metaclass for evaluator type handling.

class honeyhive.experiments.evaluators.EvaluatorMeta[source]

Bases: type

Metaclass for evaluator accessor pattern.

TerminalColors

Terminal color constants for formatted output.

class honeyhive.experiments.evaluators.TerminalColors[source]

Bases: object

ANSI terminal color codes for output formatting.

HEADER = '\x1b[95m'
OKBLUE = '\x1b[94m'
OKCYAN = '\x1b[96m'
OKGREEN = '\x1b[92m'
WARNING = '\x1b[93m'
FAIL = '\x1b[91m'
ENDC = '\x1b[0m'
BOLD = '\x1b[1m'
UNDERLINE = '\x1b[4m'
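
Example

A minimal usage sketch; these ANSI codes are written directly to the terminal and reset with ENDC:

from honeyhive.experiments.evaluators import TerminalColors

print(f"{TerminalColors.OKGREEN}All evaluations passed{TerminalColors.ENDC}")
print(f"{TerminalColors.WARNING}Low score on 2 test cases{TerminalColors.ENDC}")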

Data Models

EvaluationResult

Result model for evaluation outputs.

class honeyhive.evaluation.evaluators.EvaluationResult(score, metrics, feedback=None, metadata=None, evaluation_id=<factory>, timestamp=None)[source]

Bases: object

Result of an evaluation.

Parameters:
  • score (float)

  • metrics (Dict[str, Any])

  • feedback (str | None) – defaults to None

  • metadata (Dict[str, Any] | None) – defaults to None

  • evaluation_id (str) – defaults to a generated ID

  • timestamp (str | None) – defaults to None

Fields

  • score (float): Numeric score from evaluation

  • metrics (Dict[str, Any]): Additional metrics

  • feedback (Optional[str]): Text feedback

  • metadata (Optional[Dict[str, Any]]): Additional metadata

  • evaluation_id (str): Unique ID for this evaluation

  • timestamp (Optional[str]): Timestamp of evaluation

Example

from honeyhive.evaluation import EvaluationResult

result = EvaluationResult(
    score=0.85,
    metrics={"accuracy": 0.9, "latency": 250},
    feedback="Good response, minor improvements possible",
    metadata={"model": "gpt-4", "version": "1.0"}
)

EvaluationContext

Context information for evaluation runs.

class honeyhive.evaluation.evaluators.EvaluationContext(project, source, session_id=None, metadata=None)[source]

Bases: object

Context for evaluation runs.

Parameters:
  • project (str)

  • source (str)

  • session_id (str | None) – defaults to None

  • metadata (Dict[str, Any] | None) – defaults to None

Fields

  • project (str): Project name

  • source (str): Source of evaluation

  • session_id (Optional[str]): Session identifier

  • metadata (Optional[Dict[str, Any]]): Additional context

Example

from honeyhive.evaluation import EvaluationContext

context = EvaluationContext(
    project="my-llm-app",
    source="production",
    session_id="session-123",
    metadata={"user_id": "user-456"}
)

Evaluation Functions

evaluate

Main function for running evaluations.

honeyhive.evaluation.evaluators.evaluate(prediction, ground_truth, metrics=None, **kwargs)[source]

Evaluate a prediction against ground truth.

Parameters:
  • prediction (str) – Model prediction

  • ground_truth (str) – Ground truth value

  • metrics (List[str] | None) – List of metrics to compute

  • **kwargs (Any) – Additional evaluation parameters

Returns:

Evaluation result

Return type:

EvaluationResult

Description

The evaluate function runs a set of evaluators on your task outputs, collecting metrics and results for analysis. Note that the signature documented above takes a single prediction and ground-truth string and returns an EvaluationResult, while the parameters, return value, and example below describe the batch entry point (imported as honeyhive.evaluate in the example), which runs evaluators over a dataset of task outputs.
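
For the single-prediction form documented above, a minimal call might look like the following sketch (the metrics parameter is omitted because the accepted metric names are not listed in this reference):

from honeyhive.evaluation.evaluators import evaluate

result = evaluate(
    prediction="The answer is 42",
    ground_truth="The answer is 42",
)
print(result.score)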

Parameters

  • data (List[Dict]): Input data for evaluation

  • task (Callable): Function that produces outputs

  • evaluators (List): List of evaluator functions or objects

  • project (str, optional): Project name

  • run_name (str, optional): Name for this evaluation run

  • metadata (Dict, optional): Additional metadata

Returns

Dict containing:

  • results: List of evaluation results

  • metrics: Aggregated metrics

  • summary: Summary statistics

Example

from honeyhive import evaluate, evaluator

@evaluator
def check_length(inputs, outputs, min_words=5):
    words = len(outputs["response"].split())
    return {
        "word_count": words,
        "meets_minimum": words >= min_words,
        "score": 1.0 if words >= min_words else 0.0
    }

# Define your task
def my_task(input_data):
    # Your LLM logic here
    return {"response": "Generated response"}

# Run evaluation
results = evaluate(
    data=[
        {"prompt": "What is AI?"},
        {"prompt": "Explain ML"},
    ],
    task=my_task,
    evaluators=[check_length],
    project="my-project",
    run_name="baseline-eval"
)

print(f"Average score: {results['metrics']['average_score']}")
print(f"Pass rate: {results['metrics']['pass_rate']}")

See Also