Evaluators Reference
====================

Complete reference for all evaluation classes and functions in HoneyHive.

.. contents:: Table of Contents
   :local:
   :depth: 2

Base Classes
------------

BaseEvaluator
~~~~~~~~~~~~~

Base class for all custom evaluators.

.. autoclass:: honeyhive.evaluation.evaluators.BaseEvaluator
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__, __call__

Example
^^^^^^^

.. code-block:: python

   from honeyhive.evaluation import BaseEvaluator

   class CustomEvaluator(BaseEvaluator):
       def __init__(self, threshold=0.5, **kwargs):
           super().__init__("custom_evaluator", **kwargs)
           self.threshold = threshold

       def evaluate(self, inputs, outputs, ground_truth=None, **kwargs):
           # Custom evaluation logic
           score = self._compute_score(outputs)
           return {
               "score": score,
               "passed": score >= self.threshold,
           }

Built-in Evaluators
-------------------

ExactMatchEvaluator
~~~~~~~~~~~~~~~~~~~

Evaluates exact string matching between expected and actual outputs.

.. autoclass:: honeyhive.evaluation.evaluators.ExactMatchEvaluator
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__

Description
^^^^^^^^^^^

The ExactMatchEvaluator checks if the actual output exactly matches the
expected output. String comparisons are case-insensitive and whitespace is
stripped.

Example
^^^^^^^

.. code-block:: python

   from honeyhive.evaluation import ExactMatchEvaluator

   evaluator = ExactMatchEvaluator()

   result = evaluator.evaluate(
       inputs={"expected": "The answer is 42"},
       outputs={"response": "The answer is 42"},
   )
   # Returns: {"exact_match": 1.0, "expected": "...", "actual": "..."}

   # Case-insensitive matching
   result = evaluator.evaluate(
       inputs={"expected": "hello"},
       outputs={"response": "HELLO"},
   )
   # Returns: {"exact_match": 1.0, ...}

F1ScoreEvaluator
~~~~~~~~~~~~~~~~

Evaluates F1 score for text similarity.

.. autoclass:: honeyhive.evaluation.evaluators.F1ScoreEvaluator
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__

Description
^^^^^^^^^^^

The F1ScoreEvaluator computes the F1 score between predicted and ground truth
text based on word-level token overlap. It calculates precision and recall and
combines them into a single F1 score.

Formula
^^^^^^^

.. code-block:: text

   precision = |predicted_words ∩ ground_truth_words| / |predicted_words|
   recall    = |predicted_words ∩ ground_truth_words| / |ground_truth_words|
   f1_score  = 2 * (precision * recall) / (precision + recall)

Example
^^^^^^^

.. code-block:: python

   from honeyhive.evaluation import F1ScoreEvaluator

   evaluator = F1ScoreEvaluator()

   result = evaluator.evaluate(
       inputs={"expected": "the quick brown fox"},
       outputs={"response": "the fast brown fox"},
   )
   # Returns: {"f1_score": 0.75}  # 3 out of 4 words match

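
To make the formula concrete, the snippet below reproduces the word-level F1
calculation by hand for the example above. This is an illustrative sketch of
the token-overlap arithmetic only, not the evaluator's internal code.

.. code-block:: python

   # Word-level F1 computed by hand for the example above.
   predicted = set("the fast brown fox".split())
   ground_truth = set("the quick brown fox".split())

   overlap = predicted & ground_truth           # {"the", "brown", "fox"} -> 3 words
   precision = len(overlap) / len(predicted)    # 3 / 4 = 0.75
   recall = len(overlap) / len(ground_truth)    # 3 / 4 = 0.75
   f1_score = 2 * precision * recall / (precision + recall)

   print(f1_score)  # 0.75, matching the result returned by the evaluator
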

SemanticSimilarityEvaluator
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Evaluates semantic similarity using embeddings.

.. autoclass:: honeyhive.evaluation.evaluators.SemanticSimilarityEvaluator
   :members:
   :undoc-members:
   :show-inheritance:
   :special-members: __init__

Description
^^^^^^^^^^^

The SemanticSimilarityEvaluator uses embeddings to compute semantic similarity
between texts. This is more sophisticated than exact match or F1 score, as it
compares meaning rather than just token overlap.

Example
^^^^^^^

.. code-block:: python

   from honeyhive.evaluation import SemanticSimilarityEvaluator

   evaluator = SemanticSimilarityEvaluator(
       embedding_model="text-embedding-ada-002",
       threshold=0.8,
   )

   result = evaluator.evaluate(
       inputs={"expected": "The weather is nice today"},
       outputs={"response": "It's a beautiful day outside"},
   )
   # Returns: {"similarity": 0.85, "passed": True}

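
Embedding-based similarity scores of this kind are typically cosine
similarities between the two embedding vectors. The sketch below illustrates
that computation on small placeholder vectors; it is not the evaluator's
actual implementation, and the vector values are made up for illustration
(real embedding models return vectors with hundreds or thousands of
dimensions).

.. code-block:: python

   import math

   def cosine_similarity(a, b):
       """Cosine similarity between two equal-length vectors."""
       dot = sum(x * y for x, y in zip(a, b))
       norm_a = math.sqrt(sum(x * x for x in a))
       norm_b = math.sqrt(sum(x * x for x in b))
       return dot / (norm_a * norm_b)

   # Placeholder 4-dimensional "embeddings" standing in for the vectors an
   # embedding model would return for the expected and actual texts.
   expected_vec = [0.1, 0.3, 0.7, 0.2]
   response_vec = [0.2, 0.25, 0.65, 0.15]

   similarity = cosine_similarity(expected_vec, response_vec)
   passed = similarity >= 0.8  # compare against the configured threshold
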

Evaluation Decorators
---------------------

evaluator
~~~~~~~~~

Decorator for defining synchronous evaluators.

.. autofunction:: honeyhive.evaluation.evaluators.evaluator

Description
^^^^^^^^^^^

The ``evaluator`` decorator converts a regular function into an evaluator that
can be used with the HoneyHive evaluation system.

Example
^^^^^^^

.. code-block:: python

   from honeyhive import evaluator

   @evaluator
   def length_check(inputs, outputs, ground_truth=None, min_length=10):
       """Check if output meets minimum length requirement."""
       text = outputs.get("response", "")
       length = len(text)
       return {
           "length": length,
           "meets_minimum": length >= min_length,
           "score": 1.0 if length >= min_length else 0.0,
       }

   # Use in evaluation
   from honeyhive import evaluate

   results = evaluate(
       data=[{"input": "test"}],
       task=lambda x: {"response": "short"},
       evaluators=[length_check],
   )

aevaluator
~~~~~~~~~~

Decorator for defining asynchronous evaluators.

.. autofunction:: honeyhive.evaluation.evaluators.aevaluator

Description
^^^^^^^^^^^

The ``aevaluator`` decorator is used for async evaluators that need to make
asynchronous calls (e.g., API calls for LLM-based evaluation).

Example
^^^^^^^

.. code-block:: python

   import os

   import aiohttp

   from honeyhive import aevaluator

   @aevaluator
   async def llm_grader(inputs, outputs, ground_truth=None):
       """Use an LLM to grade the output."""
       async with aiohttp.ClientSession() as session:
           async with session.post(
               "https://api.openai.com/v1/chat/completions",
               headers={
                   # The OpenAI API requires authentication; the key is read
                   # from the OPENAI_API_KEY environment variable here.
                   "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
               },
               json={
                   "model": "gpt-4",
                   "messages": [{
                       "role": "user",
                       "content": f"Grade this output: {outputs['response']}",
                   }],
               },
           ) as response:
               result = await response.json()

       # parse_grade is a user-defined helper that extracts a numeric
       # grade (0-100) from the model's response.
       grade = parse_grade(result)
       return {
           "grade": grade,
           "score": grade / 100.0,
       }

EvaluatorMeta
~~~~~~~~~~~~~

Metaclass for evaluator type handling.

.. autoclass:: honeyhive.experiments.evaluators.EvaluatorMeta
   :members:
   :undoc-members:
   :show-inheritance:

TerminalColors
~~~~~~~~~~~~~~

Terminal color constants for formatted output.

.. autoclass:: honeyhive.experiments.evaluators.TerminalColors
   :members:
   :undoc-members:
   :show-inheritance:

Data Models
-----------

EvaluationResult
~~~~~~~~~~~~~~~~

Result model for evaluation outputs.

.. autoclass:: honeyhive.evaluation.evaluators.EvaluationResult
   :members:
   :undoc-members:
   :show-inheritance:

Fields
^^^^^^

- **score** (float): Numeric score from evaluation
- **metrics** (Dict[str, Any]): Additional metrics
- **feedback** (Optional[str]): Text feedback
- **metadata** (Optional[Dict[str, Any]]): Additional metadata
- **evaluation_id** (str): Unique ID for this evaluation
- **timestamp** (Optional[str]): Timestamp of evaluation

Example
^^^^^^^

.. code-block:: python

   from honeyhive.evaluation import EvaluationResult

   result = EvaluationResult(
       score=0.85,
       metrics={"accuracy": 0.9, "latency": 250},
       feedback="Good response, minor improvements possible",
       metadata={"model": "gpt-4", "version": "1.0"},
   )

EvaluationContext
~~~~~~~~~~~~~~~~~

Context information for evaluation runs.

.. autoclass:: honeyhive.evaluation.evaluators.EvaluationContext
   :members:
   :undoc-members:
   :show-inheritance:

Fields
^^^^^^

- **project** (str): Project name
- **source** (str): Source of evaluation
- **session_id** (Optional[str]): Session identifier
- **metadata** (Optional[Dict[str, Any]]): Additional context

Example
^^^^^^^

.. code-block:: python

   from honeyhive.evaluation import EvaluationContext

   context = EvaluationContext(
       project="my-llm-app",
       source="production",
       session_id="session-123",
       metadata={"user_id": "user-456"},
   )

Evaluation Functions
--------------------

evaluate
~~~~~~~~

Main function for running evaluations.

.. autofunction:: honeyhive.evaluation.evaluators.evaluate

Description
^^^^^^^^^^^

The ``evaluate`` function runs a set of evaluators on your task outputs,
collecting metrics and results for analysis.

Parameters
^^^^^^^^^^

- **data** (List[Dict]): Input data for evaluation
- **task** (Callable): Function that produces outputs
- **evaluators** (List): List of evaluator functions or objects
- **project** (str, optional): Project name
- **run_name** (str, optional): Name for this evaluation run
- **metadata** (Dict, optional): Additional metadata

Returns
^^^^^^^

Dict containing:

- **results**: List of evaluation results
- **metrics**: Aggregated metrics
- **summary**: Summary statistics

Example
^^^^^^^

.. code-block:: python

   from honeyhive import evaluate, evaluator

   @evaluator
   def check_length(inputs, outputs, min_words=5):
       words = len(outputs["response"].split())
       return {
           "word_count": words,
           "meets_minimum": words >= min_words,
           "score": 1.0 if words >= min_words else 0.0,
       }

   # Define your task
   def my_task(input_data):
       # Your LLM logic here
       return {"response": "Generated response"}

   # Run evaluation
   results = evaluate(
       data=[
           {"prompt": "What is AI?"},
           {"prompt": "Explain ML"},
       ],
       task=my_task,
       evaluators=[check_length],
       project="my-project",
       run_name="baseline-eval",
   )

   print(f"Average score: {results['metrics']['average_score']}")
   print(f"Pass rate: {results['metrics']['pass_rate']}")

See Also
--------

- :doc:`/reference/experiments/experiments` - Experiments API
- :doc:`/tutorials/05-run-first-experiment` - Evaluation tutorial
- :doc:`/how-to/evaluation/creating-evaluators` - Creating custom evaluators
- :doc:`/how-to/evaluation/best-practices` - Evaluation best practices