Creating Evaluators
===================
How do I create custom metrics to score my LLM outputs?
-------------------------------------------------------
Use the ``@evaluator`` decorator to turn an ordinary Python function into a scoring metric that ``evaluate()`` runs against each datapoint.
What's the simplest evaluator I can create?
-------------------------------------------
**Simple Function with @evaluator Decorator**
.. code-block:: python
from honeyhive.experiments import evaluator
@evaluator()
def exact_match(outputs, inputs, ground_truth):
"""Check if output matches expected result."""
expected = ground_truth.get("answer", "")
actual = outputs.get("answer", "")
# Return a score (0.0 to 1.0)
return 1.0 if actual == expected else 0.0
**Use it in evaluate():**
.. code-block:: python
from typing import Any, Dict
from honeyhive.experiments import evaluate, evaluator
# Your evaluation function
def my_llm_app(datapoint: Dict[str, Any]) -> Dict[str, Any]:
"""Processes datapoint and returns outputs."""
inputs = datapoint.get("inputs", {})
result = call_llm(inputs["prompt"])  # call_llm stands in for your model or pipeline call
return {"answer": result} # This becomes 'outputs' in evaluator
# Your evaluator
@evaluator()
def exact_match(outputs, inputs, ground_truth):
"""Evaluator receives output from my_llm_app + datapoint context."""
# outputs = {"answer": result} from my_llm_app
# inputs = datapoint["inputs"]
# ground_truth = datapoint["ground_truth"]
expected = ground_truth.get("answer", "")
actual = outputs.get("answer", "")
return 1.0 if actual == expected else 0.0
# Run evaluation
result = evaluate(
function=my_llm_app, # Produces 'outputs'
dataset=dataset, # Contains 'inputs' and 'ground_truth'
evaluators=[exact_match], # Receives all three
api_key="your-api-key",
project="your-project"
)
.. important::
**How Evaluators Are Invoked**
For each datapoint in your dataset, ``evaluate()`` does the following:
1. **Calls your evaluation function** with the datapoint
2. **Gets the output** (return value from your function)
3. **Invokes each evaluator** with:
- ``outputs`` = return value from your evaluation function
- ``inputs`` = ``datapoint["inputs"]`` from the dataset
- ``ground_truth`` = ``datapoint["ground_truth"]`` from the dataset
This allows evaluators to compare what your function produced (``outputs``) against what was expected (``ground_truth``), with access to the original inputs for context.
**Visual Flow Diagram**
.. mermaid::
%%{init: {'theme':'base', 'themeVariables': {'primaryColor': '#4F81BD', 'primaryTextColor': '#ffffff', 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'mainBkg': 'transparent', 'secondBkg': 'transparent', 'tertiaryColor': 'transparent', 'clusterBkg': 'transparent', 'clusterBorder': '#333333', 'edgeLabelBackground': 'transparent', 'background': 'transparent'}, 'flowchart': {'linkColor': '#333333', 'linkWidth': 2}}}%%
flowchart TD
Start([Dataset with Datapoints]) --> Loop{For Each Datapoint}
Loop --> Extract["Extract Components:<br/>inputs = datapoint['inputs']<br/>ground_truth = datapoint['ground_truth']"]
Extract --> EvalFunc["Call Evaluation Function:<br/>my_llm_app(datapoint)"]
EvalFunc --> Output["Function Returns:<br/>outputs = {'answer': result}"]
Output --> Evaluator["Call Each Evaluator:<br/>evaluator(outputs, inputs, ground_truth)"]
Evaluator --> Score["Evaluator Returns:<br/>score or score + metadata"]
Score --> Store[Store Results in HoneyHive]
Store --> Loop
Loop -->|Done| End([Experiment Complete])
classDef startEnd fill:#1565c0,stroke:#333333,stroke-width:2px,color:#ffffff
classDef process fill:#42a5f5,stroke:#333333,stroke-width:2px,color:#ffffff
classDef action fill:#7b1fa2,stroke:#333333,stroke-width:2px,color:#ffffff
classDef success fill:#2e7d32,stroke:#333333,stroke-width:2px,color:#ffffff
class Start,End startEnd
class Extract,Output,Store process
class EvalFunc action
class Evaluator success
**Example Mapping:**
.. code-block:: python
# Dataset datapoint
datapoint = {
"inputs": {"prompt": "What is AI?"},
"ground_truth": {"answer": "Artificial Intelligence"}
}
# Step 1: evaluate() calls your function
outputs = my_llm_app(datapoint)
# outputs = {"answer": "AI is Artificial Intelligence"}
# Step 2: evaluate() calls your evaluator
score = exact_match(
outputs=outputs, # From function
inputs=datapoint["inputs"], # From dataset
ground_truth=datapoint["ground_truth"] # From dataset
)
# score = 1.0 (match found)
What parameters must my evaluator accept?
-----------------------------------------
**(outputs, inputs, ground_truth) in That Order**
.. code-block:: python
@evaluator()
def my_evaluator(outputs, inputs, ground_truth):
"""Evaluator function.
Args:
outputs (dict): Return value from your function
inputs (dict): Inputs from the datapoint
ground_truth (dict): Expected outputs from datapoint
Returns:
float, bool, or dict: Score or detailed results
"""
# Your scoring logic
score = calculate_score(outputs, ground_truth)
return score
.. important::
**Parameter Order Matters!**
1. ``outputs`` (required) - What your function returned
2. ``inputs`` (optional) - Original inputs
3. ``ground_truth`` (optional) - Expected outputs
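Even when your metric only needs ``outputs``, it is simplest to keep all three parameters in the signature and ignore the ones you do not use. A minimal sketch (the evaluator name and logic here are illustrative):

.. code-block:: python

    from honeyhive.experiments import evaluator

    @evaluator()
    def answer_not_empty(outputs, inputs, ground_truth):
        """Uses only ``outputs``; ``inputs`` and ``ground_truth`` are accepted but ignored."""
        # Score 1.0 for any non-empty answer, 0.0 otherwise
        return 1.0 if outputs.get("answer", "").strip() else 0.0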
What can my evaluator return?
-----------------------------
**Float, Bool, or Dict**
.. code-block:: python
# Option 1: Return float (score only)
@evaluator()
def simple_score(outputs, inputs, ground_truth):
return 0.85 # Score between 0.0 and 1.0
# Option 2: Return bool (pass/fail)
@evaluator()
def pass_fail(outputs, inputs, ground_truth):
return len(outputs["answer"]) > 10 # Converts to 1.0 or 0.0
# Option 3: Return dict (RECOMMENDED - most informative)
@evaluator()
def detailed_score(outputs, inputs, ground_truth):
score = calculate_score(outputs)
return {
"score": score, # Required: 0.0 to 1.0
"passed": score >= 0.8,
"details": "answer too short",
"confidence": 0.95
}
Common Evaluator Patterns
-------------------------
**Exact Match**
.. code-block:: python
@evaluator()
def exact_match(outputs, inputs, ground_truth):
"""Check for exact string match."""
expected = ground_truth.get("answer", "").lower().strip()
actual = outputs.get("answer", "").lower().strip()
return {
"score": 1.0 if actual == expected else 0.0,
"matched": actual == expected,
"expected": expected,
"actual": actual
}
**Length Check**
.. code-block:: python
@evaluator()
def length_check(outputs, inputs, ground_truth):
"""Validate output length."""
text = outputs.get("answer", "")
word_count = len(text.split())
min_words = inputs.get("min_words", 10)
max_words = inputs.get("max_words", 200)
in_range = min_words <= word_count <= max_words
return {
"score": 1.0 if in_range else 0.5,
"word_count": word_count,
"in_range": in_range
}
**Contains Keywords**
.. code-block:: python
@evaluator()
def keyword_check(outputs, inputs, ground_truth):
"""Check if output contains required keywords."""
answer = outputs.get("answer", "").lower()
required_keywords = inputs.get("keywords", [])
found = [kw for kw in required_keywords if kw.lower() in answer]
score = len(found) / len(required_keywords) if required_keywords else 1.0  # no required keywords counts as a pass
return {
"score": score,
"found_keywords": found,
"missing_keywords": list(set(required_keywords) - set(found))
}
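**Numeric Tolerance** (an additional illustrative pattern)

For datasets where answers are numbers, an exact string match is usually too strict. The sketch below assumes the ``answer`` fields hold numeric values (floats or numeric strings) and scores anything within a 1% relative tolerance as correct:

.. code-block:: python

    import math

    from honeyhive.experiments import evaluator

    @evaluator()
    def numeric_match(outputs, inputs, ground_truth):
        """Score numeric answers that fall within 1% of the expected value."""
        try:
            expected = float(ground_truth.get("answer", ""))
            actual = float(outputs.get("answer", ""))
        except (TypeError, ValueError):
            return {"score": 0.0, "error": "answer is not numeric"}
        # Relative tolerance of 1%; absolute tolerance guards values near zero
        close = math.isclose(actual, expected, rel_tol=0.01, abs_tol=1e-9)
        return {
            "score": 1.0 if close else 0.0,
            "expected": expected,
            "actual": actual
        }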
How do I create evaluators with custom parameters?
--------------------------------------------------
**Use Factory Functions**
.. code-block:: python
def create_length_evaluator(min_words: int, max_words: int):
"""Factory for length evaluators with custom thresholds."""
@evaluator(name=f"length_{min_words}_{max_words}")
def length_validator(outputs, inputs, ground_truth):
text = outputs.get("answer", "")
word_count = len(text.split())
in_range = min_words <= word_count <= max_words
return {
"score": 1.0 if in_range else 0.5,
"word_count": word_count,
"target_range": f"{min_words}-{max_words}"
}
return length_validator
# Create different length checkers
short_answer = create_length_evaluator(10, 50)
medium_answer = create_length_evaluator(50, 200)
long_answer = create_length_evaluator(200, 1000)
# Use in evaluation
result = evaluate(
function=my_function,
dataset=dataset,
evaluators=[short_answer], # Use the configured evaluator
api_key="your-api-key",
project="your-project"
)
How do I use an LLM to evaluate quality?
----------------------------------------
**Call LLM in Evaluator Function**
.. code-block:: python
import openai
@evaluator()
def llm_judge(outputs, inputs, ground_truth):
"""Use GPT-4 to judge answer quality."""
client = openai.OpenAI()
prompt = f"""
Rate this answer on a scale of 0.0 to 1.0.
Question: {inputs['question']}
Expected: {ground_truth['answer']}
Actual: {outputs['answer']}
Consider: accuracy, completeness, clarity.
Respond with ONLY a JSON object:
{{"score": 0.0-1.0, "reasoning": "brief explanation"}}
"""
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0.0, # Deterministic
response_format={"type": "json_object"}
)
import json
result = json.loads(response.choices[0].message.content)
return result
.. warning::
**Cost Consideration**: LLM-as-judge evaluators make API calls for each datapoint.
- 100 datapoints = 100 judge-model calls
- Consider using a cheaper model for large datasets
- Or use sampling: only evaluate a subset of the data (see the sketch below)
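One way to sample, assuming ``dataset`` is an in-memory list of datapoint dicts (adapt this if you load datasets from the HoneyHive platform):

.. code-block:: python

    import random

    # Judge at most 25 randomly chosen datapoints instead of the full dataset
    sample_size = min(25, len(dataset))
    sampled_dataset = random.sample(dataset, k=sample_size)

    result = evaluate(
        function=my_llm_app,
        dataset=sampled_dataset,
        evaluators=[llm_judge],
        api_key="your-api-key",
        project="your-project"
    )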
How do I check multiple quality dimensions?
-------------------------------------------
**Weighted Scoring Across Criteria**
.. code-block:: python
@evaluator()
def comprehensive_quality(outputs, inputs, ground_truth):
"""Evaluate multiple quality dimensions."""
answer = outputs.get("answer", "")
# Individual criteria
has_answer = len(answer) > 0
correct_length = 50 <= len(answer) <= 200
no_profanity = not contains_profanity(answer) # Your function
factually_correct = check_facts(answer, ground_truth) # Your function
# Individual scores
criteria_scores = {
"has_answer": 1.0 if has_answer else 0.0,
"correct_length": 1.0 if correct_length else 0.5,
"no_profanity": 1.0 if no_profanity else 0.0,
"factually_correct": 1.0 if factually_correct else 0.0
}
# Weighted average (adjust weights for your use case)
weights = {
"has_answer": 1,
"correct_length": 1,
"no_profanity": 2, # More important
"factually_correct": 3 # Most important
}
total_weight = sum(weights.values())
weighted_sum = sum(criteria_scores[k] * weights[k] for k in criteria_scores)
final_score = weighted_sum / total_weight
return {
"score": final_score,
"criteria_scores": criteria_scores,
"all_passed": all(v == 1.0 for v in criteria_scores.values())
}
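The example above relies on two helpers that you would supply yourself. Purely as a sketch, here is one way to stub them, with a small blocklist and a substring check standing in for real moderation and fact verification:

.. code-block:: python

    # Hypothetical blocklist; replace with a real moderation list or API
    PROFANITY_BLOCKLIST = {"damn", "heck"}

    def contains_profanity(text: str) -> bool:
        """Naive blocklist check; swap in a moderation model for production use."""
        words = {word.strip(".,!?").lower() for word in text.split()}
        return bool(words & PROFANITY_BLOCKLIST)

    def check_facts(text: str, ground_truth: dict) -> bool:
        """Crude stand-in: count the answer as correct if it contains the expected answer verbatim."""
        expected = ground_truth.get("answer", "").lower()
        return bool(expected) and expected in text.lower()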
How do I check if answers are semantically similar?
---------------------------------------------------
**Use Embeddings and Cosine Similarity**
.. code-block:: python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load model once (outside evaluator for efficiency)
model = SentenceTransformer('all-MiniLM-L6-v2')
@evaluator()
def semantic_similarity(outputs, inputs, ground_truth):
"""Calculate semantic similarity using embeddings."""
expected = ground_truth.get("answer", "")
actual = outputs.get("answer", "")
# Generate embeddings
expected_emb = model.encode([expected])
actual_emb = model.encode([actual])
# Cosine similarity
similarity = cosine_similarity(expected_emb, actual_emb)[0][0]
return {
"score": float(similarity),
"passed": similarity >= 0.8,
"similarity": float(similarity)
}
.. note::
**Dependencies**: Install required packages:
.. code-block:: bash
pip install sentence-transformers scikit-learn
How do I run multiple evaluators on the same outputs?
-----------------------------------------------------
**Pass List of Evaluators**
.. code-block:: python
from honeyhive.experiments import evaluate, evaluator
@evaluator()
def accuracy(outputs, inputs, ground_truth):
return 1.0 if outputs["answer"] == ground_truth["answer"] else 0.0
@evaluator()
def length_check(outputs, inputs, ground_truth):
return 1.0 if 10 <= len(outputs["answer"]) <= 200 else 0.5
@evaluator()
def has_sources(outputs, inputs, ground_truth):
return 1.0 if "sources" in outputs else 0.0
# Run all evaluators
result = evaluate(
function=my_function,
dataset=dataset,
evaluators=[accuracy, length_check, has_sources],
api_key="your-api-key",
project="your-project"
)
# Each evaluator's results stored as separate metrics
What if my evaluator encounters errors?
---------------------------------------
**Add Try-Except Blocks**
.. code-block:: python
@evaluator()
def robust_evaluator(outputs, inputs, ground_truth):
"""Evaluator with error handling."""
try:
# Your evaluation logic
score = calculate_score(outputs, ground_truth)
return {"score": score}
except KeyError as e:
# Missing expected key
return {
"score": 0.0,
"error": f"Missing key: {e}",
"error_type": "KeyError"
}
except ValueError as e:
# Invalid value
return {
"score": 0.0,
"error": f"Invalid value: {e}",
"error_type": "ValueError"
}
except Exception as e:
# General error
return {
"score": 0.0,
"error": str(e),
"error_type": type(e).__name__
}
Best Practices
--------------
**Keep Evaluators Pure**
.. code-block:: python
# ✅ Good: Pure function, no side effects
@evaluator()
def good_evaluator(outputs, inputs, ground_truth):
score = calculate_score(outputs, ground_truth)
return {"score": score}
# ❌ Bad: Has side effects
@evaluator()
def bad_evaluator(outputs, inputs, ground_truth):
database.save(outputs) # Side effect!
score = calculate_score(outputs, ground_truth)
return {"score": score}
**Handle Missing Data**
.. code-block:: python
@evaluator()
def safe_evaluator(outputs, inputs, ground_truth):
# Use .get() with defaults
answer = outputs.get("answer", "")
expected = ground_truth.get("answer", "") if ground_truth else ""
if not answer:
return {"score": 0.0, "reason": "No answer provided"}
if not expected:
return {"score": 0.5, "reason": "No ground truth available"}
# Continue with evaluation
score = compare(answer, expected)
return {"score": score}
**Use Descriptive Names**
.. code-block:: python
# ❌ Bad: Unclear name
@evaluator(name="eval1")
def e1(outputs, inputs, ground_truth):
return 0.5
# ✅ Good: Clear name
@evaluator(name="answer_length_50_200_words")
def check_answer_length(outputs, inputs, ground_truth):
word_count = len(outputs.get("answer", "").split())
return 1.0 if 50 <= word_count <= 200 else 0.5
See Also
--------
- :doc:`running-experiments` - Use evaluators in evaluate()
- :doc:`server-side-evaluators` - Configure evaluators in UI
- :doc:`best-practices` - Evaluation strategy design
- :doc:`../../reference/experiments/evaluators` - Complete @evaluator API reference