Tutorial 5: Run Your First Experiment ===================================== .. note:: **Tutorial** (15-20 minutes) This is a hands-on tutorial that takes you step-by-step through running your first experiment with HoneyHive. You'll create a working example and see results in the dashboard. What You'll Learn ----------------- By the end of this tutorial, you'll know how to: - Run an experiment with ``evaluate()`` - Structure test data with inputs and ground truths - **Create evaluators to automatically score outputs** - **View metrics and scores in HoneyHive dashboard** - Compare different versions of your function What You'll Build ----------------- A complete question-answering experiment with automated evaluation. You'll: 1. Create a baseline QA function 2. Test it against a dataset 3. **Add evaluators to automatically score outputs** 4. **Compare baseline vs improved version using metrics** 5. View results and metrics in HoneyHive dashboard Prerequisites ------------- Before starting this tutorial, you should: - Complete :doc:`01-setup-first-tracer` - Have Python 3.11 or higher installed - Have a HoneyHive API key - Basic familiarity with Python dictionaries If you haven't set up the SDK yet, go back to Tutorial 1. Step 1: Install and Setup ------------------------- First, create a new Python file for this tutorial: .. code-block:: bash touch my_first_experiment.py Add the necessary imports and setup: .. code-block:: python # my_first_experiment.py import os from typing import Any, Dict from honeyhive.experiments import evaluate # Set your API key os.environ["HH_API_KEY"] = "your-api-key-here" os.environ["HH_PROJECT"] = "experiments-tutorial" .. tip:: Store your API key in a ``.env`` file instead of hardcoding it. See :doc:`../how-to/deployment/production` for production best practices. Step 2: Define Your Function ---------------------------- Create a simple function that answers questions. This will be the function we test in our experiment: .. code-block:: python def answer_question(datapoint: Dict[str, Any]) -> Dict[str, Any]: """Answer a trivia question. This is the function we'll test in our experiment. Args: datapoint: Contains 'inputs' with the question Returns: Dictionary with the answer Note: The evaluation function can also accept a 'tracer' parameter if you need to access the tracer instance within your function for manual tracing: def answer_question(datapoint: Dict[str, Any], tracer: HoneyHiveTracer) -> Dict[str, Any]: # Use tracer for custom spans, enrichment, etc. pass """ inputs = datapoint.get("inputs", {}) question = inputs.get("question", "") # Simple logic: check for keywords # (In real use, you'd call an LLM here) if "capital" in question.lower() and "france" in question.lower(): answer = "Paris" elif "2+2" in question: answer = "4" elif "color" in question.lower() and "sky" in question.lower(): answer = "blue" else: answer = "I don't know" return { "answer": answer, "confidence": "high" if answer != "I don't know" else "low" } .. note:: This example uses simple logic for demonstration. In a real experiment, you'd call an LLM API (OpenAI, Anthropic, etc.) inside this function. Step 3: Create Your Test Dataset -------------------------------- Define a dataset with questions and expected answers: .. code-block:: python dataset = [ { "inputs": { "question": "What is the capital of France?" }, "ground_truth": { "answer": "Paris", "category": "geography" } }, { "inputs": { "question": "What is 2+2?" }, "ground_truth": { "answer": "4", "category": "math" } }, { "inputs": { "question": "What color is the sky?" }, "ground_truth": { "answer": "blue", "category": "science" } } ] **Understanding the Structure:** - ``inputs``: What your function receives - ``ground_truth``: The expected correct answers (used for evaluation) Step 4: Run Your Experiment --------------------------- Now run the experiment: .. code-block:: python result = evaluate( function=answer_question, dataset=dataset, name="qa-baseline-v1", verbose=True # Show progress ) print(f"\nāœ… Experiment complete!") print(f"šŸ“Š Run ID: {result.run_id}") print(f"šŸ“ˆ Status: {result.status}") **Run it:** .. code-block:: bash python my_first_experiment.py **Expected Output:** .. code-block:: text Processing datapoint 1/3... Processing datapoint 2/3... Processing datapoint 3/3... āœ… Experiment complete! šŸ“Š Run ID: run_abc123... šŸ“ˆ Status: completed Step 5: View Results in Dashboard --------------------------------- 1. Go to `HoneyHive Experiments Dashboard `_ 2. Navigate to your project: ``experiments-tutorial`` 3. Find your run: ``qa-baseline-v1`` 5. Click to view: - Session traces for each question - Function outputs - Ground truths - Session metadata **What You'll See:** - 3 sessions (one per datapoint) - Each session shows inputs and outputs - Ground truths displayed for comparison - Session names include your experiment name Step 6: Add Evaluators for Automated Scoring -------------------------------------------- Viewing results manually is helpful, but let's add **evaluators** to automatically score our function's outputs: .. code-block:: python def exact_match_evaluator( outputs: Dict[str, Any], inputs: Dict[str, Any], ground_truth: Dict[str, Any] ) -> float: """Check if answer exactly matches ground truth. Args: outputs: Function's output (from answer_question) inputs: Original inputs (not used here) ground_truth: Expected outputs Returns: 1.0 if exact match, 0.0 otherwise """ actual_answer = outputs.get("answer", "").lower().strip() expected_answer = ground_truth.get("answer", "").lower().strip() return 1.0 if actual_answer == expected_answer else 0.0 def confidence_evaluator( outputs: Dict[str, Any], inputs: Dict[str, Any], ground_truth: Dict[str, Any] ) -> float: """Check if confidence is appropriate. Returns: 1.0 if high confidence, 0.5 if low confidence confidence = outputs.get("confidence", "low") return 1.0 if confidence == "high" else 0.5 **Understanding Evaluators:** - **Input**: Receives ``(outputs, inputs, ground_truth)`` - **Output**: Returns a score (typically 0.0 to 1.0) - **Purpose**: Automated quality assessment - **Runs**: After function executes, for each datapoint Step 7: Run Experiment with Evaluators -------------------------------------- Now run the experiment with evaluators: .. code-block:: python result = evaluate( function=answer_question, dataset=dataset, evaluators=[exact_match_evaluator, confidence_evaluator], # Added! name="qa-baseline-with-metrics-v1", verbose=True ) print(f"\nāœ… Experiment complete!") print(f"šŸ“Š Run ID: {result.run_id}") print(f"šŸ“ˆ Status: {result.status}") # Access metrics if result.metrics: print(f"\nšŸ“Š Aggregated Metrics:") # Metrics stored in model_extra for Pydantic v2 extra_fields = getattr(result.metrics, "model_extra", {}) for metric_name, metric_value in extra_fields.items(): print(f" {metric_name}: {metric_value:.2f}") **Expected Output:** .. code-block:: text Processing datapoint 1/3... Processing datapoint 2/3... Processing datapoint 3/3... Running evaluators... āœ… Experiment complete! šŸ“Š Run ID: run_xyz789... šŸ“ˆ Status: completed šŸ“Š Aggregated Metrics: exact_match_evaluator: 1.00 confidence_evaluator: 1.00 Step 8: View Metrics in Dashboard --------------------------------- Go back to the HoneyHive dashboard: 1. Find your new run: ``qa-baseline-with-metrics-v1`` 2. Click to view details 3. You'll now see: - **Metrics tab**: Aggregated scores - **Per-datapoint metrics**: Individual scores - **Metric trends**: Compare across runs **What You'll See:** - Exact match score: 100% (3/3 correct) - Confidence score: 100% (all high confidence) - Metrics visualized as charts - Per-session metrics in session details Step 9: Test an Improvement --------------------------- Let's test an improved version WITH evaluators: .. code-block:: python def answer_question_improved(datapoint: Dict[str, Any]) -> Dict[str, Any]: """Improved version with better logic.""" inputs = datapoint.get("inputs", {}) question = inputs.get("question", "").lower() # More sophisticated keyword matching answers = { "capital of france": "Paris", "2+2": "4", "color of the sky": "blue", "color is the sky": "blue" } # Check each pattern for pattern, ans in answers.items(): if all(word in question for word in pattern.split()): return {"answer": ans, "confidence": "high"} return {"answer": "I don't know", "confidence": "low"} # Run improved version WITH EVALUATORS result_v2 = evaluate( function=answer_question_improved, dataset=dataset, evaluators=[exact_match_evaluator, confidence_evaluator], # Same evaluators! name="qa-improved-with-metrics-v1", verbose=True ) print(f"\nāœ… Improved version complete!") print(f"šŸ“Š Run ID: {result_v2.run_id}") # Compare metrics if result_v2.metrics: print(f"\nšŸ“Š Metrics:") extra_fields = getattr(result_v2.metrics, "model_extra", {}) for metric_name, metric_value in extra_fields.items(): print(f" {metric_name}: {metric_value:.2f}") Now you have TWO runs to compare! **Compare in the Dashboard OR via API:** .. code-block:: python # Option 1: View comparison in HoneyHive dashboard (visual) # Go to: https://app.honeyhive.ai/evaluate → Select runs → Click Compare # Option 2: Programmatic comparison via API from honeyhive.experiments import compare_runs from honeyhive import HoneyHive client = HoneyHive(api_key=os.environ["HH_API_KEY"]) comparison = compare_runs( client=client, new_run_id=result_v2.run_id, old_run_id=result.run_id ) print(f"\nProgrammatic Comparison:") print(f"Common datapoints: {comparison.common_datapoints}") print(f"Improved metrics: {comparison.list_improved_metrics()}") print(f"Degraded metrics: {comparison.list_degraded_metrics()}") # Access detailed metric deltas for metric_name, delta in comparison.metric_deltas.items(): old_val = delta.get("old_aggregate", 0) new_val = delta.get("new_aggregate", 0) change = new_val - old_val print(f"{metric_name}: {old_val:.2f} → {new_val:.2f} ({change:+.2f})") .. tip:: **Use both approaches:** - **Dashboard** for visual exploration and sharing with team - **API** for automated decision-making and CI/CD pipelines What You've Learned ------------------- Congratulations! You've: āœ… Created your first evaluation function āœ… Structured test data with inputs and ground truths āœ… **Created evaluators to automatically score outputs** āœ… Run experiments with ``evaluate()`` and evaluators āœ… Viewed results and metrics in HoneyHive dashboard āœ… **Compared runs using both dashboard and API** **Key Concepts:** - **Evaluation Function**: Your application logic under test - **Dataset**: Test cases with inputs and ground truths - **Evaluators**: Automated scoring functions - **Metrics**: Quantitative measurements of quality - **Comparison**: Compare runs via dashboard (visual) or API (programmatic) Next Steps ---------- Now that you understand the basics: - :doc:`../how-to/evaluation/creating-evaluators` - Add automated scoring - :doc:`../how-to/evaluation/comparing-experiments` - Compare runs statistically - :doc:`../how-to/evaluation/dataset-management` - Use datasets from HoneyHive UI - :doc:`../how-to/evaluation/best-practices` - Production experiment patterns Complete Code ------------- Here's the complete code from this tutorial: .. code-block:: python # my_first_experiment.py import os from typing import Any, Dict from honeyhive.experiments import evaluate os.environ["HH_API_KEY"] = "your-api-key-here" os.environ["HH_PROJECT"] = "experiments-tutorial" def answer_question(datapoint: Dict[str, Any]) -> Dict[str, Any]: """Answer a trivia question.""" inputs = datapoint.get("inputs", {}) question = inputs.get("question", "") if "capital" in question.lower() and "france" in question.lower(): answer = "Paris" elif "2+2" in question: answer = "4" elif "color" in question.lower() and "sky" in question.lower(): answer = "blue" else: answer = "I don't know" return {"answer": answer, "confidence": "high" if answer != "I don't know" else "low"} dataset = [ { "inputs": {"question": "What is the capital of France?"}, "ground_truth": {"answer": "Paris"} }, { "inputs": {"question": "What is 2+2?"}, "ground_truth": {"answer": "4"} }, { "inputs": {"question": "What color is the sky?"}, "ground_truth": {"answer": "blue"} } ] # Define evaluators def exact_match_evaluator( outputs: Dict[str, Any], inputs: Dict[str, Any], ground_truth: Dict[str, Any] ) -> float: """Check if answer exactly matches ground truth.""" actual = outputs.get("answer", "").lower().strip() expected = ground_truth.get("answer", "").lower().strip() return 1.0 if actual == expected else 0.0 def confidence_evaluator( outputs: Dict[str, Any], inputs: Dict[str, Any], ground_truth: Dict[str, Any] ) -> float: """Check if confidence is appropriate.""" confidence = outputs.get("confidence", "low") return 1.0 if confidence == "high" else 0.5 # Run experiment with evaluators result = evaluate( function=answer_question, dataset=dataset, evaluators=[exact_match_evaluator, confidence_evaluator], name="qa-baseline-with-metrics-v1", verbose=True ) print(f"\nāœ… Experiment complete! Run ID: {result.run_id}") # Print metrics if result.metrics: print(f"\nšŸ“Š Metrics:") extra_fields = getattr(result.metrics, "model_extra", {}) for metric_name, metric_value in extra_fields.items(): print(f" {metric_name}: {metric_value:.2f}")