Results Retrieval
Functions for retrieving and comparing experiment results from the backend.
get_run_result()
- get_run_result(client, run_id, aggregate_function='average')
Retrieve aggregated results for an experiment run from the backend.
The backend computes aggregated metrics across all datapoints using the specified aggregation function.
- Parameters:
  - client (HoneyHive): Client instance used to call the backend
  - run_id (str): ID of the experiment run to retrieve
  - aggregate_function (str): Aggregation applied across all datapoints (default: 'average')
- Returns:
Experiment result summary with aggregated metrics
- Return type:
Usage:
from honeyhive import HoneyHive
from honeyhive.experiments import get_run_result

client = HoneyHive(api_key="your-key")

result = get_run_result(client, run_id="run-abc-123")

print(f"Status: {result.status}")
print(f"Success: {result.success}")
print(f"Passed: {len(result.passed)}")
print(f"Failed: {len(result.failed)}")

# Access aggregated metrics
accuracy = result.metrics.get_metric("accuracy_evaluator")
print(f"Average accuracy: {accuracy}")
Custom Aggregation:
# Use median instead of average
result = get_run_result(
    client,
    run_id="run-123",
    aggregate_function="median",
)
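Aggregated results can also drive simple quality gates, for example in CI. The sketch below is a minimal, illustrative pattern (the MIN_ACCURACY threshold and exit handling are not part of the SDK); it relies only on get_run_result() and result.metrics.get_metric() as shown above.
import sys

from honeyhive import HoneyHive
from honeyhive.experiments import get_run_result

MIN_ACCURACY = 0.85  # illustrative threshold, tune per project

client = HoneyHive(api_key="your-key")
result = get_run_result(client, run_id="run-abc-123")

# Aggregated value for the named evaluator, as in the usage example above
accuracy = result.metrics.get_metric("accuracy_evaluator")

if accuracy is None or accuracy < MIN_ACCURACY:
    print(f"Accuracy gate failed: {accuracy}")
    sys.exit(1)
print(f"Accuracy gate passed: {accuracy}")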
get_run_metrics()
- get_run_metrics(client, run_id)
Retrieve raw (non-aggregated) metrics for an experiment run.
Returns the full metrics data from the backend without aggregation, useful for detailed analysis or custom aggregation.
- Parameters:
  - client (HoneyHive): Client instance used to call the backend
  - run_id (str): ID of the experiment run whose raw metrics to retrieve
- Returns:
Dictionary containing raw metrics data
- Return type:
Dict[str, Any]
Usage:
from honeyhive import HoneyHive
from honeyhive.experiments import get_run_metrics

client = HoneyHive(api_key="your-key")

metrics = get_run_metrics(client, run_id="run-abc-123")

# Raw metrics include per-datapoint data
print(f"Raw metrics: {metrics}")
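Because the shape of the raw payload depends on your project and evaluators, the sketch below is only illustrative: the "datapoints" key and per-metric field name are assumptions (inspect the returned dictionary for the actual layout). It pulls per-datapoint scores into a flat list and applies a custom aggregation locally.
import statistics

from honeyhive import HoneyHive
from honeyhive.experiments import get_run_metrics

client = HoneyHive(api_key="your-key")
metrics = get_run_metrics(client, run_id="run-abc-123")

# ASSUMPTION: this extraction stands in for however per-datapoint scores
# are laid out in your payload; adjust the keys after inspecting `metrics`.
scores = [dp.get("accuracy_evaluator") for dp in metrics.get("datapoints", [])]
scores = [s for s in scores if s is not None]

if scores:
    print(f"Median accuracy: {statistics.median(scores)}")
    print(f"Worst datapoint: {min(scores)}")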
compare_runs()
- compare_runs(client, new_run_id, old_run_id, aggregate_function='average')
Compare two experiment runs using backend aggregated comparison.
The backend identifies common datapoints between runs, computes metric deltas, and classifies changes as improvements or degradations.
- Parameters:
  - client (HoneyHive): Client instance used to call the backend
  - new_run_id (str): ID of the newer run (e.g. the variant)
  - old_run_id (str): ID of the older run (e.g. the baseline)
  - aggregate_function (str): Aggregation applied to each metric before comparison (default: 'average')
- Returns:
Comparison result with metric deltas and improvement analysis
- Return type:
Basic Comparison:
from honeyhive import HoneyHive
from honeyhive.experiments import compare_runs

client = HoneyHive(api_key="your-key")

comparison = compare_runs(
    client=client,
    new_run_id="run-v2",
    old_run_id="run-v1",
)

print(f"Common datapoints: {comparison.common_datapoints}")
print(f"Improved metrics: {comparison.list_improved_metrics()}")
print(f"Degraded metrics: {comparison.list_degraded_metrics()}")
Detailed Metric Analysis:
comparison = compare_runs(client, "run-new", "run-old")

# Check specific metric
accuracy_delta = comparison.get_metric_delta("accuracy")
if accuracy_delta:
    print(f"Old accuracy: {accuracy_delta['old_aggregate']}")
    print(f"New accuracy: {accuracy_delta['new_aggregate']}")
    print(f"Improved on {accuracy_delta['improved_count']} datapoints")
    print(f"Degraded on {accuracy_delta['degraded_count']} datapoints")

    # Get specific datapoint IDs
    print(f"Improved: {accuracy_delta['improved']}")
    print(f"Degraded: {accuracy_delta['degraded']}")
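To summarize every changed metric at once, the two listing helpers can be combined with get_metric_delta(). A minimal sketch, assuming the delta keys shown above ('old_aggregate', 'new_aggregate') and that the listing helpers return metric names:
comparison = compare_runs(client, "run-new", "run-old")

# Walk each metric the backend flagged as changed and print its shift
for name in comparison.list_improved_metrics() + comparison.list_degraded_metrics():
    delta = comparison.get_metric_delta(name)
    if not delta:
        continue
    change = delta["new_aggregate"] - delta["old_aggregate"]
    print(f"{name}: {delta['old_aggregate']} -> {delta['new_aggregate']} ({change:+.3f})")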
A/B Testing Pattern:
# Run baseline
baseline = evaluate(
    function=model_a,
    dataset=test_data,
    evaluators=[accuracy, latency],
    name="baseline-model-a",
    api_key="key",
    project="project",
)

# Run variant
variant = evaluate(
    function=model_b,
    dataset=test_data,
    evaluators=[accuracy, latency],
    name="variant-model-b",
    api_key="key",
    project="project",
)

# Compare
comparison = compare_runs(
    client,
    new_run_id=variant.run_id,
    old_run_id=baseline.run_id,
)

# Decision logic
improved = comparison.list_improved_metrics()
degraded = comparison.list_degraded_metrics()

if "accuracy" in improved and "latency" not in degraded:
    print("✅ Model B is better - deploy it!")
else:
    print("❌ Model A is still better - keep baseline")
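The same comparison can back an automated regression gate. The helper below is an illustrative pattern, not part of the SDK; the metric names and the split between blocking and tolerated metrics are assumptions to adapt to your project.
import sys

def fail_on_regression(comparison, required=("accuracy",), tolerated=("latency", "cost")):
    """Exit non-zero if a blocking metric degraded. Metric names are illustrative."""
    degraded = set(comparison.list_degraded_metrics())
    blocking = sorted(degraded.intersection(required))
    if blocking:
        print(f"Regression in required metrics: {blocking}")
        sys.exit(1)
    for name in sorted(degraded.intersection(tolerated)):
        print(f"Warning: {name} degraded (tolerated)")

comparison = compare_runs(client, new_run_id=variant.run_id, old_run_id=baseline.run_id)
fail_on_regression(comparison)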
Best Practices
1. Use Consistent Datasets for Comparison
# GOOD - Same dataset for both runs
dataset = load_test_dataset()
run1 = evaluate(function=model_v1, dataset=dataset, ...)
run2 = evaluate(function=model_v2, dataset=dataset, ...)
comparison = compare_runs(client, run2.run_id, run1.run_id)
2. Cache Results for Analysis
# Retrieve once, analyze many times
result = get_run_result(client, run_id)
# Multiple analyses without re-fetching
accuracy = result.metrics.get_metric("accuracy")
latency = result.metrics.get_metric("latency")
cost = result.metrics.get_metric("cost")
3. Handle Missing Metrics Gracefully
comparison = compare_runs(client, new_id, old_id)
# Some metrics might not exist in both runs
accuracy_delta = comparison.get_metric_delta("accuracy")
if accuracy_delta:
    print(f"Accuracy changed: {accuracy_delta['new_aggregate'] - accuracy_delta['old_aggregate']}")
else:
    print("Accuracy metric not found in both runs")
4. Use Appropriate Aggregation
# For accuracy/pass rates - use average
result = get_run_result(client, run_id, aggregate_function="average")
# For total cost - use sum
result = get_run_result(client, run_id, aggregate_function="sum")
# For worst-case analysis - use min/max
result = get_run_result(client, run_id, aggregate_function="min")
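If several views of the same run are useful, each aggregation can be fetched separately. A small sketch, assuming (as the comment above suggests) that both "min" and "max" are accepted aggregation names:
# Same metric under several aggregations for a quick profile
views = {}
for fn in ("average", "sum", "min", "max"):
    res = get_run_result(client, run_id, aggregate_function=fn)
    views[fn] = res.metrics.get_metric("latency")
print(views)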
See Also
Core Functions - Run experiments
Data Models - Result data models
Evaluation & Analysis Guides - Evaluation patterns