Results Retrieval

Functions for retrieving and comparing experiment results from the backend.

get_run_result()

get_run_result(client, run_id, aggregate_function='average')

Retrieve aggregated results for an experiment run from the backend.

The backend computes aggregated metrics across all datapoints using the specified aggregation function.

Parameters:
  • client (HoneyHive) – HoneyHive API client instance

  • run_id (str) – Experiment run ID

  • aggregate_function (str) – Aggregation method (e.g., "average", "sum", "min", "max")

Returns:

Experiment result summary with aggregated metrics

Return type:

ExperimentResultSummary

Usage:

from honeyhive import HoneyHive
from honeyhive.experiments import get_run_result

client = HoneyHive(api_key="your-key")

result = get_run_result(client, run_id="run-abc-123")

print(f"Status: {result.status}")
print(f"Success: {result.success}")
print(f"Passed: {len(result.passed)}")
print(f"Failed: {len(result.failed)}")

# Access aggregated metrics
accuracy = result.metrics.get_metric("accuracy_evaluator")
print(f"Average accuracy: {accuracy}")

Custom Aggregation:

# Use median instead of average
result = get_run_result(
    client,
    run_id="run-123",
    aggregate_function="median"
)

get_run_metrics()

get_run_metrics(client, run_id)

Retrieve raw (non-aggregated) metrics for an experiment run.

Returns the full metrics data from the backend without aggregation, useful for detailed analysis or custom aggregation.

Parameters:
  • client (HoneyHive) – HoneyHive API client instance

  • run_id (str) – Experiment run ID

Returns:

Dictionary containing raw metrics data

Return type:

Dict[str, Any]

Usage:

from honeyhive import HoneyHive
from honeyhive.experiments import get_run_metrics

client = HoneyHive(api_key="your-key")

metrics = get_run_metrics(client, run_id="run-abc-123")

# Raw metrics include per-datapoint data
print(f"Raw metrics: {metrics}")

compare_runs()

compare_runs(client, new_run_id, old_run_id, aggregate_function='average')

Compare two experiment runs using backend aggregated comparison.

The backend identifies common datapoints between runs, computes metric deltas, and classifies changes as improvements or degradations.

Parameters:
  • client (HoneyHive) – HoneyHive API client instance

  • new_run_id (str) – ID of the new (more recent) run

  • old_run_id (str) – ID of the old (baseline) run

  • aggregate_function (str) – Aggregation method (e.g., "average", "sum", "min", "max")

Returns:

Comparison result with metric deltas and improvement analysis

Return type:

RunComparisonResult

Basic Comparison:

from honeyhive import HoneyHive
from honeyhive.experiments import compare_runs

client = HoneyHive(api_key="your-key")

comparison = compare_runs(
    client=client,
    new_run_id="run-v2",
    old_run_id="run-v1"
)

print(f"Common datapoints: {comparison.common_datapoints}")
print(f"Improved metrics: {comparison.list_improved_metrics()}")
print(f"Degraded metrics: {comparison.list_degraded_metrics()}")

Detailed Metric Analysis:

comparison = compare_runs(client, "run-new", "run-old")

# Check specific metric
accuracy_delta = comparison.get_metric_delta("accuracy")

if accuracy_delta:
    print(f"Old accuracy: {accuracy_delta['old_aggregate']}")
    print(f"New accuracy: {accuracy_delta['new_aggregate']}")
    print(f"Improved on {accuracy_delta['improved_count']} datapoints")
    print(f"Degraded on {accuracy_delta['degraded_count']} datapoints")

    # Get specific datapoint IDs
    print(f"Improved: {accuracy_delta['improved']}")
    print(f"Degraded: {accuracy_delta['degraded']}")

A/B Testing Pattern:

# Run baseline
baseline = evaluate(
    function=model_a,
    dataset=test_data,
    evaluators=[accuracy, latency],
    name="baseline-model-a",
    api_key="key",
    project="project"
)

# Run variant
variant = evaluate(
    function=model_b,
    dataset=test_data,
    evaluators=[accuracy, latency],
    name="variant-model-b",
    api_key="key",
    project="project"
)

# Compare
comparison = compare_runs(
    client,
    new_run_id=variant.run_id,
    old_run_id=baseline.run_id
)

# Decision logic
improved = comparison.list_improved_metrics()
degraded = comparison.list_degraded_metrics()

if "accuracy" in improved and "latency" not in degraded:
    print("✅ Model B is better - deploy it!")
else:
    print("❌ Model A is still better - keep baseline")

Best Practices

1. Use Consistent Datasets for Comparison

# GOOD - Same dataset for both runs
dataset = load_test_dataset()

run1 = evaluate(function=model_v1, dataset=dataset, ...)
run2 = evaluate(function=model_v2, dataset=dataset, ...)

comparison = compare_runs(client, run2.run_id, run1.run_id)

2. Cache Results for Analysis

# Retrieve once, analyze many times
result = get_run_result(client, run_id)

# Multiple analyses without re-fetching
accuracy = result.metrics.get_metric("accuracy")
latency = result.metrics.get_metric("latency")
cost = result.metrics.get_metric("cost")

3. Handle Missing Metrics Gracefully

comparison = compare_runs(client, new_id, old_id)

# Some metrics might not exist in both runs
accuracy_delta = comparison.get_metric_delta("accuracy")

if accuracy_delta:
    print(f"Accuracy changed: {accuracy_delta['new_aggregate'] - accuracy_delta['old_aggregate']}")
else:
    print("Accuracy metric not found in both runs")

4. Use Appropriate Aggregation

# For accuracy/pass rates - use average
result = get_run_result(client, run_id, aggregate_function="average")

# For total cost - use sum
result = get_run_result(client, run_id, aggregate_function="sum")

# For worst-case analysis - use min/max
result = get_run_result(client, run_id, aggregate_function="min")

See Also