Data Models

Pydantic models for experiment runs, results, and comparisons.

ExperimentRunStatus

class ExperimentRunStatus

Enum representing the status of an experiment run.

Values:

  • PENDING - Run created but not started

  • RUNNING - Currently executing

  • COMPLETED - Finished successfully

  • FAILED - Execution failed

  • CANCELLED - Manually cancelled

Usage:

from honeyhive.experiments import ExperimentRunStatus

# result is an ExperimentResultSummary, e.g. from evaluate() or get_run_result()
if result.status == ExperimentRunStatus.COMPLETED:
    print("Experiment finished!")

MetricDatapoints

class MetricDatapoints

Model for tracking passed/failed datapoint IDs per metric.

Attributes:

  • passed (List[str]) - List of datapoint IDs that passed this metric

  • failed (List[str]) - List of datapoint IDs that failed this metric
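
Usage:

A minimal sketch, assuming a run result fetched with get_run_result and a metric returned in the new details format (MetricDatapoints is reached through a MetricDetail's datapoints field, described below):

from honeyhive.experiments import get_run_result

result = get_run_result(client, "run-123")

accuracy = result.metrics.get_metric("accuracy_evaluator")
if accuracy and accuracy.datapoints:
    # passed / failed hold the datapoint IDs for this metric
    print(f"Passed: {len(accuracy.datapoints.passed)} datapoints")
    print(f"Failed: {accuracy.datapoints.failed}")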

MetricDetail

class MetricDetail

Detailed information about a single metric result.

Attributes:

  • metric_name (str) - Name of the metric

  • metric_type (Optional[str]) - Type of metric (“numeric”, “boolean”, etc.)

  • event_name (Optional[str]) - Name of the event that generated this metric

  • event_type (Optional[str]) - Type of event (“model”, “tool”, etc.)

  • aggregate (Optional[Union[float, int, bool, str]]) - Aggregated value across all datapoints

  • values (Optional[List[Any]]) - Individual values per datapoint

  • datapoints (Optional[MetricDatapoints]) - Passed/failed datapoint tracking

Usage:

from honeyhive.experiments import get_run_result

result = get_run_result(client, "run-123")

# Get a specific metric detail
accuracy = result.metrics.get_metric("accuracy_evaluator")
if accuracy:
    print(f"Aggregate: {accuracy.aggregate}")
    print(f"Type: {accuracy.metric_type}")

DatapointMetric

class DatapointMetric

Individual metric value for a single datapoint.

Attributes:

  • name (str) - Name of the metric

  • event_name (Optional[str]) - Name of the event

  • event_type (Optional[str]) - Type of event

  • value (Optional[Union[float, int, bool, str]]) - Metric value for this datapoint

  • passed (Optional[bool]) - Whether this metric passed for this datapoint
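
Usage:

A minimal sketch, assuming a run result fetched with get_run_result; DatapointMetric objects appear in each DatapointResult's metrics list (described below):

from honeyhive.experiments import get_run_result

result = get_run_result(client, "run-123")

# Collect the metrics that failed for each datapoint
for datapoint in result.datapoints:
    failing = [m.name for m in (datapoint.metrics or []) if m.passed is False]
    if failing:
        print(f"{datapoint.datapoint_id} failed: {failing}")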

DatapointResult

class DatapointResult

Result for a single datapoint in an experiment run.

Attributes:

  • datapoint_id (Optional[str]) - Unique identifier for the datapoint

  • session_id (Optional[str]) - Session ID associated with this datapoint

  • passed (Optional[bool]) - Whether all metrics passed for this datapoint

  • metrics (Optional[List[DatapointMetric]]) - Individual metric results

Usage:

from honeyhive.experiments import get_run_result

result = get_run_result(client, "run-123")

for datapoint in result.datapoints:
    print(f"Datapoint: {datapoint.datapoint_id}")
    print(f"Passed: {datapoint.passed}")
    if datapoint.metrics:
        for metric in datapoint.metrics:
            print(f"  {metric.name}: {metric.value}")

AggregatedMetrics

class AggregatedMetrics

Aggregated experiment metrics, with support for both the new details array format and the legacy model_extra format for backward compatibility.

Attributes:

  • aggregation_function (Optional[str]) - Aggregation method used (“average”, “sum”, etc.)

  • details (List[MetricDetail]) - List of metric details from backend (new format)

Methods:

get_metric(metric_name: str) → MetricDetail | Dict[str, Any] | None

Get the value for a specific metric. Supports both the new details array format (returning a MetricDetail) and the legacy model_extra format (returning a dict).

Parameters:

metric_name – Name of the metric

Returns:

MetricDetail object, dict, or None if not found

list_metrics() → List[str]

List all available metric names.

Returns:

List of metric names from details array or model_extra keys

get_all_metrics() → Dict[str, MetricDetail | Dict[str, Any]]

Get all metrics as a dictionary.

Returns:

Dictionary mapping metric names to MetricDetail objects or dicts

Usage:

from honeyhive.experiments import get_run_result

result = get_run_result(client, "run-123")
metrics = result.metrics

# Get specific metric (returns MetricDetail with new format)
accuracy = metrics.get_metric("accuracy_evaluator")
if accuracy:
    # Access typed attributes
    print(f"Aggregate: {accuracy.aggregate}")
    print(f"Type: {accuracy.metric_type}")

# List all metrics
metric_names = metrics.list_metrics()

# Get all as dict
all_metrics = metrics.get_all_metrics()
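
# Iterate every metric; entries are MetricDetail objects (new format)
# or plain dicts (legacy model_extra format)
for name, detail in all_metrics.items():
    print(f"{name}: {detail}")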

ExperimentResultSummary

class ExperimentResultSummary

Complete summary of an experiment run with aggregated results.

Attributes:

  • run_id (str) - Unique run identifier

  • status (ExperimentRunStatus) - Current run status

  • success (bool) - Whether run completed successfully

  • passed (List[str]) - List of passed datapoint IDs

  • failed (List[str]) - List of failed datapoint IDs

  • metrics (AggregatedMetrics) - Aggregated evaluation metrics

  • datapoints (List[Any]) - Individual datapoint results

Usage:

from honeyhive.experiments import evaluate, evaluator

@evaluator
def my_evaluator(outputs, inputs, ground_truth):
    return {"score": 0.9}

result = evaluate(
    function=my_function,
    dataset=test_data,
    evaluators=[my_evaluator],
    api_key="key",
    project="project"
)

# Access summary fields
print(f"Run ID: {result.run_id}")
print(f"Status: {result.status}")
print(f"Success: {result.success}")
print(f"Passed: {len(result.passed)}")
print(f"Failed: {len(result.failed)}")

# Access metrics (get_metric returns a MetricDetail in the new format)
score_detail = result.metrics.get_metric("my_evaluator")
if score_detail:
    print(f"Average score: {score_detail.aggregate}")

RunComparisonResult

class RunComparisonResult

Result of comparing two experiment runs.

Attributes:

  • new_run_id (str) - ID of the new run

  • old_run_id (str) - ID of the old run

  • common_datapoints (int) - Count of datapoints in both runs

  • new_only_datapoints (int) - Count of datapoints only in new run

  • old_only_datapoints (int) - Count of datapoints only in old run

  • metric_deltas (Dict[str, Any]) - Per-metric comparison data

Methods:

get_metric_delta(metric_name: str) → Dict[str, Any] | None

Get comparison data for a specific metric.

Parameters:

metric_name – Name of the metric

Returns:

Dict with delta information or None

Returns dict with keys:

  • old_aggregate - Old run’s aggregated value

  • new_aggregate - New run’s aggregated value

  • improved_count - Number of improved datapoints

  • degraded_count - Number of degraded datapoints

  • improved - List of improved datapoint IDs

  • degraded - List of degraded datapoint IDs

list_improved_metrics() → List[str]

List metrics that improved in the new run.

Returns:

List of metric names with improved_count > 0

list_degraded_metrics() → List[str]

List metrics that degraded in the new run.

Returns:

List of metric names with degraded_count > 0

Usage:

from honeyhive.experiments import compare_runs

comparison = compare_runs(
    client=client,
    new_run_id="run-new",
    old_run_id="run-old"
)

# Overview
print(f"Common datapoints: {comparison.common_datapoints}")
print(f"New datapoints: {comparison.new_only_datapoints}")
print(f"Old datapoints: {comparison.old_only_datapoints}")

# Metric analysis
improved = comparison.list_improved_metrics()
degraded = comparison.list_degraded_metrics()

print(f"Improved: {improved}")
print(f"Degraded: {degraded}")

# Detailed metric delta
accuracy_delta = comparison.get_metric_delta("accuracy")
if accuracy_delta:
    print(f"Old: {accuracy_delta['old_aggregate']}")
    print(f"New: {accuracy_delta['new_aggregate']}")
    print(f"Improved datapoints: {len(accuracy_delta['improved'])}")

See Also