Data Models

Pydantic models for experiment runs, results, and comparisons.

ExperimentRunStatus

class ExperimentRunStatus

Enum representing the status of an experiment run.

Values:

  • PENDING - Run created but not started

  • RUNNING - Currently executing

  • COMPLETED - Finished successfully

  • FAILED - Execution failed

  • CANCELLED - Manually cancelled

Usage:

from honeyhive.experiments import ExperimentRunStatus

# result is an ExperimentResultSummary, e.g. from evaluate() or get_run_result()
if result.status == ExperimentRunStatus.COMPLETED:
    print("Experiment finished!")

MetricDatapoints

class MetricDatapoints

Model for tracking passed/failed datapoint IDs per metric.

Attributes:

  • passed (List[str]) - List of datapoint IDs that passed this metric

  • failed (List[str]) - List of datapoint IDs that failed this metric
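
Usage:

A minimal sketch, assuming a run result fetched with get_run_result and a metric returned in the new details format (MetricDatapoints is reached through a MetricDetail's datapoints field, described below):

from honeyhive.experiments import get_run_result

result = get_run_result(client, "run-123")

accuracy = result.metrics.get_metric("accuracy_evaluator")
if accuracy and accuracy.datapoints:
    # passed / failed hold the datapoint IDs for this metric
    print(f"Passed: {len(accuracy.datapoints.passed)} datapoints")
    print(f"Failed: {accuracy.datapoints.failed}")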

MetricDetail

class MetricDetail

Detailed information about a single metric result.

Attributes:

  • metric_name (str) - Name of the metric

  • metric_type (Optional[str]) - Type of metric (“numeric”, “boolean”, etc.)

  • event_name (Optional[str]) - Name of the event that generated this metric

  • event_type (Optional[str]) - Type of event (“model”, “tool”, etc.)

  • aggregate (Optional[Union[float, int, bool, str]]) - Aggregated value across all datapoints

  • values (Optional[List[Any]]) - Individual values per datapoint

  • datapoints (Optional[MetricDatapoints]) - Passed/failed datapoint tracking

Usage:

from honeyhive.experiments import get_run_result

result = get_run_result(client, "run-123")

# Get a specific metric detail
accuracy = result.metrics.get_metric("accuracy_evaluator")
if accuracy:
    print(f"Aggregate: {accuracy.aggregate}")
    print(f"Type: {accuracy.metric_type}")

DatapointMetric

class DatapointMetric

Individual metric value for a single datapoint.

Attributes:

  • name (str) - Name of the metric

  • event_name (Optional[str]) - Name of the event

  • event_type (Optional[str]) - Type of event

  • value (Optional[Union[float, int, bool, str]]) - Metric value for this datapoint

  • passed (Optional[bool]) - Whether this metric passed for this datapoint
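
Usage:

A minimal sketch, assuming a run result fetched with get_run_result; DatapointMetric objects appear in each DatapointResult's metrics list (described below):

from honeyhive.experiments import get_run_result

result = get_run_result(client, "run-123")

# Collect the metrics that failed for each datapoint
for datapoint in result.datapoints:
    failing = [m.name for m in (datapoint.metrics or []) if m.passed is False]
    if failing:
        print(f"{datapoint.datapoint_id} failed: {failing}")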

DatapointResult

class DatapointResult

Result for a single datapoint in an experiment run.

Attributes:

  • datapoint_id (Optional[str]) - Unique identifier for the datapoint

  • session_id (Optional[str]) - Session ID associated with this datapoint

  • passed (Optional[bool]) - Whether all metrics passed for this datapoint

  • metrics (Optional[List[DatapointMetric]]) - Individual metric results

Usage:

from honeyhive.experiments import get_run_result

result = get_run_result(client, "run-123")

for datapoint in result.datapoints:
    print(f"Datapoint: {datapoint.datapoint_id}")
    print(f"Passed: {datapoint.passed}")
    if datapoint.metrics:
        for metric in datapoint.metrics:
            print(f"  {metric.name}: {metric.value}")

AggregatedMetrics

class AggregatedMetrics

Aggregated experiment metrics, with support for both the new details array format and the legacy model_extra format for backward compatibility.

Attributes:

  • aggregation_function (Optional[str]) - Aggregation method used (“average”, “sum”, etc.)

  • details (List[MetricDetail]) - List of metric details from backend (new format)

Methods:

get_metric(metric_name: str) → MetricDetail | Dict[str, Any] | None

Get the value for a specific metric. Supports both the new details array format (returning a MetricDetail) and the legacy model_extra format (returning a dict).

Parameters:

metric_name – Name of the metric

Returns:

MetricDetail object, dict, or None if not found

list_metrics() → List[str]

List all available metric names.

Returns:

List of metric names from details array or model_extra keys

get_all_metrics() → Dict[str, MetricDetail | Dict[str, Any]]

Get all metrics as a dictionary.

Returns:

Dictionary mapping metric names to MetricDetail objects or dicts

Usage:

from honeyhive.experiments import get_run_result

result = get_run_result(client, "run-123")
metrics = result.metrics

# Get specific metric (returns MetricDetail with new format)
accuracy = metrics.get_metric("accuracy_evaluator")
if accuracy:
    # Access typed attributes
    print(f"Aggregate: {accuracy.aggregate}")
    print(f"Type: {accuracy.metric_type}")

# List all metrics
metric_names = metrics.list_metrics()

# Get all as dict
all_metrics = metrics.get_all_metrics()
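
# Iterate every metric; entries are MetricDetail objects (new format)
# or plain dicts (legacy model_extra format)
for name, detail in all_metrics.items():
    print(f"{name}: {detail}")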

ExperimentResultSummary

class ExperimentResultSummary

Complete summary of an experiment run with aggregated results.

Attributes:

  • run_id (str) - Unique run identifier

  • status (ExperimentRunStatus) - Current run status

  • success (bool) - Whether run completed successfully

  • passed (List[str]) - List of passed datapoint IDs

  • failed (List[str]) - List of failed datapoint IDs

  • metrics (AggregatedMetrics) - Aggregated evaluation metrics

  • datapoints (List[Any]) - Individual datapoint results

Usage:

from honeyhive.experiments import evaluate, evaluator

@evaluator
def my_evaluator(outputs, inputs, ground_truth):
    return {"score": 0.9}

result = evaluate(
    function=my_function,
    dataset=test_data,
    evaluators=[my_evaluator],
    api_key="key",
    project="project"
)

# Access summary fields
print(f"Run ID: {result.run_id}")
print(f"Status: {result.status}")
print(f"Success: {result.success}")
print(f"Passed: {len(result.passed)}")
print(f"Failed: {len(result.failed)}")

# Access metrics (get_metric returns a MetricDetail in the new format)
score_detail = result.metrics.get_metric("my_evaluator")
if score_detail:
    print(f"Average score: {score_detail.aggregate}")

RunComparisonResult

class RunComparisonResult

Result of comparing two experiment runs.

Attributes:

  • new_run_id (str) - ID of the new run

  • old_run_id (str) - ID of the old run

  • common_datapoints (int) - Count of datapoints in both runs

  • new_only_datapoints (int) - Count of datapoints only in new run

  • old_only_datapoints (int) - Count of datapoints only in old run

  • metric_deltas (Dict[str, Any]) - Per-metric comparison data

Methods:

get_metric_delta(metric_name: str) → Dict[str, Any] | None

Get comparison data for a specific metric.

Parameters:

metric_name – Name of the metric

Returns:

Dict with delta information or None

Returns dict with keys:

  • old_aggregate - Old run’s aggregated value

  • new_aggregate - New run’s aggregated value

  • improved_count - Number of improved datapoints

  • degraded_count - Number of degraded datapoints

  • improved - List of improved datapoint IDs

  • degraded - List of degraded datapoint IDs

list_improved_metrics() → List[str]

List metrics that improved in the new run.

Returns:

List of metric names with improved_count > 0

list_degraded_metrics() → List[str]

List metrics that degraded in the new run.

Returns:

List of metric names with degraded_count > 0

Usage:

from honeyhive.experiments import compare_runs

comparison = compare_runs(
    client=client,
    new_run_id="run-new",
    old_run_id="run-old"
)

# Overview
print(f"Common datapoints: {comparison.common_datapoints}")
print(f"New datapoints: {comparison.new_only_datapoints}")
print(f"Old datapoints: {comparison.old_only_datapoints}")

# Metric analysis
improved = comparison.list_improved_metrics()
degraded = comparison.list_degraded_metrics()

print(f"Improved: {improved}")
print(f"Degraded: {degraded}")

# Detailed metric delta
accuracy_delta = comparison.get_metric_delta("accuracy")
if accuracy_delta:
    print(f"Old: {accuracy_delta['old_aggregate']}")
    print(f"New: {accuracy_delta['new_aggregate']}")
    print(f"Improved datapoints: {len(accuracy_delta['improved'])}")

See Also