honeyhive.experiments.models

Extended models for experiments module.

This module provides extended versions of generated models to fix known issues and add experiment-specific functionality.

Models
  • ExperimentRunStatus: Extended status enum with all backend values
  • MetricDatapoints: Pass/fail datapoint IDs for a metric
  • MetricDetail: Individual metric data from backend
  • DatapointResult: Individual datapoint result from backend
  • DatapointMetric: Individual metric for a datapoint
  • AggregatedMetrics: Aggregated metrics model with details array support
  • ExperimentResultSummary: Aggregated experiment result from backend
  • RunComparisonResult: Comparison between two experiment runs

ExperimentRunStatus

Bases: str, Enum

Extended status enum with all backend values.

The generated Status enum only includes 'pending' and 'completed', but the backend supports additional states.

Source code in src/honeyhive/experiments/models.py
class ExperimentRunStatus(str, Enum):
    """
    Extended status enum with all backend values.

    The generated Status enum only includes 'pending' and 'completed',
    but the backend supports additional states.
    """

    PENDING = "pending"
    COMPLETED = "completed"
    RUNNING = "running"
    FAILED = "failed"
    CANCELLED = "cancelled"

PENDING class-attribute instance-attribute

PENDING = 'pending'

COMPLETED class-attribute instance-attribute

COMPLETED = 'completed'

RUNNING class-attribute instance-attribute

RUNNING = 'running'

FAILED class-attribute instance-attribute

FAILED = 'failed'

CANCELLED class-attribute instance-attribute

CANCELLED = 'cancelled'
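
Because ExperimentRunStatus mixes in str, a raw status string from the backend can be converted to a member by value, and members compare equal to plain strings. The sketch below redefines the enum locally so that it runs on its own (in real code it would be imported from honeyhive.experiments.models). The is_terminal helper and its choice of "finished" states are illustrative assumptions, not part of the library.

```python
from enum import Enum


class ExperimentRunStatus(str, Enum):
    """Local copy of the enum above, so the sketch is self-contained."""

    PENDING = "pending"
    COMPLETED = "completed"
    RUNNING = "running"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Hypothetical grouping: which states count as "finished" is an assumption.
TERMINAL_STATES = {
    ExperimentRunStatus.COMPLETED,
    ExperimentRunStatus.FAILED,
    ExperimentRunStatus.CANCELLED,
}


def is_terminal(raw_status: str) -> bool:
    """Return True if the raw backend status string denotes a finished run."""
    # Lookup by value; raises ValueError for a string the enum does not know.
    return ExperimentRunStatus(raw_status) in TERMINAL_STATES


status = ExperimentRunStatus("running")
print(status is ExperimentRunStatus.RUNNING)
print(status == "running")
print(is_terminal("completed"), is_terminal("pending"))
```

An unknown status string raises ValueError at the lookup, which makes a newly introduced backend state visible immediately rather than silently treated as pending.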

MetricDatapoints

Bases: BaseModel

Pass/fail datapoint IDs for a metric.

Source code in src/honeyhive/experiments/models.py
class MetricDatapoints(BaseModel):
    """Pass/fail datapoint IDs for a metric."""

    passed: List[str] = Field(
        default_factory=list, description="Datapoint IDs that passed"
    )
    failed: List[str] = Field(
        default_factory=list, description="Datapoint IDs that failed"
    )

passed class-attribute instance-attribute

passed: List[str] = Field(
    default_factory=list,
    description="Datapoint IDs that passed",
)

failed class-attribute instance-attribute

failed: List[str] = Field(
    default_factory=list,
    description="Datapoint IDs that failed",
)
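
Both fields are declared with default_factory=list, which builds a fresh list for every instance. The sketch below shows the same pattern with a standard-library dataclass as a stand-in, an assumption made so that the example runs without Pydantic. The class name MetricDatapointsSketch is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetricDatapointsSketch:
    """Stand-in for MetricDatapoints using dataclasses instead of Pydantic."""

    passed: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)


first = MetricDatapointsSketch()
second = MetricDatapointsSketch()

# Mutating one instance's list must not leak into the other instance.
first.passed.append("dp-1")

print(first.passed)
print(second.passed)
```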

MetricDetail

Bases: BaseModel

Individual metric data from backend.

This represents a single metric in the metrics.details array returned by the GET /runs/:run_id/result endpoint.

Backend Response Format (per OpenAPI spec):

    {
        "metric_name": "accuracy",
        "metric_type": "numeric",
        "event_name": "llm_call",
        "event_type": "model",
        "aggregate": 0.85,
        "values": [0.8, 0.9, 0.85],
        "datapoints": {
            "passed": ["dp-1", "dp-2"],
            "failed": ["dp-3"]
        }
    }

Source code in src/honeyhive/experiments/models.py
class MetricDetail(BaseModel):
    """
    Individual metric data from backend.

    This represents a single metric in the metrics.details array returned by
    GET /runs/:run_id/result endpoint.

    Backend Response Format (per OpenAPI spec):
    {
        "metric_name": "accuracy",
        "metric_type": "numeric",
        "event_name": "llm_call",
        "event_type": "model",
        "aggregate": 0.85,
        "values": [0.8, 0.9, 0.85],
        "datapoints": {
            "passed": ["dp-1", "dp-2"],
            "failed": ["dp-3"]
        }
    }
    """

    metric_name: str = Field(..., description="Name of the metric")
    metric_type: Optional[str] = Field(
        None, description="Type of metric (numeric, boolean)"
    )
    event_name: Optional[str] = Field(
        None, description="Name of the event this metric is from"
    )
    event_type: Optional[str] = Field(
        None, description="Type of event (model, chain, etc)"
    )
    aggregate: Optional[Union[float, int, bool]] = Field(
        None, description="Aggregated value for this metric"
    )
    values: List[Union[float, int, bool]] = Field(
        default_factory=list, description="Individual values for each datapoint"
    )
    datapoints: Optional[MetricDatapoints] = Field(
        None, description="Pass/fail datapoint IDs for this metric"
    )

metric_name class-attribute instance-attribute

metric_name: str = Field(
    ..., description="Name of the metric"
)

metric_type class-attribute instance-attribute

metric_type: Optional[str] = Field(
    None, description="Type of metric (numeric, boolean)"
)

event_name class-attribute instance-attribute

event_name: Optional[str] = Field(
    None,
    description="Name of the event this metric is from",
)

event_type class-attribute instance-attribute

event_type: Optional[str] = Field(
    None, description="Type of event (model, chain, etc)"
)

aggregate class-attribute instance-attribute

aggregate: Optional[Union[float, int, bool]] = Field(
    None, description="Aggregated value for this metric"
)

values class-attribute instance-attribute

values: List[Union[float, int, bool]] = Field(
    default_factory=list,
    description="Individual values for each datapoint",
)

datapoints class-attribute instance-attribute

datapoints: Optional[MetricDatapoints] = Field(
    None,
    description="Pass/fail datapoint IDs for this metric",
)
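
In the sample response above, aggregate (0.85) is the arithmetic mean of values (0.8, 0.9, 0.85). The sketch below checks that relationship on a plain dict. It assumes the aggregation function is "average"; whether the backend always computes the aggregate this way is not stated here. The matches_average helper is hypothetical.

```python
import math
from statistics import fmean

# Sample values copied from the response format documented above.
detail = {
    "metric_name": "accuracy",
    "aggregate": 0.85,
    "values": [0.8, 0.9, 0.85],
}


def matches_average(metric: dict, rel_tol: float = 1e-9) -> bool:
    """Check whether "aggregate" equals the arithmetic mean of "values"."""
    values = metric.get("values") or []
    aggregate = metric.get("aggregate")
    if not values or aggregate is None:
        return False  # nothing to compare
    # Compare with a tolerance: floating-point means are rarely exact.
    return math.isclose(fmean(values), aggregate, rel_tol=rel_tol)


print(matches_average(detail))
```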

DatapointMetric

Bases: BaseModel

Individual metric for a datapoint.

This represents a single metric in the datapoint.metrics array returned by the GET /runs/:run_id/result endpoint.

Source code in src/honeyhive/experiments/models.py
class DatapointMetric(BaseModel):
    """
    Individual metric for a datapoint.

    This represents a single metric in the datapoint.metrics array returned by
    GET /runs/:run_id/result endpoint.
    """

    name: str = Field(..., description="Name of the metric")
    event_name: Optional[str] = Field(
        None, description="Name of the event this metric is from"
    )
    event_type: Optional[str] = Field(None, description="Type of event")
    value: Optional[Union[float, int, bool]] = Field(
        None, description="Value of the metric"
    )
    passed: Optional[bool] = Field(None, description="Whether this metric passed")

name class-attribute instance-attribute

name: str = Field(..., description='Name of the metric')

event_name class-attribute instance-attribute

event_name: Optional[str] = Field(
    None,
    description="Name of the event this metric is from",
)

event_type class-attribute instance-attribute

event_type: Optional[str] = Field(
    None, description="Type of event"
)

value class-attribute instance-attribute

value: Optional[Union[float, int, bool]] = Field(
    None, description="Value of the metric"
)

passed class-attribute instance-attribute

passed: Optional[bool] = Field(
    None, description="Whether this metric passed"
)

DatapointResult

Bases: BaseModel

Individual datapoint result from backend.

This represents a single datapoint in the datapoints array returned by the GET /runs/:run_id/result endpoint.

Backend Response Format (per OpenAPI spec):

    {
        "datapoint_id": "dp-123",
        "session_id": "sess-456",
        "passed": true,
        "metrics": [
            {"name": "accuracy", "value": 0.85, "passed": true},
            ...
        ]
    }

Source code in src/honeyhive/experiments/models.py
class DatapointResult(BaseModel):
    """
    Individual datapoint result from backend.

    This represents a single datapoint in the datapoints array returned by
    GET /runs/:run_id/result endpoint.

    Backend Response Format (per OpenAPI spec):
    {
        "datapoint_id": "dp-123",
        "session_id": "sess-456",
        "passed": true,
        "metrics": [
            {"name": "accuracy", "value": 0.85, "passed": true},
            ...
        ]
    }
    """

    datapoint_id: Optional[str] = Field(None, description="ID of the datapoint")
    session_id: Optional[str] = Field(None, description="ID of the session")
    passed: Optional[bool] = Field(None, description="Whether this datapoint passed")
    metrics: List[DatapointMetric] = Field(
        default_factory=list, description="Metrics for this datapoint"
    )

    # Required for Pydantic 2.12+ when using extra="allow"
    __pydantic_extra__: Dict[str, Any] = None

    model_config = ConfigDict(extra="allow")

datapoint_id class-attribute instance-attribute

datapoint_id: Optional[str] = Field(
    None, description="ID of the datapoint"
)

session_id class-attribute instance-attribute

session_id: Optional[str] = Field(
    None, description="ID of the session"
)

passed class-attribute instance-attribute

passed: Optional[bool] = Field(
    None, description="Whether this datapoint passed"
)

metrics class-attribute instance-attribute

metrics: List[DatapointMetric] = Field(
    default_factory=list,
    description="Metrics for this datapoint",
)
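
The passed flag is Optional[bool] on both DatapointResult and DatapointMetric, so there are three states to handle: passed, failed, and unknown. The sketch below groups metric names by that flag, working on a plain dict shaped like the documented response. The sample values and the group_by_outcome helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical payload shaped like the documented datapoint format.
datapoint = {
    "datapoint_id": "dp-123",
    "session_id": "sess-456",
    "passed": False,
    "metrics": [
        {"name": "accuracy", "value": 0.85, "passed": True},
        {"name": "latency", "value": 950, "passed": False},
        {"name": "cost", "value": 0.02},  # "passed" omitted -> unknown
    ],
}


def group_by_outcome(result: dict) -> Dict[str, List[str]]:
    """Group metric names by their tri-state "passed" flag."""
    groups: Dict[str, List[str]] = {"passed": [], "failed": [], "unknown": []}
    for metric in result.get("metrics", []):
        flag = metric.get("passed")
        # Compare by identity so a missing flag (None) is not read as False.
        if flag is True:
            groups["passed"].append(metric["name"])
        elif flag is False:
            groups["failed"].append(metric["name"])
        else:
            groups["unknown"].append(metric["name"])
    return groups


print(group_by_outcome(datapoint))
```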

AggregatedMetrics

Bases: BaseModel

Aggregated metrics model for experiment results.

Supports the backend response format with a 'details' array containing MetricDetail objects.

Backend Response Format (per OpenAPI spec):

    {
      "aggregation_function": "average",
      "details": [
        {
          "metric_name": "accuracy",
          "metric_type": "numeric",
          "event_name": "llm_call",
          "event_type": "model",
          "aggregate": 0.85,
          "values": [0.8, 0.9, 0.85],
          "datapoints": {"passed": [...], "failed": [...]}
        },
        ...
      ]
    }

Example

    >>> metrics = AggregatedMetrics(
    ...     aggregation_function="average",
    ...     details=[{"metric_name": "accuracy", "aggregate": 0.85}]
    ... )
    >>> metrics.get_metric("accuracy")
    MetricDetail(metric_name='accuracy', aggregate=0.85, ...)
    >>> metrics.list_metrics()
    ['accuracy']

Source code in src/honeyhive/experiments/models.py
class AggregatedMetrics(BaseModel):
    """
    Aggregated metrics model for experiment results.

    Supports the backend response format with a 'details' array containing
    MetricDetail objects.

    Backend Response Format (per OpenAPI spec):
    {
      "aggregation_function": "average",
      "details": [
        {
          "metric_name": "accuracy",
          "metric_type": "numeric",
          "event_name": "llm_call",
          "event_type": "model",
          "aggregate": 0.85,
          "values": [0.8, 0.9, 0.85],
          "datapoints": {"passed": [...], "failed": [...]}
        },
        ...
      ]
    }

    Example:
        >>> metrics = AggregatedMetrics(
        ...     aggregation_function="average",
        ...     details=[{"metric_name": "accuracy", "aggregate": 0.85}]
        ... )
        >>> metrics.get_metric("accuracy")
        MetricDetail(metric_name='accuracy', aggregate=0.85, ...)
        >>> metrics.list_metrics()
        ['accuracy']
    """

    aggregation_function: Optional[str] = Field(
        None, description="Aggregation function used (average, sum, min, max)"
    )

    details: List[MetricDetail] = Field(
        default_factory=list,
        description="List of metric details from backend",
    )

    # Required for Pydantic 2.12+ when using extra="allow"
    __pydantic_extra__: Dict[str, Any] = None

    # Allow extra fields for backward compatibility with dynamic metric keys
    model_config = ConfigDict(extra="allow")

    def get_metric(
        self, metric_name: str
    ) -> Optional[Union[MetricDetail, Dict[str, Any]]]:
        """
        Get a specific metric by name.

        Supports both the new 'details' array format (returns MetricDetail)
        and the legacy model_extra format (returns dict) for backward compatibility.

        Args:
            metric_name: Name of the metric to retrieve

        Returns:
            MetricDetail object (new format), dict (legacy format), or None if not found

        Example:
            >>> metrics.get_metric("accuracy")
            MetricDetail(metric_name='accuracy', aggregate=0.85, ...)
        """
        # First check the details array (new format)
        for metric in self.details:
            if metric.metric_name == metric_name:
                return metric
        # Fall back to model_extra (legacy format for backward compatibility)
        extra = self.model_extra or {}
        return extra.get(metric_name)

    def list_metrics(self) -> List[str]:
        """
        List all metric names in this result.

        Supports both the new 'details' array format and the legacy model_extra
        format for backward compatibility.

        Returns:
            List of metric names from details array or model_extra keys

        Example:
            >>> metrics.list_metrics()
            ['accuracy', 'latency', 'cost']
        """
        # First check the details array (new format)
        if self.details:
            # pylint: disable=not-an-iterable
            return [metric.metric_name for metric in self.details]
        # Fall back to model_extra (legacy format for backward compatibility)
        extra = self.model_extra or {}
        # Exclude known fields that aren't metrics
        return [k for k in extra.keys() if k not in ("aggregation_function",)]

    def get_all_metrics(self) -> Dict[str, Union[MetricDetail, Dict[str, Any]]]:
        """
        Get all metrics as a dictionary.

        Supports both the new 'details' array format (returns MetricDetail values)
        and the legacy model_extra format (returns dict values) for backward
        compatibility.

        Returns:
            Dictionary mapping metric names to MetricDetail objects or dicts

        Example:
            >>> metrics.get_all_metrics()
            {
                'accuracy': MetricDetail(metric_name='accuracy', aggregate=0.85, ...),
                'latency': MetricDetail(metric_name='latency', aggregate=120.5, ...)
            }
        """
        # First check the details array (new format)
        if self.details:
            # pylint: disable=not-an-iterable
            return {metric.metric_name: metric for metric in self.details}
        # Fall back to model_extra (legacy format for backward compatibility)
        extra = self.model_extra or {}
        # Exclude known fields that aren't metrics
        return {k: v for k, v in extra.items() if k not in ("aggregation_function",)}

aggregation_function class-attribute instance-attribute

aggregation_function: Optional[str] = Field(
    None,
    description="Aggregation function used (average, sum, min, max)",
)

details class-attribute instance-attribute

details: List[MetricDetail] = Field(
    default_factory=list,
    description="List of metric details from backend",
)

get_metric

get_metric(
    metric_name: str,
) -> Optional[Union[MetricDetail, Dict[str, Any]]]

Get a specific metric by name.

Supports both the new 'details' array format (returns MetricDetail) and the legacy model_extra format (returns dict) for backward compatibility.

Parameters:

Name         Type  Description                     Default
metric_name  str   Name of the metric to retrieve  required

Returns:

Type                                           Description
Optional[Union[MetricDetail, Dict[str, Any]]]  MetricDetail object (new format), dict (legacy format), or None if not found

Example

    >>> metrics.get_metric("accuracy")
    MetricDetail(metric_name='accuracy', aggregate=0.85, ...)

Source code in src/honeyhive/experiments/models.py
def get_metric(
    self, metric_name: str
) -> Optional[Union[MetricDetail, Dict[str, Any]]]:
    """
    Get a specific metric by name.

    Supports both the new 'details' array format (returns MetricDetail)
    and the legacy model_extra format (returns dict) for backward compatibility.

    Args:
        metric_name: Name of the metric to retrieve

    Returns:
        MetricDetail object (new format), dict (legacy format), or None if not found

    Example:
        >>> metrics.get_metric("accuracy")
        MetricDetail(metric_name='accuracy', aggregate=0.85, ...)
    """
    # First check the details array (new format)
    for metric in self.details:
        if metric.metric_name == metric_name:
            return metric
    # Fall back to model_extra (legacy format for backward compatibility)
    extra = self.model_extra or {}
    return extra.get(metric_name)
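
Callers of get_metric have to handle three outcomes: a MetricDetail, a legacy dict, or None. A minimal sketch of that dispatch follows. MetricDetailSketch is a hypothetical dataclass stand-in for MetricDetail so that the example runs without Pydantic, and read_aggregate is an illustrative helper, not a library function.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union

Number = Union[float, int, bool]


@dataclass
class MetricDetailSketch:
    """Stand-in carrying only the two attributes used below."""

    metric_name: str
    aggregate: Optional[Number] = None


def read_aggregate(
    metric: Optional[Union[MetricDetailSketch, Dict[str, Any]]],
) -> Optional[Number]:
    """Extract the aggregate value from any of get_metric's return shapes."""
    if metric is None:  # metric not found
        return None
    if isinstance(metric, dict):  # legacy format: plain dict
        return metric.get("aggregate")
    return metric.aggregate  # new format: attribute access


print(read_aggregate(MetricDetailSketch("accuracy", 0.85)))
print(read_aggregate({"aggregate": 120.5}))
print(read_aggregate(None))
```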

list_metrics

list_metrics() -> List[str]

List all metric names in this result.

Supports both the new 'details' array format and the legacy model_extra format for backward compatibility.

Returns:

Type       Description
List[str]  List of metric names from details array or model_extra keys

Example

    >>> metrics.list_metrics()
    ['accuracy', 'latency', 'cost']

Source code in src/honeyhive/experiments/models.py
def list_metrics(self) -> List[str]:
    """
    List all metric names in this result.

    Supports both the new 'details' array format and the legacy model_extra
    format for backward compatibility.

    Returns:
        List of metric names from details array or model_extra keys

    Example:
        >>> metrics.list_metrics()
        ['accuracy', 'latency', 'cost']
    """
    # First check the details array (new format)
    if self.details:
        # pylint: disable=not-an-iterable
        return [metric.metric_name for metric in self.details]
    # Fall back to model_extra (legacy format for backward compatibility)
    extra = self.model_extra or {}
    # Exclude known fields that aren't metrics
    return [k for k in extra.keys() if k not in ("aggregation_function",)]

get_all_metrics

get_all_metrics() -> (
    Dict[str, Union[MetricDetail, Dict[str, Any]]]
)

Get all metrics as a dictionary.

Supports both the new 'details' array format (returns MetricDetail values) and the legacy model_extra format (returns dict values) for backward compatibility.

Returns:

Type                                            Description
Dict[str, Union[MetricDetail, Dict[str, Any]]]  Dictionary mapping metric names to MetricDetail objects or dicts

Example

    >>> metrics.get_all_metrics()
    {
        'accuracy': MetricDetail(metric_name='accuracy', aggregate=0.85, ...),
        'latency': MetricDetail(metric_name='latency', aggregate=120.5, ...)
    }

Source code in src/honeyhive/experiments/models.py
def get_all_metrics(self) -> Dict[str, Union[MetricDetail, Dict[str, Any]]]:
    """
    Get all metrics as a dictionary.

    Supports both the new 'details' array format (returns MetricDetail values)
    and the legacy model_extra format (returns dict values) for backward
    compatibility.

    Returns:
        Dictionary mapping metric names to MetricDetail objects or dicts

    Example:
        >>> metrics.get_all_metrics()
        {
            'accuracy': MetricDetail(metric_name='accuracy', aggregate=0.85, ...),
            'latency': MetricDetail(metric_name='latency', aggregate=120.5, ...)
        }
    """
    # First check the details array (new format)
    if self.details:
        # pylint: disable=not-an-iterable
        return {metric.metric_name: metric for metric in self.details}
    # Fall back to model_extra (legacy format for backward compatibility)
    extra = self.model_extra or {}
    # Exclude known fields that aren't metrics
    return {k: v for k, v in extra.items() if k not in ("aggregation_function",)}
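
Reading the three methods side by side shows an asymmetry: list_metrics and get_all_metrics consult the legacy extras only when details is empty, while get_metric falls back to them whenever the name is missing from details. For a payload that mixes both formats, a metric can therefore be retrievable by name without being listed. The sketch below mirrors that logic on plain dicts (a stand-in, not the Pydantic model). The mixed payload is a hypothetical example.

```python
from typing import Any, Dict, List, Optional

KNOWN_FIELDS = ("aggregation_function", "details")


def extras(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for model_extra: every key that is not a declared field."""
    return {k: v for k, v in payload.items() if k not in KNOWN_FIELDS}


def list_metrics(payload: Dict[str, Any]) -> List[str]:
    details = payload.get("details", [])
    if details:  # a non-empty details array wins outright
        return [d["metric_name"] for d in details]
    return list(extras(payload))


def get_metric(payload: Dict[str, Any], name: str) -> Optional[Any]:
    for d in payload.get("details", []):
        if d["metric_name"] == name:
            return d
    return extras(payload).get(name)  # fallback applies even if details exist


mixed = {
    "details": [{"metric_name": "accuracy", "aggregate": 0.85}],
    "latency": {"aggregate": 120.5},
}

print(list_metrics(mixed))
print(get_metric(mixed, "latency"))
```

Code that iterates over list_metrics() and then calls get_metric() for each name is unaffected; only direct lookups of names outside the details array see the difference.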

ExperimentResultSummary

Bases: BaseModel

Aggregated experiment result from backend.

This model represents the complete result of an experiment run, including pass/fail status, aggregated metrics, and datapoint results.

Retrieved from: GET /runs/:run_id/result

Source code in src/honeyhive/experiments/models.py
class ExperimentResultSummary(BaseModel):
    """
    Aggregated experiment result from backend.

    This model represents the complete result of an experiment run,
    including pass/fail status, aggregated metrics, and datapoint results.

    Retrieved from: GET /runs/:run_id/result
    """

    run_id: str = Field(..., description="Experiment run identifier")

    status: str = Field(
        ..., description="Run status (pending, completed, running, failed, cancelled)"
    )

    success: bool = Field(..., description="Overall success status of the run")

    passed: List[str] = Field(
        default_factory=list, description="List of datapoint IDs that passed"
    )

    failed: List[str] = Field(
        default_factory=list, description="List of datapoint IDs that failed"
    )

    metrics: AggregatedMetrics = Field(
        ..., description="Aggregated metrics from backend"
    )

    datapoints: List[DatapointResult] = Field(
        default_factory=list,
        description="List of datapoint results from backend",
    )

    def print_table(self, run_name: Optional[str] = None) -> None:
        """
        Print evaluation results in a formatted table.

        Displays:
        - Run summary (ID, status, pass/fail counts)
        - Aggregated metrics
        - Per-datapoint details (if available)

        Args:
            run_name: Optional run name to display in table title

        Example:
            >>> result = evaluate(...)
            >>> result.print_table(run_name="My Experiment")
        """
        console = Console()

        # Print header
        title = f"Evaluation Results: {run_name or self.run_id}"
        console.print(f"\n{'=' * 80}")
        console.print(f"[bold yellow]{title}[/bold yellow]")
        console.print(f"{'=' * 80}\n")

        # Print summary
        status_emoji = "✅" if self.success else "❌"
        status_color = "green" if self.success else "red"

        console.print(f"[bold]Run ID:[/bold] {self.run_id}")
        status_text = (
            f"[bold]Status:[/bold] [{status_color}]"
            f"{status_emoji} {self.status}[/{status_color}]"
        )
        console.print(status_text)
        console.print(f"[bold]Passed:[/bold] {len(self.passed)}")
        console.print(f"[bold]Failed:[/bold] {len(self.failed)}")
        console.print()

        # Print aggregated metrics table
        metric_names = self.metrics.list_metrics()  # pylint: disable=no-member

        if metric_names:
            metrics_table = Table(
                title="Aggregated Metrics",
                show_lines=False,
                title_style=Style(color="cyan", bold=True),
            )
            metrics_table.add_column(
                "Metric", justify="left", style="magenta", no_wrap=True
            )
            metrics_table.add_column("Value", justify="right", style="green")
            metrics_table.add_column("Type", justify="center", style="blue")

            for metric_name in sorted(metric_names):
                # pylint: disable=no-member
                metric_data = self.metrics.get_metric(metric_name)
                if metric_data is not None:
                    # Handle both MetricDetail objects (new format) and dicts (legacy)
                    if isinstance(metric_data, MetricDetail):
                        aggregate_value = metric_data.aggregate
                        metric_type = metric_data.metric_type or "unknown"
                    elif isinstance(metric_data, dict):
                        aggregate_value = metric_data.get("aggregate")
                        metric_type = metric_data.get("metric_type", "unknown")
                    else:
                        aggregate_value = None
                        metric_type = "unknown"

                    # Format value based on type
                    if aggregate_value is None:
                        value_str = "N/A"
                    elif isinstance(aggregate_value, float):
                        value_str = f"{aggregate_value:.4f}"
                    else:
                        value_str = str(aggregate_value)

                    metrics_table.add_row(metric_name, value_str, metric_type)

            console.print(metrics_table)
            console.print()

        # Print per-datapoint summary if available
        if self.datapoints:
            datapoints_table = Table(
                title=f"Datapoint Results ({len(self.datapoints)} total)",
                show_lines=False,
                title_style=Style(color="cyan", bold=True),
            )
            datapoints_table.add_column(
                "Datapoint ID", justify="left", style="blue", no_wrap=False
            )
            datapoints_table.add_column(
                "Session ID", justify="left", style="blue", no_wrap=False
            )
            datapoints_table.add_column("Status", justify="center", style="green")

            for datapoint in self.datapoints[:20]:  # Limit to first 20 for display
                dp_id = datapoint.datapoint_id or "N/A"
                session_id = datapoint.session_id or "N/A"
                passed = datapoint.passed

                if passed is True:
                    status = "[green]✅ Passed[/green]"
                elif passed is False:
                    status = "[red]❌ Failed[/red]"
                else:
                    status = "❓ Unknown"

                datapoints_table.add_row(dp_id, session_id, status)

            console.print(datapoints_table)

            if len(self.datapoints) > 20:
                msg = (
                    f"\n[dim](Showing first 20 of "
                    f"{len(self.datapoints)} datapoints)[/dim]"
                )
                console.print(msg)

            console.print()

        console.print(f"{'=' * 80}\n")

run_id class-attribute instance-attribute

run_id: str = Field(
    ..., description="Experiment run identifier"
)

status class-attribute instance-attribute

status: str = Field(
    ...,
    description="Run status (pending, completed, running, failed, cancelled)",
)

success class-attribute instance-attribute

success: bool = Field(
    ..., description="Overall success status of the run"
)

passed class-attribute instance-attribute

passed: List[str] = Field(
    default_factory=list,
    description="List of datapoint IDs that passed",
)

failed class-attribute instance-attribute

failed: List[str] = Field(
    default_factory=list,
    description="List of datapoint IDs that failed",
)

metrics class-attribute instance-attribute

metrics: AggregatedMetrics = Field(
    ..., description="Aggregated metrics from backend"
)

datapoints class-attribute instance-attribute

datapoints: List[DatapointResult] = Field(
    default_factory=list,
    description="List of datapoint results from backend",
)
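
The passed and failed fields hold datapoint IDs, so simple run-level statistics can be derived from their lengths. The sketch below works on a plain dict that uses the field names above. The sample values and the pass_rate helper are hypothetical.

```python
from typing import Any, Dict, Optional

# Hypothetical summary payload using the field names documented above.
summary: Dict[str, Any] = {
    "run_id": "run-001",
    "status": "completed",
    "success": False,
    "passed": ["dp-1", "dp-2", "dp-4"],
    "failed": ["dp-3"],
}


def pass_rate(result: Dict[str, Any]) -> Optional[float]:
    """Share of datapoints that passed, or None if none were evaluated."""
    passed = len(result.get("passed", []))
    failed = len(result.get("failed", []))
    total = passed + failed
    return passed / total if total else None


rate = pass_rate(summary)
print(f"{summary['run_id']}: {rate:.0%} of datapoints passed")
```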

print_table

print_table(run_name: Optional[str] = None) -> None

Print evaluation results in a formatted table.

Displays:
  • Run summary (ID, status, pass/fail counts)
  • Aggregated metrics
  • Per-datapoint details (if available)

Parameters:

Name      Type           Description                                  Default
run_name  Optional[str]  Optional run name to display in table title  None

Example

    >>> result = evaluate(...)
    >>> result.print_table(run_name="My Experiment")

Source code in src/honeyhive/experiments/models.py
def print_table(self, run_name: Optional[str] = None) -> None:
    """
    Print evaluation results in a formatted table.

    Displays:
    - Run summary (ID, status, pass/fail counts)
    - Aggregated metrics
    - Per-datapoint details (if available)

    Args:
        run_name: Optional run name to display in table title

    Example:
        >>> result = evaluate(...)
        >>> result.print_table(run_name="My Experiment")
    """
    console = Console()

    # Print header
    title = f"Evaluation Results: {run_name or self.run_id}"
    console.print(f"\n{'=' * 80}")
    console.print(f"[bold yellow]{title}[/bold yellow]")
    console.print(f"{'=' * 80}\n")

    # Print summary
    status_emoji = "✅" if self.success else "❌"
    status_color = "green" if self.success else "red"

    console.print(f"[bold]Run ID:[/bold] {self.run_id}")
    status_text = (
        f"[bold]Status:[/bold] [{status_color}]"
        f"{status_emoji} {self.status}[/{status_color}]"
    )
    console.print(status_text)
    console.print(f"[bold]Passed:[/bold] {len(self.passed)}")
    console.print(f"[bold]Failed:[/bold] {len(self.failed)}")
    console.print()

    # Print aggregated metrics table
    metric_names = self.metrics.list_metrics()  # pylint: disable=no-member

    if metric_names:
        metrics_table = Table(
            title="Aggregated Metrics",
            show_lines=False,
            title_style=Style(color="cyan", bold=True),
        )
        metrics_table.add_column(
            "Metric", justify="left", style="magenta", no_wrap=True
        )
        metrics_table.add_column("Value", justify="right", style="green")
        metrics_table.add_column("Type", justify="center", style="blue")

        for metric_name in sorted(metric_names):
            # pylint: disable=no-member
            metric_data = self.metrics.get_metric(metric_name)
            if metric_data is not None:
                # Handle both MetricDetail objects (new format) and dicts (legacy)
                if isinstance(metric_data, MetricDetail):
                    aggregate_value = metric_data.aggregate
                    metric_type = metric_data.metric_type or "unknown"
                elif isinstance(metric_data, dict):
                    aggregate_value = metric_data.get("aggregate")
                    metric_type = metric_data.get("metric_type", "unknown")
                else:
                    aggregate_value = None
                    metric_type = "unknown"

                # Format value based on type
                if aggregate_value is None:
                    value_str = "N/A"
                elif isinstance(aggregate_value, float):
                    value_str = f"{aggregate_value:.4f}"
                else:
                    value_str = str(aggregate_value)

                metrics_table.add_row(metric_name, value_str, metric_type)

        console.print(metrics_table)
        console.print()

    # Print per-datapoint summary if available
    if self.datapoints:
        datapoints_table = Table(
            title=f"Datapoint Results ({len(self.datapoints)} total)",
            show_lines=False,
            title_style=Style(color="cyan", bold=True),
        )
        datapoints_table.add_column(
            "Datapoint ID", justify="left", style="blue", no_wrap=False
        )
        datapoints_table.add_column(
            "Session ID", justify="left", style="blue", no_wrap=False
        )
        datapoints_table.add_column("Status", justify="center", style="green")

        for datapoint in self.datapoints[:20]:  # Limit to first 20 for display
            dp_id = datapoint.datapoint_id or "N/A"
            session_id = datapoint.session_id or "N/A"
            passed = datapoint.passed

            if passed is True:
                status = "[green]✅ Passed[/green]"
            elif passed is False:
                status = "[red]❌ Failed[/red]"
            else:
                status = "❓ Unknown"

            datapoints_table.add_row(dp_id, session_id, status)

        console.print(datapoints_table)

        if len(self.datapoints) > 20:
            msg = (
                f"\n[dim](Showing first 20 of "
                f"{len(self.datapoints)} datapoints)[/dim]"
            )
            console.print(msg)

        console.print()

    console.print(f"{'=' * 80}\n")
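The value-formatting and status branches in the method above can be exercised on their own. The sketch below re-implements them as two standalone helpers with the Rich markup left out. `format_aggregate` and `status_label` are hypothetical names introduced here for illustration and are not part of honeyhive.

```python
from typing import Any, Optional


def format_aggregate(value: Any) -> str:
    """Mirror the table's value formatting: None -> N/A, floats -> 4 decimals."""
    if value is None:
        return "N/A"
    if isinstance(value, float):
        return f"{value:.4f}"
    # Ints, bools, and strings fall through to plain str().
    return str(value)


def status_label(passed: Optional[bool]) -> str:
    """Mirror the three-way status check, without Rich markup."""
    if passed is True:
        return "Passed"
    if passed is False:
        return "Failed"
    return "Unknown"


if __name__ == "__main__":
    for sample in (None, 0.85, 3, True):
        print(repr(sample), "->", format_aggregate(sample))
    for flag in (True, False, None):
        print(repr(flag), "->", status_label(flag))
```

Because the checks use `is True` and `is False`, a missing (`None`) `passed` value is reported as unknown rather than as a failure.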

RunComparisonResult

Bases: BaseModel

Comparison between two experiment runs.

This model represents the delta analysis between a new run and an old run, including metric changes and datapoint differences.

Retrieved from: GET /runs/:new_run_id/compare-with/:old_run_id
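The route above takes the new run ID first and the old run ID second. As a minimal sketch of how the two placeholders might be filled in, assuming nothing about the base URL, authentication, or how the SDK itself issues the request (none of which are shown here), a hypothetical helper `comparison_path` could look like this:

```python
from urllib.parse import quote


def comparison_path(new_run_id: str, old_run_id: str) -> str:
    """Fill the :new_run_id and :old_run_id placeholders of the route."""
    # quote(..., safe="") percent-encodes reserved characters, including "/",
    # so an ID cannot change the structure of the path.
    new_part = quote(new_run_id, safe="")
    old_part = quote(old_run_id, safe="")
    return f"/runs/{new_part}/compare-with/{old_part}"


if __name__ == "__main__":
    # Hypothetical run IDs, used only for illustration.
    print(comparison_path("run-new", "run-old"))
```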

Source code in src/honeyhive/experiments/models.py
class RunComparisonResult(BaseModel):
    """
    Comparison between two experiment runs.

    This model represents the delta analysis between a new run and an old run,
    including metric changes and datapoint differences.

    Retrieved from: GET /runs/:new_run_id/compare-with/:old_run_id
    """

    new_run_id: str = Field(..., description="New experiment run identifier")

    old_run_id: str = Field(..., description="Old experiment run identifier")

    common_datapoints: int = Field(
        ..., description="Number of datapoints common to both runs"
    )

    new_only_datapoints: int = Field(
        default=0, description="Number of datapoints only in new run"
    )

    old_only_datapoints: int = Field(
        default=0, description="Number of datapoints only in old run"
    )

    metric_deltas: Dict[str, Any] = Field(
        default_factory=dict, description="Metric name to delta information mapping"
    )

    def get_metric_delta(self, metric_name: str) -> Optional[Dict[str, Any]]:
        """
        Get delta information for a specific metric.

        Args:
            metric_name: Name of the metric

        Returns:
            Delta information including new_value, old_value, delta,
            percent_change, or None if the metric is not in metric_deltas

        Example:
            >>> comparison.get_metric_delta("accuracy")
            {
                'new_value': 0.85,
                'old_value': 0.80,
                'delta': 0.05,
                'percent_change': 6.25
            }
        """
        return self.metric_deltas.get(metric_name)  # pylint: disable=no-member

    def list_improved_metrics(self) -> List[str]:
        """
        List metrics that improved in the new run.

        Returns:
            List of metric names where improved_count > 0
        """
        improved = []
        for (
            metric_name,
            delta_info,
        ) in self.metric_deltas.items():  # pylint: disable=no-member
            if isinstance(delta_info, dict) and delta_info.get("improved_count", 0) > 0:
                improved.append(metric_name)
        return improved

    def list_degraded_metrics(self) -> List[str]:
        """
        List metrics that degraded in the new run.

        Returns:
            List of metric names where degraded_count > 0
        """
        degraded = []
        for (
            metric_name,
            delta_info,
        ) in self.metric_deltas.items():  # pylint: disable=no-member
            if isinstance(delta_info, dict) and delta_info.get("degraded_count", 0) > 0:
                degraded.append(metric_name)
        return degraded

new_run_id class-attribute instance-attribute

new_run_id: str = Field(
    ..., description="New experiment run identifier"
)

old_run_id class-attribute instance-attribute

old_run_id: str = Field(
    ..., description="Old experiment run identifier"
)

common_datapoints class-attribute instance-attribute

common_datapoints: int = Field(
    ...,
    description="Number of datapoints common to both runs",
)

new_only_datapoints class-attribute instance-attribute

new_only_datapoints: int = Field(
    default=0,
    description="Number of datapoints only in new run",
)

old_only_datapoints class-attribute instance-attribute

old_only_datapoints: int = Field(
    default=0,
    description="Number of datapoints only in old run",
)

metric_deltas class-attribute instance-attribute

metric_deltas: Dict[str, Any] = Field(
    default_factory=dict,
    description="Metric name to delta information mapping",
)

get_metric_delta

get_metric_delta(
    metric_name: str,
) -> Optional[Dict[str, Any]]

Get delta information for a specific metric.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| metric_name | str | Name of the metric | required |

Returns:

| Type | Description |
| --- | --- |
| Optional[Dict[str, Any]] | Delta information including new_value, old_value, delta, percent_change, or None if the metric is not in metric_deltas |

Example

>>> comparison.get_metric_delta("accuracy")
{
    'new_value': 0.85,
    'old_value': 0.80,
    'delta': 0.05,
    'percent_change': 6.25
}
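The numbers in this example are consistent with delta = new_value - old_value and percent_change = delta / old_value × 100. The source does not state these formulas, so the sketch below treats them as an assumption and only checks the arithmetic. `compute_delta` is a hypothetical helper, not part of honeyhive.

```python
import math
from typing import Dict


def compute_delta(new_value: float, old_value: float) -> Dict[str, float]:
    """Build a delta dict under the assumed formulas (not confirmed by the source)."""
    delta = new_value - old_value
    # Assumption: percent change is measured relative to the old run's value.
    percent_change = delta / old_value * 100
    return {
        "new_value": new_value,
        "old_value": old_value,
        "delta": delta,
        "percent_change": percent_change,
    }


result = compute_delta(0.85, 0.80)
# Compare with a tolerance: 0.85 - 0.80 is not exactly 0.05 in binary floating point.
print(math.isclose(result["delta"], 0.05))
print(math.isclose(result["percent_change"], 6.25))
```

This sketch would raise `ZeroDivisionError` for an old value of 0. How the backend handles that case is not described here.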

Source code in src/honeyhive/experiments/models.py
def get_metric_delta(self, metric_name: str) -> Optional[Dict[str, Any]]:
    """
    Get delta information for a specific metric.

    Args:
        metric_name: Name of the metric

    Returns:
        Delta information including new_value, old_value, delta,
        percent_change, or None if the metric is not in metric_deltas

    Example:
        >>> comparison.get_metric_delta("accuracy")
        {
            'new_value': 0.85,
            'old_value': 0.80,
            'delta': 0.05,
            'percent_change': 6.25
        }
    """
    return self.metric_deltas.get(metric_name)  # pylint: disable=no-member

list_improved_metrics

list_improved_metrics() -> List[str]

List metrics that improved in the new run.

Returns:

| Type | Description |
| --- | --- |
| List[str] | List of metric names where improved_count > 0 |

Source code in src/honeyhive/experiments/models.py
def list_improved_metrics(self) -> List[str]:
    """
    List metrics that improved in the new run.

    Returns:
        List of metric names where improved_count > 0
    """
    improved = []
    for (
        metric_name,
        delta_info,
    ) in self.metric_deltas.items():  # pylint: disable=no-member
        if isinstance(delta_info, dict) and delta_info.get("improved_count", 0) > 0:
            improved.append(metric_name)
    return improved

list_degraded_metrics

list_degraded_metrics() -> List[str]

List metrics that degraded in the new run.

Returns:

| Type | Description |
| --- | --- |
| List[str] | List of metric names where degraded_count > 0 |

Source code in src/honeyhive/experiments/models.py
def list_degraded_metrics(self) -> List[str]:
    """
    List metrics that degraded in the new run.

    Returns:
        List of metric names where degraded_count > 0
    """
    degraded = []
    for (
        metric_name,
        delta_info,
    ) in self.metric_deltas.items():  # pylint: disable=no-member
        if isinstance(delta_info, dict) and delta_info.get("degraded_count", 0) > 0:
            degraded.append(metric_name)
    return degraded
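Both list methods apply the same filter to `metric_deltas` and differ only in the key they read. The sketch below reproduces that filter on a plain dict so it runs without honeyhive installed. `names_with_positive_count` and the sample payload are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List


def names_with_positive_count(metric_deltas: Dict[str, Any], key: str) -> List[str]:
    """Return metric names whose delta entry is a dict with entry[key] > 0."""
    return [
        name
        for name, info in metric_deltas.items()
        if isinstance(info, dict) and info.get(key, 0) > 0
    ]


# Hypothetical payload: the metric names and counts are made up for illustration.
sample_deltas: Dict[str, Any] = {
    "accuracy": {"improved_count": 3, "degraded_count": 0},
    "latency": {"improved_count": 0, "degraded_count": 2},
    "toxicity": {"improved_count": 1, "degraded_count": 1},
    "legacy_metric": 0.42,  # not a dict, so both filters skip it
}

improved = names_with_positive_count(sample_deltas, "improved_count")
degraded = names_with_positive_count(sample_deltas, "degraded_count")
print(improved)
print(degraded)
```

The two lists are not mutually exclusive: a metric whose entry has both counts above zero, such as `toxicity` in the sample, appears in both. Entries that are not dicts are skipped by both filters.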