honeyhive.experiments.results
Result functions for the experiments module.
This module provides functions to retrieve experiment results from backend endpoints.
CRITICAL: DO NOT compute aggregates client-side! The backend already provides sophisticated aggregation endpoints that compute:

- Pass/fail determination
- Metric aggregations (average, sum, min, max)
- Composite metrics
- Run comparisons with deltas

Backend Endpoints:

- GET /runs/:run_id/result - Get aggregated result
- GET /runs/:run_id/metrics - Get raw metrics
- GET /runs/:new_run_id/compare-with/:old_run_id - Compare runs
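As a rough illustration of how the endpoint paths listed above are shaped, the sketch below builds two of them from run IDs. The helper names `result_path` and `compare_path`, and the client-side check of `aggregate_function`, are hypothetical and not part of the SDK; only the paths and the four aggregation function names come from this page.

```python
from urllib.parse import quote, urlencode

# Aggregation functions named in this reference.
AGGREGATE_FUNCTIONS = ("average", "sum", "min", "max")


def result_path(run_id: str, aggregate_function: str = "average") -> str:
    """Build the path for GET /runs/:run_id/result (hypothetical helper)."""
    if aggregate_function not in AGGREGATE_FUNCTIONS:
        raise ValueError(f"unsupported aggregate_function: {aggregate_function!r}")
    query = urlencode({"aggregate_function": aggregate_function})
    # Percent-encode the run ID so it is safe to place in a URL path segment.
    return f"/runs/{quote(run_id, safe='')}/result?{query}"


def compare_path(new_run_id: str, old_run_id: str) -> str:
    """Build the path for GET /runs/:new_run_id/compare-with/:old_run_id."""
    new_part = quote(new_run_id, safe="")
    old_part = quote(old_run_id, safe="")
    return f"/runs/{new_part}/compare-with/{old_part}"


if __name__ == "__main__":
    print(result_path("run-123", "max"))
    print(compare_path("run-new", "run-old"))
```

In practice the functions documented below issue these requests themselves, so a helper like this is only useful for logging or debugging.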
get_run_result
get_run_result(
client: Any,
run_id: str,
project_id: Optional[str] = None,
aggregate_function: str = "average",
) -> ExperimentResultSummary
Get aggregated experiment result from backend.
Backend Endpoint: GET /runs/:run_id/result?aggregate_function=
The backend computes:

- Pass/fail status for each datapoint
- Metric aggregations (average, sum, min, max)
- Composite metrics
- Overall run status

❌ DO NOT compute these client-side!

✅ Use backend endpoint for all aggregations
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| client | Any | HoneyHive API client | required |
| run_id | str | Experiment run ID | required |
| project_id | Optional[str] | Deprecated and ignored. Project scope is determined by the API key. | None |
| aggregate_function | str | Aggregation function ("average", "sum", "min", "max") | 'average' |
Returns:
| Type | Description |
|---|---|
| ExperimentResultSummary | ExperimentResultSummary with all aggregated metrics |
Raises:
| Type | Description |
|---|---|
| HTTPError | If backend request fails |
| ValueError | If response format is invalid |
Examples:
>>> from honeyhive import HoneyHive
>>> from honeyhive.experiments.results import get_run_result
>>> client = HoneyHive(api_key="...")
>>> result = get_run_result(client, "run-123", None, "average")
>>> result.success
True
>>> result.metrics.get_metric("accuracy")
{'aggregate': 0.85, 'values': [0.8, 0.9, 0.85]}
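Because the backend already returns the aggregate, downstream checks can read it directly instead of recomputing it from `values`. The sketch below assumes a metric dictionary shaped like the example output above (`aggregate` and `values` keys); the helper name `metric_meets_threshold` is hypothetical and not part of the SDK.

```python
from typing import Any, Dict, Optional


def metric_meets_threshold(
    metric: Optional[Dict[str, Any]], threshold: float
) -> bool:
    """Return True if the backend-computed aggregate is at least `threshold`.

    Assumes `metric` looks like {'aggregate': 0.85, 'values': [...]}, as in
    the example output. A missing metric or missing aggregate counts as a miss.
    """
    if not metric or metric.get("aggregate") is None:
        return False
    # Use the aggregate supplied by the backend; do not recompute it locally.
    return float(metric["aggregate"]) >= threshold


if __name__ == "__main__":
    accuracy = {"aggregate": 0.85, "values": [0.8, 0.9, 0.85]}
    print(metric_meets_threshold(accuracy, 0.8))
```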
Source code in src/honeyhive/experiments/results.py
get_run_metrics
Get raw metrics for a run (without aggregation).
Backend Endpoint: GET /runs/:run_id/result (returns metrics in response)
This returns raw metric data without aggregation, useful for:

- Debugging individual datapoint metrics
- Custom aggregation logic (if needed)
- Detailed metric analysis
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| client | Any | HoneyHive API client | required |
| run_id | str | Experiment run ID | required |
| project_id | Optional[str] | Deprecated and ignored. Project scope is determined by the API key. | None |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Raw metrics data from backend |
Examples:
>>> metrics = get_run_metrics(client, "run-123")
>>> metrics["events"]
[{'event_id': '...', 'metrics': {...}}, ...]
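For debugging individual datapoints, it can help to regroup the raw payload by metric name. The sketch below assumes the response has an `events` list whose items carry a `metrics` mapping, as the example output suggests; the exact payload shape is not specified on this page, and the helper name `values_by_metric` is hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def values_by_metric(raw: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Group per-event metric values by metric name.

    Assumes `raw` looks like
    {'events': [{'event_id': ..., 'metrics': {name: value, ...}}, ...]}.
    Events without a `metrics` mapping are skipped.
    """
    grouped: Dict[str, List[Any]] = defaultdict(list)
    for event in raw.get("events", []):
        for name, value in (event.get("metrics") or {}).items():
            grouped[name].append(value)
    return dict(grouped)


if __name__ == "__main__":
    sample = {
        "events": [
            {"event_id": "e1", "metrics": {"accuracy": 0.8}},
            {"event_id": "e2", "metrics": {"accuracy": 0.9, "latency": 1.2}},
            {"event_id": "e3"},
        ]
    }
    print(values_by_metric(sample))
```

This only reshapes the data for inspection; aggregates should still come from `get_run_result`.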
Source code in src/honeyhive/experiments/results.py
compare_runs
compare_runs(
client: Any,
new_run_id: str,
old_run_id: str,
project_id: Optional[str] = None,
aggregate_function: str = "average",
) -> RunComparisonResult
Compare two experiment runs using backend aggregated comparison.
Backend Endpoint: GET /runs/:new_run_id/compare-with/:old_run_id
The backend computes aggregated metrics for both runs and then compares them:

- Common datapoints between runs (by datapoint_id)
- Per-metric improved/degraded/same classification
- Old and new aggregate values for each metric
- Statistical aggregation (average, sum, min, max)

❌ DO NOT compute these client-side!

✅ Use backend endpoint for all comparisons
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| client | Any | HoneyHive API client | required |
| new_run_id | str | New experiment run ID | required |
| old_run_id | str | Old experiment run ID | required |
| project_id | Optional[str] | Deprecated and ignored. Project scope is determined by the API key. | None |
| aggregate_function | str | Aggregation function ("average", "sum", "min", "max") | 'average' |
Returns:
| Type | Description |
|---|---|
| RunComparisonResult | RunComparisonResult with delta calculations |
Examples:
>>> comparison = compare_runs(client, "run-new", "run-old")
>>> comparison.common_datapoints
3
>>> delta = comparison.get_metric_delta("accuracy")
>>> delta
{
'old_aggregate': 0.80,
'new_aggregate': 0.85,
'found_count': 3,
'improved_count': 1,
'degraded_count': 0,
'improved': ['EXT-abc123'],
'degraded': []
}
>>> comparison.list_improved_metrics()
['accuracy', 'error_rate']
>>> comparison.list_degraded_metrics()
[]
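The delta dictionary already carries both backend aggregates, so a change summary only needs to read them. The sketch below assumes a dictionary shaped like the `get_metric_delta` example output above; the helper names `aggregate_change` and `has_regressions` are hypothetical and not part of the SDK.

```python
from typing import Any, Dict, Optional


def aggregate_change(delta: Dict[str, Any]) -> Optional[float]:
    """Return new_aggregate minus old_aggregate, or None if either is missing.

    Both values are the aggregates computed by the backend; nothing is
    re-aggregated here.
    """
    old = delta.get("old_aggregate")
    new = delta.get("new_aggregate")
    if old is None or new is None:
        return None
    return new - old


def has_regressions(delta: Dict[str, Any]) -> bool:
    """Return True if the backend classified any datapoint as degraded."""
    return int(delta.get("degraded_count") or 0) > 0


if __name__ == "__main__":
    delta = {
        "old_aggregate": 0.80,
        "new_aggregate": 0.85,
        "found_count": 3,
        "improved_count": 1,
        "degraded_count": 0,
        "improved": ["EXT-abc123"],
        "degraded": [],
    }
    change = aggregate_change(delta)
    print(f"accuracy changed by {change:+.2f}; regressions: {has_regressions(delta)}")
```

Whether a positive change is an improvement depends on the metric (for example, a lower `error_rate` is better), which is why the backend's improved/degraded classification is the safer signal.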
Source code in src/honeyhive/experiments/results.py