# Troubleshooting

Common issues and solutions for running experiments.

## Slow Experiments

**Problem:** My experiments take too long.

**Solutions:**

**Use Parallel Execution:**
```python
result = evaluate(
    function=my_function,
    dataset=dataset,
    max_workers=20,  # Process 20 items at once
    api_key="your-api-key",
    project="your-project",
)
```
**Start with a Smaller Dataset:**

```python
# Test on a sample first
result = evaluate(
    function=my_function,
    dataset=dataset[:100],  # First 100 items
    api_key="your-api-key",
    project="your-project",
)
```
**Reduce LLM-as-Judge Evaluators:**

LLM-as-judge evaluators add an extra model call per dataset item, which makes them slow and costly. Use a cheaper judge model or fewer judge evaluators, as in the sketch below.
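A minimal sketch of a judge that calls a smaller model; the model name, prompt, and scoring scheme here are illustrative assumptions, not a recommendation from this library:

```python
from openai import OpenAI

client = OpenAI()

@evaluator()
def cheap_judge(outputs, inputs, ground_truth):
    # Use a smaller, cheaper model for the judge call
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed cheaper judge model; swap for your choice
        messages=[
            {"role": "system", "content": "Score the answer from 0 to 1. Reply with only the number."},
            {"role": "user", "content": f"Question: {inputs}\nAnswer: {outputs}\nReference: {ground_truth}"},
        ],
        temperature=0.0,
    )
    return {"score": float(response.choices[0].message.content.strip())}
```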
## Evaluator Errors

**Problem:** My evaluator is throwing errors.

**Solution:** Add error handling:
```python
@evaluator()
def robust_evaluator(outputs, inputs, ground_truth):
    try:
        score = calculate_score(outputs, ground_truth)
        return {"score": score}
    except Exception as e:
        # Fall back to a zero score and record the error instead of crashing the run
        return {"score": 0.0, "error": str(e)}
```
## Inconsistent Results

**Problem:** LLM-as-judge gives different scores each time.

**Solution:** Use `temperature=0.0` (and, where the model supports it, a fixed `seed`):
```python
from openai import OpenAI

client = OpenAI()

@evaluator()
def consistent_judge(outputs, inputs, ground_truth):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[...],  # your judge prompt
        temperature=0.0,  # Deterministic sampling
        seed=42,  # Best-effort reproducibility across runs
    )
    # Parse the judge's reply into a numeric score
    score = float(response.choices[0].message.content.strip())
    return score
```
## Missing Results

**Problem:** I don't see results in the dashboard.

**Checklist:**

- Check your API key and project name
- Verify the experiment completed successfully
- Wait a few seconds for backend processing
- Search for the `run_id` in the dashboard
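If everything looks correct, a quick sanity check is to print the identifier returned by `evaluate` and look it up in the dashboard. The `run_id` attribute name below is an assumption; check the fields on your result object:

```python
result = evaluate(
    function=my_function,
    dataset=dataset,
    api_key="your-api-key",
    project="your-project",
)

# Print the identifier to search for in the dashboard.
# `run_id` is assumed here; inspect `result` for the actual field name.
print(getattr(result, "run_id", result))
```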
## See Also

- Running Experiments - Core workflows
- Best Practices - Evaluation strategies