Troubleshooting

Common issues and solutions for running experiments.

Slow Experiments

Problem: My experiments take too long

Solutions:

  1. Use Parallel Execution:

result = evaluate(
    function=my_function,
    dataset=dataset,
    max_workers=20,  # Run up to 20 dataset items concurrently
    api_key="your-api-key",
    project="your-project"
)
  2. Start with a Smaller Dataset:

# Test on sample first
result = evaluate(
    function=my_function,
    dataset=dataset[:100],  # First 100 items
    api_key="your-api-key",
    project="your-project"
)
  3. Reduce LLM-as-Judge Evaluators:

LLM-as-judge evaluators add an extra model call for every dataset item, which slows the run and adds cost. Use a cheaper model for the judge, or run fewer judge evaluators per experiment, as in the sketch below.
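For example, a judge running on a smaller model keeps the per-item cost down. This is a minimal sketch, not the only way to do it: it assumes the same @evaluator decorator as the examples above and an OpenAI client, and gpt-4o-mini, the prompt, and the score parsing are illustrative placeholders.

# A minimal sketch of a judge on a cheaper model. Assumes the same
# @evaluator decorator as the examples above and an OpenAI client;
# gpt-4o-mini, the prompt, and the score parsing are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@evaluator()
def cheap_judge(outputs, inputs, ground_truth):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheaper and faster than gpt-4 for simple rubrics
        messages=[
            {"role": "system", "content": "Score the answer from 0 to 1. Reply with the number only."},
            {"role": "user", "content": f"Answer: {outputs}\nExpected: {ground_truth}"},
        ],
        temperature=0.0,
    )
    return {"score": float(response.choices[0].message.content.strip())}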

Evaluator Errors

Problem: My evaluator is throwing errors

Solution: Add Error Handling:

@evaluator()
def robust_evaluator(outputs, inputs, ground_truth):
    try:
        # calculate_score is your own scoring logic
        score = calculate_score(outputs, ground_truth)
        return {"score": score}
    except Exception as e:
        # Fall back to a zero score and record the error instead of
        # letting one bad item abort the whole run
        return {"score": 0.0, "error": str(e)}

Inconsistent Results

Problem: LLM-as-judge gives different scores each time

Solution: Use temperature=0.0:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@evaluator()
def consistent_judge(outputs, inputs, ground_truth):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[...],  # your judge prompt
        temperature=0.0,  # Deterministic sampling
        seed=42           # Reproducible sampling
    )
    # Parse the judge's reply into a numeric score
    # (assumes the prompt asks for a number only)
    score = float(response.choices[0].message.content.strip())
    return score
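Note that temperature=0.0 with a fixed seed gives best-effort reproducibility only, so minor score variation can remain between runs.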

Missing Results

Problem: I don’t see results in the dashboard

Checklist:

  1. Check API key and project name

  2. Verify experiment completed successfully

  3. Wait a few seconds for backend processing

  4. Search the dashboard for the experiment's run_id (see the sketch below)
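If everything above checks out, print the run identifier returned by evaluate and search for it directly. A minimal sketch, assuming the returned result exposes a run_id attribute; check your SDK reference for the exact field name.

# Hypothetical check: the run_id attribute name is an assumption;
# consult your SDK reference for the exact field.
result = evaluate(
    function=my_function,
    dataset=dataset,
    api_key="your-api-key",
    project="your-project"
)
print(result.run_id)  # paste this into the dashboard search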

See Also