Scoring & Evaluation

How our multi-judge system ensures fair and reliable model comparisons

Ensemble Judging

Our scoring system uses multiple judge models to evaluate responses across several dimensions:

Evaluation Criteria

  • Relevance: How well the response addresses the prompt
  • Accuracy: Factual correctness and precision
  • Coherence: Logical structure and flow
  • Helpfulness: Practical utility and value
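To make the ensemble concrete, here is a minimal sketch of how per-criterion scores from several judges could be combined into a single result. The criterion keys, function name, and simple-mean aggregation are illustrative assumptions, not the exact production logic.

```python
from statistics import mean

# Criterion keys are illustrative; they mirror the list above.
CRITERIA = ["relevance", "accuracy", "coherence", "helpfulness"]

def aggregate_scores(judge_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average each criterion across all judges and add an overall mean.

    A hypothetical aggregation; the real system may weight judges or criteria differently.
    """
    per_criterion = {
        criterion: mean(scores[criterion] for scores in judge_scores)
        for criterion in CRITERIA
    }
    per_criterion["overall"] = mean(per_criterion.values())
    return per_criterion

# Example: three judges scoring one response on the 1-10 scale.
judges = [
    {"relevance": 8, "accuracy": 7, "coherence": 9, "helpfulness": 8},
    {"relevance": 9, "accuracy": 8, "coherence": 8, "helpfulness": 9},
    {"relevance": 8, "accuracy": 8, "coherence": 9, "helpfulness": 8},
]
print(aggregate_scores(judges))
```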

Scoring Scale

  • 1-3: Poor quality response
  • 4-6: Adequate but improvable
  • 7-9: Good quality response
  • 10: Exceptional response
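As a small helper, a numeric score can be mapped back to these bands; the sketch below simply restates the scale in code, and the function name is hypothetical.

```python
def quality_band(score: int) -> str:
    """Map a 1-10 score to the quality bands listed above (illustrative helper)."""
    if not 1 <= score <= 10:
        raise ValueError("score must be on the 1-10 scale")
    if score <= 3:
        return "Poor quality response"
    if score <= 6:
        return "Adequate but improvable"
    if score <= 9:
        return "Good quality response"
    return "Exceptional response"
```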

Judge Agreement

To ensure reliability, we measure inter-judge agreement and show confidence intervals:

High Agreement (≥80%)

Judges are aligned on the evaluation. Results are highly reliable.

Moderate Agreement (60-80%)

Some variation between judges. Consider reviewing rationales for insights.

Low Agreement (<60%)

Significant disagreement. May indicate ambiguous criteria or edge cases.
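The precise agreement metric isn't defined on this page; one plausible sketch treats agreement as the share of judge pairs whose scores fall within one point of each other, then maps that ratio onto the tiers above. The function names and the one-point tolerance are assumptions.

```python
from itertools import combinations

def pairwise_agreement(scores: list[float], tolerance: float = 1.0) -> float:
    """Fraction of judge pairs whose scores differ by at most `tolerance`."""
    pairs = list(combinations(scores, 2))
    if not pairs:
        return 1.0  # a single judge trivially agrees with itself
    close = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return close / len(pairs)

def agreement_tier(agreement: float) -> str:
    """Map an agreement ratio to the tiers described above."""
    if agreement >= 0.8:
        return "High Agreement"
    if agreement >= 0.6:
        return "Moderate Agreement"
    return "Low Agreement"

print(agreement_tier(pairwise_agreement([8, 9, 8])))  # -> "High Agreement"
```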

Judge Rationales

Each judge provides a detailed explanation for its scores:

Transparent Evaluation

Understand why a response received a particular score with detailed reasoning.

Learning Opportunity

Use rationales to improve your prompts and understand model behavior.
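For illustration, a judge's output can be thought of as per-criterion scores plus a written rationale. The structure below is a hypothetical shape, not the system's actual schema.

```python
from dataclasses import dataclass

@dataclass
class JudgeResult:
    """Illustrative shape of one judge's evaluation; field names are assumptions."""
    judge_model: str          # identifier of the judge model
    scores: dict[str, int]    # per-criterion scores on the 1-10 scale
    rationale: str            # the judge's written explanation

example = JudgeResult(
    judge_model="judge-a",
    scores={"relevance": 8, "accuracy": 7, "coherence": 9, "helpfulness": 8},
    rationale="Addresses the prompt directly; one factual claim is imprecise.",
)
```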

Disable Judging

You can disable automatic judging at any time if you prefer to evaluate responses manually. This is useful when:

  • You have domain-specific evaluation criteria
  • You want to focus on raw model outputs
  • You're testing experimental prompts
  • You prefer human-only evaluation