Scoring & Evaluation
How our multi-judge system ensures fair and reliable model comparisons
Ensemble Judging
Our scoring system uses multiple judge models to evaluate each response across four dimensions:
Evaluation Criteria
- Relevance: How well the response addresses the prompt
- Accuracy: Factual correctness and precision
- Coherence: Logical structure and flow
- Helpfulness: Practical utility and value
Scoring Scale
- 1-3: Poor quality response
- 4-6: Adequate but improvable
- 7-9: Good quality response
- 10: Exceptional response
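The exact aggregation pipeline is internal, but the sketch below illustrates the idea: each judge scores a response on every criterion using the 1-10 scale, and per-criterion scores are averaged across judges to produce the ensemble score. The `Criterion`, `JudgeScore`, and `aggregateScores` names are illustrative, not part of a public API.

```typescript
// Illustrative types; names are assumptions, not a documented API.
type Criterion = "relevance" | "accuracy" | "coherence" | "helpfulness";

interface JudgeScore {
  judge: string;                     // e.g. "judge-a"
  scores: Record<Criterion, number>; // each score is on the 1-10 scale
}

// Average each criterion across all judges to produce an ensemble score.
function aggregateScores(judgeScores: JudgeScore[]): Record<Criterion, number> {
  const criteria: Criterion[] = ["relevance", "accuracy", "coherence", "helpfulness"];
  const result = {} as Record<Criterion, number>;
  for (const criterion of criteria) {
    const values = judgeScores.map((j) => j.scores[criterion]);
    result[criterion] = values.reduce((sum, v) => sum + v, 0) / values.length;
  }
  return result;
}

// Example: two judges scoring the same response.
const ensemble = aggregateScores([
  { judge: "judge-a", scores: { relevance: 8, accuracy: 7, coherence: 9, helpfulness: 8 } },
  { judge: "judge-b", scores: { relevance: 7, accuracy: 8, coherence: 8, helpfulness: 7 } },
]);
// ensemble => { relevance: 7.5, accuracy: 7.5, coherence: 8.5, helpfulness: 7.5 }
```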
Judge Agreement
To ensure reliability, we measure inter-judge agreement and show confidence intervals:
High Agreement (≥80%)
Judges are aligned on the evaluation. Results are highly reliable.
Moderate Agreement (60-80%)
Some variation between judges. Consider reviewing rationales for insights.
Low Agreement (<60%)
Significant disagreement. May indicate ambiguous criteria or edge cases.
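One simple way to think about the agreement bands is shown in the sketch below: agreement is 100% when all judges give identical overall scores and falls as scores diverge, then the percentage is mapped to the bands above. This is an illustration under those assumptions, not necessarily the exact metric used in production.

```typescript
// Sketch of an agreement metric based on average pairwise score difference.
// Assumes overall scores on the 1-10 scale, so the maximum difference is 9.
function judgeAgreement(overallScores: number[]): number {
  if (overallScores.length < 2) return 100;
  let totalDiff = 0;
  let pairs = 0;
  for (let i = 0; i < overallScores.length; i++) {
    for (let j = i + 1; j < overallScores.length; j++) {
      totalDiff += Math.abs(overallScores[i] - overallScores[j]);
      pairs++;
    }
  }
  const meanDiff = totalDiff / pairs;
  return (1 - meanDiff / 9) * 100;
}

// Map the percentage to the bands described above.
function agreementBand(agreement: number): "high" | "moderate" | "low" {
  if (agreement >= 80) return "high";
  if (agreement >= 60) return "moderate";
  return "low";
}

// Example: scores of 8, 8, and 7 give roughly 93% agreement ("high").
console.log(agreementBand(judgeAgreement([8, 8, 7])));
```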
Judge Rationales
Each judge provides detailed explanations for their scores:
Transparent Evaluation
Detailed reasoning shows exactly why a response received a particular score.
Learning Opportunity
Use rationales to improve your prompts and understand model behavior.
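Rationales arrive alongside the scores, so they can be reviewed side by side. The shape below is an assumption for illustration; the actual field names in the results may differ.

```typescript
// Illustrative shape of a single judge's result; field names are assumptions.
interface JudgeResult {
  judge: string;
  scores: Record<string, number>;
  rationale: string; // the judge's written explanation for its scores
}

// Print each judge's scores next to its rationale for side-by-side review.
function printRationales(results: JudgeResult[]): void {
  for (const result of results) {
    const summary = Object.entries(result.scores)
      .map(([criterion, score]) => `${criterion}: ${score}`)
      .join(", ");
    console.log(`${result.judge} (${summary})\n  ${result.rationale}`);
  }
}
```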
Disable Judging
You can disable automatic judging at any time if you prefer to evaluate responses manually. This is useful when:
- You have domain-specific evaluation criteria
- You want to focus on raw model outputs
- You're testing experimental prompts
- You prefer human-only evaluation
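If you drive comparisons programmatically, disabling judging typically amounts to a single flag on the run configuration. The snippet below is a hypothetical sketch; the real option and field names may differ from `autoJudge` and `ComparisonRunConfig`.

```typescript
// Hypothetical run configuration; the real option names may differ.
interface ComparisonRunConfig {
  prompt: string;
  models: string[];
  autoJudge: boolean; // set to false to skip automatic judging
}

const manualRun: ComparisonRunConfig = {
  prompt: "Summarize the attached report in three bullet points.",
  models: ["model-a", "model-b"],
  autoJudge: false, // evaluate the responses manually instead
};
```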