Overview
After an evaluation run completes, the results view provides detailed analysis of pass/fail outcomes, response comparisons, and trend tracking across multiple runs. Use this data to identify regressions, improve prompts, and validate model changes.
Results Overview
Open Run Results
Navigate to Evaluations, select your project, and click on a completed run from the Run History list.
Review Summary Metrics
The results summary shows:
- Overall Pass Rate — Percentage of test cases that passed
- Total Test Cases — Number of test cases executed
- Passed / Failed / Error — Breakdown by outcome
- Average Score — Mean score across all test cases (for semantic scoring)
- Duration — Total run time
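If your platform lets you export run results (for example as JSON), the summary metrics above can be recomputed locally for scripting or reporting. This is a minimal sketch; the field names (`verdict`, `score`, `duration_s`) are assumptions for illustration, not the platform's actual export schema.

```python
from statistics import mean

# Hypothetical exported results: one dict per test case, mirroring the
# fields shown in the results view (field names are assumptions).
results = [
    {"verdict": "pass",  "score": 0.92, "duration_s": 1.4},
    {"verdict": "fail",  "score": 0.41, "duration_s": 2.1},
    {"verdict": "pass",  "score": 0.88, "duration_s": 1.0},
    {"verdict": "error", "score": None, "duration_s": 0.3},
]

total = len(results)
passed = sum(r["verdict"] == "pass" for r in results)
failed = sum(r["verdict"] == "fail" for r in results)
errors = sum(r["verdict"] == "error" for r in results)
scores = [r["score"] for r in results if r["score"] is not None]

summary = {
    "pass_rate": passed / total,                           # Overall Pass Rate
    "total": total,                                        # Total Test Cases
    "passed": passed,
    "failed": failed,
    "errors": errors,
    "average_score": mean(scores),                         # Average Score
    "duration_s": sum(r["duration_s"] for r in results),   # Duration
}
print(summary)
```

Errored cases carry no score here, so the average is computed only over scored cases, matching a semantic-scoring mean.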
Review Individual Test Cases
Click any test case to see its detailed result:
- Input — The prompt sent to the target
- Expected Output — What was expected
- Actual Output — The response the bot or agent actually returned
- Score — The evaluation score
- Verdict — Pass or fail, with an explanation
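For downstream tooling, the per-test-case fields above map naturally onto a small record type. The class below is a hypothetical sketch of that shape, not a type the platform provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TestCaseResult:
    """One detailed result, mirroring the fields in the results view."""
    input: str                         # prompt sent to the target
    expected_output: str               # what was expected
    actual_output: str                 # what the bot/agent returned
    score: float                       # evaluation score
    verdict: str                       # "pass" or "fail"
    explanation: Optional[str] = None  # verdict explanation, if any


# Example record for a passing case.
result = TestCaseResult(
    input="What is your refund policy?",
    expected_output="Refunds within 30 days.",
    actual_output="We offer refunds within 30 days of purchase.",
    score=0.91,
    verdict="pass",
    explanation="Semantically equivalent to the expected answer.",
)
print(result.verdict)
```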
Result Comparison
Compare results across multiple runs to track improvements and regressions:
Select Runs to Compare
From the run history, select two or more runs using the checkboxes, then click Compare.
View Side-by-Side Results
The comparison view shows each test case's outcome across the selected runs in a side-by-side table. Color-coded cells highlight improvements (green), regressions (red), and unchanged results (gray).
Identify Regressions
Filter the comparison to show only regressions — test cases that passed in an earlier run but failed in a later one. These are high-priority items to investigate.
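The same regression filter is easy to reproduce in a script if you have verdicts from two exported runs. This sketch assumes each run is a mapping from a test case ID to its verdict; the IDs and field values are illustrative.

```python
def find_regressions(earlier, later):
    """Return IDs of test cases that passed in the earlier run but fail later."""
    return sorted(
        case_id
        for case_id, verdict in later.items()
        if verdict == "fail" and earlier.get(case_id) == "pass"
    )


# Hypothetical verdicts for two runs of the same suite.
run_a = {"tc-1": "pass", "tc-2": "pass", "tc-3": "fail"}
run_b = {"tc-1": "pass", "tc-2": "fail", "tc-3": "pass"}

print(find_regressions(run_a, run_b))  # → ['tc-2']  (tc-3 improved, not a regression)
```

Swapping the roles of the two runs gives you the improvements instead, which corresponds to the green cells in the comparison view.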
Feedback Logs
Add feedback to individual test case results to track observations and action items:
- Notes — Add text annotations explaining why a test case passed or failed
- Override Verdict — Manually override a pass/fail verdict if the automated scoring was incorrect
- Action Items — Flag test cases for follow-up (prompt improvement, data lake update, etc.)
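If you mirror feedback in your own tracking tools, the three feedback types above can be modeled as one record per test case. This structure and its field names are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feedback:
    """Feedback attached to a single test case result."""
    note: str = ""                           # free-text annotation
    override_verdict: Optional[str] = None   # e.g. "pass" to correct a false fail
    action_items: List[str] = field(default_factory=list)


fb = Feedback(
    note="Scorer penalized a valid paraphrase of the expected answer.",
    override_verdict="pass",
    action_items=["prompt improvement"],
)
print(fb.override_verdict)
```

Keeping the original automated verdict alongside the override preserves an audit trail of where the scoring criteria disagreed with human judgment.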
Trend Tracking
The evaluation project dashboard shows trend charts across all runs:
- Pass Rate Over Time — Line chart showing pass rate progression
- Category Breakdown — Pass rates by test case category/tag
- Model Comparison — Side-by-side pass rates for different models
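The first two charts reduce to the same aggregation: group results by a key (run date or category) and compute the pass rate per group. A minimal sketch, assuming result rows of `(date, category, verdict)` exported from run history; the dates and tags are made up.

```python
from collections import defaultdict


def pass_rates(rows, key_index):
    """Group rows by the column at key_index and return pass rate per group."""
    totals, passes = defaultdict(int), defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += 1
        passes[key] += row[2] == "pass"
    return {key: passes[key] / totals[key] for key in totals}


# Hypothetical run history: (run date, category tag, verdict) per result.
history = [
    ("2024-05-01", "billing",  "pass"),
    ("2024-05-01", "billing",  "fail"),
    ("2024-05-01", "shipping", "pass"),
    ("2024-05-08", "billing",  "pass"),
    ("2024-05-08", "billing",  "pass"),
    ("2024-05-08", "shipping", "fail"),
]

print(pass_rates(history, 0))  # keyed by run date → Pass Rate Over Time
print(pass_rates(history, 1))  # keyed by tag      → Category Breakdown
```

A model comparison is the same aggregation again, keyed by a model identifier column instead of date or tag.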
Tip: Set up a weekly scheduled evaluation and review the trend chart monthly. This gives you a clear picture of your bot's quality trajectory and helps catch regressions early.
Important: Automated scoring is not perfect. Regularly review failed test cases to ensure the scoring criteria are appropriate. Use manual verdict overrides when the automated score does not reflect actual quality.