Reviewing Results

Analyze run history and pass/fail outcomes, compare results across runs, and provide feedback

Overview

After an evaluation run completes, the results view provides detailed analysis of pass/fail outcomes, response comparisons, and trend tracking across multiple runs. Use this data to identify regressions, improve prompts, and validate model changes.

Results Overview

Open Run Results

Navigate to Evaluations, select your project, and click on a completed run from the Run History list.

Review Summary Metrics

The results summary shows:

  • Overall Pass Rate — Percentage of test cases that passed
  • Total Test Cases — Number of test cases executed
  • Passed / Failed / Error — Breakdown by outcome
  • Average Score — Mean score across all test cases (for semantic scoring)
  • Duration — Total run time
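If you export run results for your own reporting, the summary metrics above are straightforward to recompute. This is a minimal sketch, assuming a hypothetical export format where each test case record carries an `outcome` and an optional semantic `score`; field names are illustrative, not the product's actual schema.

```python
from statistics import mean

# Hypothetical exported results: one record per test case.
# Field names ("outcome", "score") are assumptions for illustration.
results = [
    {"outcome": "passed", "score": 0.92},
    {"outcome": "failed", "score": 0.41},
    {"outcome": "passed", "score": 0.88},
    {"outcome": "error",  "score": None},  # errored cases carry no score
]

total = len(results)
passed = sum(1 for r in results if r["outcome"] == "passed")
failed = sum(1 for r in results if r["outcome"] == "failed")
errors = sum(1 for r in results if r["outcome"] == "error")

# Pass rate is taken over all executed cases; the average score
# covers only cases that were actually scored.
pass_rate = passed / total
avg_score = mean(r["score"] for r in results if r["score"] is not None)

print(f"Pass rate: {pass_rate:.0%}, average score: {avg_score:.2f}")
```

Note that errored cases count against the pass rate but are excluded from the average score, which matches the separate Passed / Failed / Error breakdown shown in the summary.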

Review Individual Test Cases

Click on any test case to see the detailed result:

  • Input — The prompt sent to the target
  • Expected Output — What was expected
  • Actual Output — The response the bot or agent actually returned
  • Score — The evaluation score
  • Verdict — Pass or fail with explanation
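For semantic scoring, the score and the verdict are related but distinct: the verdict is typically derived by comparing the score to a pass threshold. A minimal sketch of that relationship, assuming a hypothetical threshold of 0.7 (the actual threshold is configurable per evaluation and may differ):

```python
def verdict(score: float, threshold: float = 0.7) -> str:
    """Map a semantic score to a pass/fail verdict.

    The 0.7 default is an illustrative assumption, not the
    product's built-in threshold.
    """
    return "pass" if score >= threshold else "fail"
```

This is why two test cases with similar outputs can receive different verdicts: one score lands just above the threshold, the other just below.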

Result Comparison

Compare results across multiple runs to track improvements and regressions:

Select Runs to Compare

From the run history, select two or more runs using the checkboxes, then click Compare.

View Side-by-Side Results

The comparison view shows each test case's outcome across the selected runs in a side-by-side table. Color-coded cells highlight improvements (green), regressions (red), and unchanged results (gray).

Identify Regressions

Filter the comparison to show only regressions — test cases that passed in an earlier run but failed in a later one. These are high-priority items to investigate.
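The regression filter above amounts to a simple per-test-case comparison between two runs. A sketch of the logic, assuming hypothetical run exports keyed by test case ID (the IDs and field values here are illustrative):

```python
# Hypothetical outcomes from two runs, keyed by test case ID.
earlier = {"tc-1": "passed", "tc-2": "failed", "tc-3": "passed"}
later   = {"tc-1": "passed", "tc-2": "passed", "tc-3": "failed"}

def classify(old: str, new: str) -> str:
    """Mirror the comparison view's color coding:
    green = improvement, red = regression, gray = unchanged."""
    if old == new:
        return "unchanged"
    return "improvement" if new == "passed" else "regression"

changes = {tc: classify(earlier[tc], later[tc]) for tc in earlier}
regressions = sorted(tc for tc, c in changes.items() if c == "regression")
print(regressions)  # tc-3 passed in the earlier run but failed later
```

Cases flagged as regressions are the ones worth investigating first, since they indicate behavior that used to be correct.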

Feedback Logs

Add feedback to individual test case results to track observations and action items.

Tip: Set up a weekly scheduled evaluation and review the trend chart monthly. This gives you a clear picture of your bot's quality trajectory and helps catch regressions early.
Important: Automated scoring is not perfect. Regularly review failed test cases to ensure the scoring criteria are appropriate. Use manual verdict overrides when the automated score does not reflect actual quality.

Next Steps