Reviewing Results

Analyze run history and pass/fail outcomes, compare results across runs, and provide feedback

Overview

After an evaluation run completes, the results view provides detailed analysis of pass/fail outcomes, response comparisons, and trend tracking across multiple runs. Use this data to identify regressions, improve prompts, and validate model changes.

Results Overview

Open Run Results

Navigate to Evaluations, select your project, and click on a completed run from the Run History list.

Review Summary Metrics

The results summary shows:

  • Overall Pass Rate — Percentage of test cases that passed
  • Total Test Cases — Number of test cases executed
  • Passed / Failed / Error — Breakdown by outcome
  • Average Score — Mean score across all test cases (for semantic scoring)
  • Duration — Total run time
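If you export run results for your own reporting, the summary metrics above are straightforward to recompute. This is a minimal sketch, assuming a hypothetical export format where each test case record carries an `outcome` and an optional semantic `score`; field names are illustrative, not the product's actual schema.

```python
from statistics import mean

# Hypothetical exported results: one record per test case.
# Field names ("outcome", "score") are assumptions for illustration.
results = [
    {"outcome": "passed", "score": 0.92},
    {"outcome": "failed", "score": 0.41},
    {"outcome": "passed", "score": 0.88},
    {"outcome": "error",  "score": None},  # errored cases carry no score
]

total = len(results)
passed = sum(1 for r in results if r["outcome"] == "passed")
failed = sum(1 for r in results if r["outcome"] == "failed")
errors = sum(1 for r in results if r["outcome"] == "error")

# Pass rate is taken over all executed cases; the average score
# covers only cases that were actually scored.
pass_rate = passed / total
avg_score = mean(r["score"] for r in results if r["score"] is not None)

print(f"Pass rate: {pass_rate:.0%}, average score: {avg_score:.2f}")
```

Note that errored cases count against the pass rate but are excluded from the average score, which matches the separate Passed / Failed / Error breakdown shown in the summary.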

Review Individual Test Cases

Click on any test case to see the detailed result:

  • Input — The prompt sent to the target
  • Expected Output — What was expected
  • Actual Output — The response the bot or agent actually returned
  • Score — The evaluation score
  • Verdict — Pass or fail with explanation
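For semantic scoring, the score and the verdict are related but distinct: the verdict is typically derived by comparing the score to a pass threshold. A minimal sketch of that relationship, assuming a hypothetical threshold of 0.7 (the actual threshold is configurable per evaluation and may differ):

```python
def verdict(score: float, threshold: float = 0.7) -> str:
    """Map a semantic score to a pass/fail verdict.

    The 0.7 default is an illustrative assumption, not the
    product's built-in threshold.
    """
    return "pass" if score >= threshold else "fail"
```

This is why two test cases with similar outputs can receive different verdicts: one score lands just above the threshold, the other just below.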

Result Comparison

Compare results across multiple runs to track improvements and regressions:

Select Runs to Compare

From the run history, select two or more runs using the checkboxes, then click Compare.

View Side-by-Side Results

The comparison view shows each test case's outcome across the selected runs in a side-by-side table. Color-coded cells highlight improvements (green), regressions (red), and unchanged results (gray).

Identify Regressions

Filter the comparison to show only regressions — test cases that passed in an earlier run but failed in a later one. These are high-priority items to investigate.
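The regression filter above amounts to a simple per-test-case comparison between two runs. A sketch of the logic, assuming hypothetical run exports keyed by test case ID (the IDs and field values here are illustrative):

```python
# Hypothetical outcomes from two runs, keyed by test case ID.
earlier = {"tc-1": "passed", "tc-2": "failed", "tc-3": "passed"}
later   = {"tc-1": "passed", "tc-2": "passed", "tc-3": "failed"}

def classify(old: str, new: str) -> str:
    """Mirror the comparison view's color coding:
    green = improvement, red = regression, gray = unchanged."""
    if old == new:
        return "unchanged"
    return "improvement" if new == "passed" else "regression"

changes = {tc: classify(earlier[tc], later[tc]) for tc in earlier}
regressions = sorted(tc for tc, c in changes.items() if c == "regression")
print(regressions)  # tc-3 passed in the earlier run but failed later
```

Cases flagged as regressions are the ones worth investigating first, since they indicate behavior that used to be correct.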

Feedback Logs

Add feedback to individual test case results to track observations and action items.

Tip: Set up a weekly scheduled evaluation and review the trend chart monthly. This gives you a clear picture of your bot's quality trajectory and helps catch regressions early.
Important: Automated scoring is not perfect. Regularly review failed test cases to ensure the scoring criteria are appropriate. Use manual verdict overrides when the automated score does not reflect actual quality.

Next Steps