Running Evaluations

Execute manual and scheduled evaluation runs with configurable parameters

Overview

Evaluation runs execute your test cases against a bot or agent and produce pass/fail results. Runs can be triggered manually for on-demand testing or scheduled via cron for continuous quality monitoring.

Manual Evaluation Run

Open the Evaluation Project

Navigate to Evaluations and select the project containing your test cases.

Configure the Run

Click New Run and configure the run parameters:

  • Target — Select the bot or agent to evaluate
  • Test Case Filter — Optionally filter by tags to run a subset of test cases
  • Model Override — Optionally override the target's default model for A/B testing
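As a rough illustration of how these three parameters shape a run, here is a minimal sketch in Python. The `RunConfig` class and `select_cases` helper are hypothetical names invented for this example, not part of the product; they only show how a tag filter would narrow the set of test cases.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical model of the three run parameters described above.
@dataclass
class RunConfig:
    target: str                                           # bot or agent to evaluate
    tag_filter: list[str] = field(default_factory=list)   # run only matching test cases
    model_override: Optional[str] = None                  # swap in a different model for A/B tests

def select_cases(cases: list[dict], config: RunConfig) -> list[dict]:
    """Keep test cases whose tags intersect the filter; no filter means all cases run."""
    if not config.tag_filter:
        return cases
    return [c for c in cases if set(c.get("tags", [])) & set(config.tag_filter)]
```

With a filter of `["smoke"]`, only test cases tagged "smoke" are included; leaving the filter empty runs the full suite.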

Start the Run

Click Run to begin execution. Each test case is sent to the target sequentially.

Monitor Progress

The run dashboard shows real-time progress: total test cases, completed, passed, and failed. Individual test results stream in as they complete.

View Results

When the run completes, the results summary shows overall pass rate, individual test case outcomes, and any errors. See Reviewing Results for detailed analysis.

Scheduled Evaluation Runs

Open Evaluation Settings

In the evaluation project, go to Settings > Schedule.

Set the Cron Schedule

Enter a cron expression for the run frequency (e.g., 0 9 * * MON for every Monday at 9 AM).
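To make the five cron fields (minute, hour, day of month, month, day of week) concrete, the sketch below checks whether a timestamp matches a simple expression like 0 9 * * MON. It handles only literal values, *, and weekday names; real cron implementations also support ranges, steps, and lists. The `cron_matches` function is illustrative only, not how the platform evaluates schedules.

```python
from datetime import datetime

# Cron numbers weekdays Sunday=0..Saturday=6; names map onto that convention.
DOW_NAMES = {"SUN": 0, "MON": 1, "TUE": 2, "WED": 3, "THU": 4, "FRI": 5, "SAT": 6}

def field_matches(field: str, value: int, names=None) -> bool:
    """Match one cron field: '*' matches anything; otherwise a number or named day."""
    if field == "*":
        return True
    if names and field.upper() in names:
        return names[field.upper()] == value
    return int(field) == value

def cron_matches(expr: str, dt: datetime) -> bool:
    """True if dt falls on a minute described by the 5-field cron expression."""
    minute, hour, dom, month, dow = expr.split()
    # Python's weekday() is Monday=0..Sunday=6; shift to cron's Sunday=0 convention.
    cron_dow = (dt.weekday() + 1) % 7
    return (field_matches(minute, dt.minute)
            and field_matches(hour, dt.hour)
            and field_matches(dom, dt.day)
            and field_matches(month, dt.month)
            and field_matches(dow, cron_dow, DOW_NAMES))
```

For example, 0 9 * * MON matches 09:00 on any Monday and nothing else.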

Configure Run Parameters

Set the same parameters as a manual run: target, test case filters, and model override.

Enable the Schedule

Toggle the schedule to Enabled and save. Runs will execute automatically at the configured times.

Execution Flow

Each evaluation run follows this sequence:

  1. Initialization — Load test cases, connect to target bot/agent
  2. Execution — Send each test case input to the target and capture the response
  3. Scoring — Compare responses against expected outputs using the configured scoring method
  4. Aggregation — Calculate overall pass rate and generate the results summary
  5. Notification — Send results notification via dashboard and optional email/webhook
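The sequence above can be sketched as a single loop. This is a simplified model, not the platform's implementation: `call_target` and `score` stand in for the platform's connection to the target and its configured scoring method, and step 1 (loading test cases and connecting) is assumed done by the caller.

```python
def run_evaluation(test_cases, call_target, score, notify=print):
    """Steps 2-5 of the flow: execute, score, aggregate, notify."""
    results = []
    # Steps 2 and 3: send each input to the target and score the response.
    # Cases run sequentially, matching the manual-run behavior described above.
    for case in test_cases:
        response = call_target(case["input"])
        passed = score(response, case["expected"])
        results.append({"id": case["id"], "passed": passed})
    # Step 4: aggregate into an overall pass rate.
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    summary = {"total": len(results), "pass_rate": pass_rate, "results": results}
    # Step 5: deliver the summary (dashboard, email, webhook, ...).
    notify(summary)
    return summary
```

A run with two cases where one response matches its expected output would aggregate to a 0.5 pass rate.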

Tip: Use model override to compare the same test cases across different LLM models. This helps you identify which model performs best for your specific use case.

Important: Each test case in an evaluation run consumes LLM tokens. A run with 100 test cases will make 100 API calls to the target model (plus judge calls for semantic scoring). Plan your budget accordingly.
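For budget planning, the arithmetic is simple: target calls plus any judge calls, times average tokens per call, times your model's price. The helper below is a back-of-the-envelope sketch; token counts and pricing are inputs you supply, not values the platform reports through this function.

```python
def estimate_run_cost(num_cases: int,
                      avg_tokens_per_call: int,
                      price_per_1k_tokens: float,
                      judge_calls_per_case: int = 0) -> float:
    """Rough cost estimate: one target call per case, plus optional judge calls."""
    total_calls = num_cases * (1 + judge_calls_per_case)
    return total_calls * avg_tokens_per_call / 1000 * price_per_1k_tokens
```

For example, 100 test cases with one judge call each at 500 average tokens and $0.002 per 1K tokens comes to 200 calls and an estimated $0.20.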

Next Steps