Grading
How Trayn evaluates your agent's performance — and how to build custom graders.
How Grading Works
After the agent completes a task (or hits max_steps), the harness submits all actions to the grading service for evaluation.
The default grading service is multimodal — it evaluates agent actions against the task's expected actions and verifiers, using both text and screenshots of the final page state.
GraderAdapter Interface
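The interface definition is not reproduced on this page. As a rough sketch only, an adapter might be shaped like the following. The method name `grade`, the `GradeRequest` fields, and the result shape are all assumptions made for illustration, not the SDK's confirmed signatures.

```typescript
// Hypothetical shapes for illustration; the SDK's real definitions may differ.
interface GradeRequest {
  taskId: string;
  actions: string[];          // everything the agent did, in order
  expectedActions: string[];  // the task's expected actions
  finalScreenshot?: string;   // e.g. a base64 image of the final page state
}

interface StepGrade {
  step: number;
  correct: boolean;
  correction?: string; // what the agent should have done instead
}

interface GradingResult {
  passed: boolean;
  steps: StepGrade[];
}

// Assumed contract: one async method that turns a finished run into a grade.
interface GraderAdapter {
  grade(request: GradeRequest): Promise<GradingResult>;
}

// A trivial adapter that marks every step correct, to show the shape is usable.
const alwaysPass: GraderAdapter = {
  async grade(request) {
    const steps = request.actions.map((_, i) => ({ step: i, correct: true }));
    return { passed: true, steps };
  },
};
```

Keeping the contract to a single asynchronous method means an HTTP-backed grader and a purely local one can be swapped without changes to the harness.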
GradingResult Schema
Validated with Zod for strict boundary typing:
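The schema itself is not shown here, so the following is only a sketch of what Zod validation at the boundary could look like. The field names (`passed`, `steps`, `step`, `correct`) are hypothetical; only `correction` is mentioned elsewhere on this page.

```typescript
import { z } from "zod";

// Hypothetical fields: the SDK's real GradingResult schema may differ.
const StepGradeSchema = z.object({
  step: z.number().int().nonnegative(),
  correct: z.boolean(),
  correction: z.string().min(1).optional(), // only expected on incorrect steps
});

const GradingResultSchema = z.object({
  passed: z.boolean(),
  steps: z.array(StepGradeSchema),
});

// Derive the static type from the schema so the two cannot drift apart.
type GradingResult = z.infer<typeof GradingResultSchema>;

// Parse untrusted data (e.g. an HTTP response body) at the boundary.
// Returns null instead of throwing when the payload does not match.
function parseGradingResult(raw: unknown): GradingResult | null {
  const parsed = GradingResultSchema.safeParse(raw);
  return parsed.success ? parsed.data : null;
}
```

Validating once at the edge means the rest of the harness can rely on the static type without re-checking fields.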
Default: HttpGraderAdapter
The SDK ships with HttpGraderAdapter, which calls the Trayn grading API. The harness configures it automatically when you provide a --url.
Custom Grader Example
Implement GraderAdapter to use your own grading logic:
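The original example is not reproduced here. A minimal sketch of a custom grader, assuming a hypothetical `grade(request)` method and hypothetical request and result shapes, could compare the agent's actions to the expected actions by exact string match:

```typescript
// Hypothetical types: the real GraderAdapter signature may differ.
interface GradeRequest { actions: string[]; expectedActions: string[] }
interface StepGrade { step: number; correct: boolean; correction?: string }
interface GradingResult { passed: boolean; steps: StepGrade[] }
interface GraderAdapter {
  grade(request: GradeRequest): Promise<GradingResult>;
}

// Pure comparison logic, kept synchronous so it is easy to unit test.
function gradeByExactMatch(actions: string[], expected: string[]): GradingResult {
  const steps: StepGrade[] = expected.map((want, i) => {
    const got: string | undefined = actions[i];
    if (got === want) return { step: i, correct: true };
    return {
      step: i,
      correct: false,
      correction: `expected "${want}" but got "${got ?? "no action"}"`,
    };
  });
  // Extra actions beyond the expected list also count as a failure.
  const passed = actions.length === expected.length && steps.every((s) => s.correct);
  return { passed, steps };
}

// The adapter only wraps the pure function in the assumed async contract.
class ExactMatchGrader implements GraderAdapter {
  async grade(request: GradeRequest): Promise<GradingResult> {
    return gradeByExactMatch(request.actions, request.expectedActions);
  }
}
```

Exact matching is deliberately naive. A real grader would likely tolerate equivalent actions, but the structure (a pure scoring function behind a thin adapter) stays the same.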
Dependency Injection
getGrader() throws if no grader has been configured. Always call setGrader() before running the harness, or use hasGrader() to check.
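The SDK's own implementation is not shown on this page. The sketch below re-creates the described behavior locally, so the three functions here are stand-ins that mirror the documented names, and the `GraderAdapter` type is a hypothetical placeholder.

```typescript
// Hypothetical minimal adapter type; the SDK's real interface may differ.
interface GraderAdapter {
  grade(request: unknown): Promise<unknown>;
}

// Module-level slot holding the active grader, if any.
let currentGrader: GraderAdapter | undefined;

function setGrader(grader: GraderAdapter): void {
  currentGrader = grader;
}

function hasGrader(): boolean {
  return currentGrader !== undefined;
}

function getGrader(): GraderAdapter {
  // Fail loudly rather than return undefined and crash later in the run.
  if (currentGrader === undefined) {
    throw new Error("No grader configured: call setGrader() first");
  }
  return currentGrader;
}

// Defensive usage: check before fetching instead of catching the throw.
function graderOrNull(): GraderAdapter | null {
  return hasGrader() ? getGrader() : null;
}
```

Throwing from `getGrader()` surfaces a missing configuration at startup, while `hasGrader()` gives callers a way to branch without a try/catch.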
Corrections
For incorrect steps, the grader produces an actionable correction field describing what the agent should have done instead. This flows through the full memory pipeline:
- Grading: the grader generates a correction for each incorrect step.
- Storage: the correction is stored alongside the step memory.
- Retrieval: on subsequent reps, the agent sees:
AVOID: clicked wrong dropdown — DO INSTEAD: use the priority dropdown in the sidebar
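The storage and retrieval stages above can be sketched with an in-memory list. The `StepMemory` record and both function names are hypothetical and not the SDK's storage schema. Only the `AVOID: ... DO INSTEAD: ...` hint format is taken from this page.

```typescript
// Hypothetical record: field names are illustrative only.
interface StepMemory {
  taskId: string;
  action: string;      // what the agent did
  correct: boolean;
  correction?: string; // what it should have done, set by the grader
}

const memories: StepMemory[] = [];

// Storage: keep the correction alongside the step it belongs to.
function storeStep(memory: StepMemory): void {
  memories.push(memory);
}

// Retrieval: turn stored corrections for a task into hints for the next rep.
function retrieveHints(taskId: string): string[] {
  return memories
    .filter((m) => m.taskId === taskId && !m.correct && m.correction !== undefined)
    .map((m) => `AVOID: ${m.action} — DO INSTEAD: ${m.correction}`);
}
```

Correct steps and steps from other tasks produce no hints, so the agent only sees guidance tied to mistakes it made on the same task.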
Skipping Grading
Use --grader-endpoint none to skip grading entirely. The agent still runs and step results are logged, but no grades or memories are produced.