Grading

How Trayn evaluates your agent's performance — and how to build custom graders.

How Grading Works

After the agent completes a task (or hits max_steps), the harness submits all actions to the grading service for evaluation.

flowchart LR
  A["Agent actions"] --> H["Harness"]
  H --> G["GraderAdapter"]
  G --> R["GradingResult"]
  R --> M["store_grades()"]
  M --> S["Memories stored"]

The default grading service is multimodal: it evaluates the agent's actions against the task's expected actions and verifiers, using both text and screenshots of the final page state.

GraderAdapter Interface

import type { GradingResult } from "@trayn/memory";
 
interface GraderAdapter {
  grade(request: GradingRequest): Promise<GradingResult>;
}
 
interface GradingRequest {
  sessionId: string;
  runId: string;
  taskDescription: string;
  expectedActions: Array<{ stepNum: number; raw: string; semanticAction: unknown }>;
  steps: Array<{
    stepId: string;
    stepNum: number;
    action: string;
    semanticAction: unknown;
    url?: string;
    details?: Record<string, unknown> | null;
    clickSeq?: number;
  }>;
  verifiers?: Array<{
    type: "screenshot" | "jmespath";
    description: string;
    expectedState?: string;
    query?: string;
    expected_value?: string | boolean;
  }>;
  host?: string;
}

GradingResult Schema

Grading results are validated with a Zod schema (gradingResultSchema), which gives strict typing at the boundary between the grader and the SDK:

import { gradingResultSchema, stepGradeSchema } from "@trayn/memory";
 
// gradingResultSchema validates:
interface GradingResult {
  taskCompleted: boolean;
  completionConfidence: number;  // 0.0 – 1.0
  overallExplanation: string;
  stepGrades: StepGrade[];
}
 
interface StepGrade {
  stepId?: string;
  stepNumber: number;
  action: string;
  isCorrect: boolean;
  explanation: string;
  correction?: string;
  confidence: number;   // 0.0 – 1.0
}
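Once a result has passed validation, a common next step is to condense it for logging or reporting. The following is a minimal sketch of that idea, not part of the SDK: summarizeGrades and the sample data are hypothetical, and the two interfaces are local copies of the shapes shown above so the snippet compiles on its own.

```typescript
// Local copies of the documented shapes, so this sketch is self-contained.
interface StepGrade {
  stepId?: string;
  stepNumber: number;
  action: string;
  isCorrect: boolean;
  explanation: string;
  correction?: string;
  confidence: number; // 0.0 to 1.0
}

interface GradingResult {
  taskCompleted: boolean;
  completionConfidence: number; // 0.0 to 1.0
  overallExplanation: string;
  stepGrades: StepGrade[];
}

// Hypothetical helper (not an SDK function): counts incorrect steps and
// collects the corrections attached to them.
function summarizeGrades(result: GradingResult) {
  const incorrect = result.stepGrades.filter((g) => !g.isCorrect);
  const corrections: Array<{ stepNumber: number; correction: string }> = [];
  for (const g of incorrect) {
    if (g.correction !== undefined) {
      corrections.push({ stepNumber: g.stepNumber, correction: g.correction });
    }
  }
  return {
    taskCompleted: result.taskCompleted,
    totalSteps: result.stepGrades.length,
    incorrectSteps: incorrect.length,
    corrections,
  };
}

// Made-up sample data, for illustration only.
const sampleResult: GradingResult = {
  taskCompleted: false,
  completionConfidence: 0.8,
  overallExplanation: "The second step used the wrong control.",
  stepGrades: [
    { stepId: "s1", stepNumber: 1, action: "click #new", isCorrect: true,
      explanation: "Opened the form.", confidence: 0.9 },
    { stepId: "s2", stepNumber: 2, action: "click #status", isCorrect: false,
      explanation: "clicked wrong dropdown",
      correction: "use the priority dropdown in the sidebar", confidence: 0.7 },
  ],
};

const summary = summarizeGrades(sampleResult);
```

Because correction is optional, the helper only reports corrections that are actually present, so an incorrect step without one is counted but contributes no entry.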

Default: HttpGraderAdapter

The SDK ships with HttpGraderAdapter, which calls the Trayn grading API. The harness configures it automatically when you pass --url.

Custom Grader Example

Implement GraderAdapter to use your own grading logic:

import type { GraderAdapter, GradingRequest, GradingResult } from "@trayn/memory";
import { setGrader, gradingResultSchema } from "@trayn/memory";
 
class MyGrader implements GraderAdapter {
  async grade(request: GradingRequest): Promise<GradingResult> {
    // Call your grading model / service.
    // myGradingService is a placeholder for your own client; it is not part of the SDK.
    const result = await myGradingService.evaluate(request);
 
    // Validate with Zod schema for strict typing
    return gradingResultSchema.parse(result);
  }
}
 
setGrader(new MyGrader());
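For a grader that needs no external service, a deterministic rule can stand in for a model. The sketch below marks a step correct when its action string equals the raw string of the expected action with the same step number. That comparison rule is an assumption made for illustration (the docs do not say that action and raw share a format), gradeByExactMatch and ExactMatchGrader are hypothetical names, and the interfaces are trimmed local copies of the documented ones so the snippet runs without the SDK.

```typescript
// Trimmed local copies of the documented types (optional fields omitted).
interface ExpectedAction { stepNum: number; raw: string; semanticAction: unknown }
interface Step { stepId: string; stepNum: number; action: string; semanticAction: unknown }
interface GradingRequest {
  sessionId: string;
  runId: string;
  taskDescription: string;
  expectedActions: ExpectedAction[];
  steps: Step[];
}
interface StepGrade {
  stepId?: string;
  stepNumber: number;
  action: string;
  isCorrect: boolean;
  explanation: string;
  correction?: string;
  confidence: number;
}
interface GradingResult {
  taskCompleted: boolean;
  completionConfidence: number;
  overallExplanation: string;
  stepGrades: StepGrade[];
}
interface GraderAdapter {
  grade(request: GradingRequest): Promise<GradingResult>;
}

// Hypothetical rule: exact string match between step.action and expected.raw.
function gradeByExactMatch(request: GradingRequest): GradingResult {
  const expectedByStep = new Map<number, string>();
  for (const e of request.expectedActions) expectedByStep.set(e.stepNum, e.raw);

  const stepGrades = request.steps.map((step): StepGrade => {
    const expected = expectedByStep.get(step.stepNum);
    const isCorrect = expected !== undefined && expected === step.action;
    const grade: StepGrade = {
      stepId: step.stepId,
      stepNumber: step.stepNum,
      action: step.action,
      isCorrect,
      explanation: isCorrect
        ? "Matches the expected action."
        : expected === undefined
          ? "No expected action exists for this step."
          : "Differs from the expected action.",
      confidence: 1, // the rule is deterministic
    };
    // Only attach a correction when there is something concrete to suggest.
    if (!isCorrect && expected !== undefined) grade.correction = `Expected: ${expected}`;
    return grade;
  });

  const correctCount = stepGrades.filter((g) => g.isCorrect).length;
  const taskCompleted =
    stepGrades.length === request.expectedActions.length &&
    correctCount === stepGrades.length;

  return {
    taskCompleted,
    completionConfidence: 1,
    overallExplanation: `${correctCount} of ${stepGrades.length} steps matched exactly.`,
    stepGrades,
  };
}

// Thin adapter so the rule satisfies the GraderAdapter interface.
class ExactMatchGrader implements GraderAdapter {
  async grade(request: GradingRequest): Promise<GradingResult> {
    return gradeByExactMatch(request);
  }
}

// Made-up request, for illustration only.
const sampleRequest: GradingRequest = {
  sessionId: "session-1",
  runId: "run-1",
  taskDescription: "Create a ticket with a title",
  expectedActions: [
    { stepNum: 1, raw: "click #new", semanticAction: null },
    { stepNum: 2, raw: "type #title hello", semanticAction: null },
  ],
  steps: [
    { stepId: "s1", stepNum: 1, action: "click #new", semanticAction: null },
    { stepId: "s2", stepNum: 2, action: "type #body hello", semanticAction: null },
  ],
};

const exactMatchResult = gradeByExactMatch(sampleRequest);
```

Keeping the rule in a plain synchronous function and wrapping it in a small class makes the logic easy to unit test. Exact matching is strict, so a real grader would likely compare semanticAction or tolerate equivalent actions.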

Dependency Injection

import { setGrader, getGrader, hasGrader } from "@trayn/memory";
 
// Set a custom grader
setGrader(myGrader);
 
// Check if a grader is configured
if (hasGrader()) {
  const result = await getGrader().grade(request);
}

getGrader() throws if no grader has been configured. Always call setGrader() before running the harness, or use hasGrader() to check.
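The behavior described above matches a simple module-level registry. The sketch below illustrates that pattern only; it is not the SDK's actual implementation. The three functions are local re-implementations that reuse the SDK's names for readability, and the error message and stub grader are made up.

```typescript
// Minimal stand-in for the adapter interface; request and result types
// are left as unknown because only the registry behavior matters here.
interface GraderAdapter {
  grade(request: unknown): Promise<unknown>;
}

// Module-level slot holding the currently configured grader, if any.
let currentGrader: GraderAdapter | null = null;

function setGrader(grader: GraderAdapter): void {
  currentGrader = grader;
}

function hasGrader(): boolean {
  return currentGrader !== null;
}

function getGrader(): GraderAdapter {
  if (currentGrader === null) {
    // Hypothetical message; the SDK's wording may differ.
    throw new Error("No grader configured. Call setGrader() first.");
  }
  return currentGrader;
}

// Helper for the demonstration: reports whether getGrader() throws right now.
function getGraderThrows(): boolean {
  try {
    getGrader();
    return false;
  } catch {
    return true;
  }
}

// Demonstration: state before and after configuring a stub grader.
const configuredBefore = hasGrader();
const threwBefore = getGraderThrows();

const stubGrader: GraderAdapter = { grade: async (request) => request };
setGrader(stubGrader);

const configuredAfter = hasGrader();
const sameInstance = getGrader() === stubGrader;
```

A single shared slot keeps call sites simple, but it also means the last setGrader() call wins, which is worth remembering in test suites that swap graders between cases.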

Corrections

For each incorrect step, the grader fills in the correction field with an actionable description of what the agent should have done instead. The correction then flows through the full memory pipeline:

Grading — Grader generates correction for each incorrect step
Storage — Correction stored alongside the step memory
Retrieval — On subsequent reps, the agent sees: AVOID: clicked wrong dropdown — DO INSTEAD: use the priority dropdown in the sidebar
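The retrieval hint shown in the last stage can be sketched as a small formatting function. This is an illustration under assumptions, not the SDK's code: formatCorrectionHint is a hypothetical name, and taking the AVOID part from the step's explanation is a guess, since the docs only show the final string. The separator is written as the escape \u2014 to reproduce the dash in the example above.

```typescript
// Local copy of the documented StepGrade shape.
interface StepGrade {
  stepId?: string;
  stepNumber: number;
  action: string;
  isCorrect: boolean;
  explanation: string;
  correction?: string;
  confidence: number;
}

// Hypothetical formatter: returns a hint for incorrect steps that carry a
// correction, and null otherwise (correct steps produce no hint).
function formatCorrectionHint(grade: StepGrade): string | null {
  if (grade.isCorrect || !grade.correction) return null;
  // Assumption: the AVOID text comes from the explanation field.
  return `AVOID: ${grade.explanation} \u2014 DO INSTEAD: ${grade.correction}`;
}

// Made-up grades, for illustration only.
const wrongStep: StepGrade = {
  stepId: "s2",
  stepNumber: 2,
  action: "click #status",
  isCorrect: false,
  explanation: "clicked wrong dropdown",
  correction: "use the priority dropdown in the sidebar",
  confidence: 0.7,
};
const rightStep: StepGrade = {
  stepId: "s1",
  stepNumber: 1,
  action: "click #new",
  isCorrect: true,
  explanation: "Opened the form.",
  confidence: 0.9,
};

const hintForWrongStep = formatCorrectionHint(wrongStep);
const hintForRightStep = formatCorrectionHint(rightStep);
```

Returning null for steps without a correction keeps empty hints out of the agent's prompt on later reps.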

Skipping Grading

Use --grader-endpoint none to skip grading entirely. The agent still runs and step results are logged, but no grades or memories are produced.

trayn --url https://app.trayn.ai/playground/{host}/{sessionId} --grader-endpoint none
