LLM-as-Judge
Using a second AI model to evaluate the quality of a primary agent's output.
Why it matters
An LLM-as-Judge can automatically evaluate thousands of agent outputs for accuracy and safety, scaling quality checks far beyond what manual review can cover.
In practice
Our QA Judge subagent validates each feature against boolean pass/fail criteria defined in the product requirements document (PRD).
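The pattern above, asking a judge model for a boolean PASS/FAIL verdict on each criterion, can be sketched as follows. This is a minimal illustration, not the actual QA Judge implementation: the prompt wording and the `call_llm` hook are assumptions, and any model client can be plugged in for the stub.

```python
def build_judge_prompt(output: str, criterion: str) -> str:
    """Prompt the judge model for a boolean PASS/FAIL verdict (wording is illustrative)."""
    return (
        "You are a QA judge. Reply with exactly PASS or FAIL.\n"
        f"Criterion: {criterion}\n"
        f"Agent output: {output}\n"
    )

def judge(output: str, criteria: list[str], call_llm) -> dict[str, bool]:
    """Evaluate one agent output against each pass/fail criterion."""
    results = {}
    for criterion in criteria:
        verdict = call_llm(build_judge_prompt(output, criterion))
        results[criterion] = verdict.strip().upper() == "PASS"
    return results

# Hypothetical stand-in for a real model call, so the sketch runs offline:
# it fails any criterion mentioning "email" and passes the rest.
def fake_llm(prompt: str) -> str:
    return "FAIL" if "email" in prompt else "PASS"

results = judge(
    "Added a login form with client-side validation",
    ["Feature includes a login form", "Feature sends a confirmation email"],
    fake_llm,
)
print(results)
```

Keeping the verdict boolean (rather than a 1-10 score) makes results easy to aggregate and removes ambiguity about what counts as a pass.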
Related terms
Evaluations (Evals)
Systematic testing of agent performance across accuracy, safety, and reliability.
Guardrails
Rules and constraints that prevent an agent from taking harmful or unauthorized actions.
Agent Team
A group of specialized agents that communicate directly with each other and divide work collaboratively.