How we measure AI contract quality
A look inside the measurement system behind our generation pipeline, and why quality gates matter more than demos in legal AI.
The hardest problem in AI-native legal tooling is not generating a contract. Any modern language model can produce something that looks like an NDA. The hard problem is knowing, with confidence, whether the contract it produced is actually good — accurate, enforceable, appropriate to the jurisdiction, and free of the quiet errors that turn a two-page document into a future dispute.
At Contracts.io, we treat evaluation as a first-class engineering discipline. The quality of what we ship is a direct function of the quality of how we measure it. This article describes our evaluation methodology: the pipeline stages, the rubrics, the golden set, the LLM-judge layer, and the regression gates that every change to our contract-generation stack has to pass.
Why "it looks right" is not an evaluation
It is tempting to evaluate generated contracts by reading them. This does not scale, and it is surprisingly unreliable. A human reviewer reading an NDA for the tenth time that morning will miss a subtle error in the indemnification clause. A human reviewer who wrote the prompt that produced the output will unconsciously grade their own work on a curve. The only way to know whether a generation pipeline is improving is to measure it with a method that is both repeatable and specific to legal quality.
Our evaluation system is built around three commitments: every change is measured, the measurement is the same across runs, and regressions block release.
Pipeline stages
Contract generation at Contracts.io is not a single prompt. It is a multi-stage pipeline, and we evaluate each stage independently before we evaluate the pipeline end-to-end.
Stage coverage: 5. Intent, template, clause, assembly, and review stages are checked separately.
Critical clauses: 100%. Mandatory pass gates for the clauses most likely to create legal exposure.
Release rule: Block. A regression stops the change until the cause is understood.
- Intent extraction. The user describes what they want. We extract structured intent: document type, parties, jurisdiction, key terms, risk posture. Evaluation at this stage measures whether the extracted intent matches what a careful human reader would have extracted from the same input.
- Template selection. Given the intent, we select a base template and a set of clauses. Evaluation here measures whether the selected template is appropriate for the stated purpose, and whether mandatory clauses for the jurisdiction are present.
- Clause drafting. Each clause is drafted or adapted from the template. Evaluation measures legal accuracy of each clause in isolation — does the indemnity actually indemnify, does the governing-law clause name a real jurisdiction, are defined terms used consistently.
- Document assembly. Clauses are assembled into a full document with numbering, cross-references, and definitions. Evaluation measures internal consistency — do defined terms resolve, do cross-references point at the right clauses, is the numbering correct.
- Review pass. A final pass checks for risk signals, missing clauses, and jurisdiction-specific requirements. Evaluation measures whether the review pass catches known errors and does not raise false alarms on known-good documents.
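As one illustration of stage-level evaluation, the intent-extraction check can be sketched as a field-by-field comparison against what a human reader extracted from the same input. The field names follow the list above, but the `Intent` class, the `intent_field_matches` function, the exact-match scoring, and the sample values are all assumptions made for this sketch, not a description of the production code.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Intent:
    """Structured intent (hypothetical schema based on the fields named above)."""
    document_type: str
    parties: frozenset[str]
    jurisdiction: str
    key_terms: frozenset[str]
    risk_posture: str


def intent_field_matches(extracted: Intent, reference: Intent) -> dict[str, bool]:
    """Return, per field, whether the extracted value equals the reference value."""
    return {
        f.name: getattr(extracted, f.name) == getattr(reference, f.name)
        for f in fields(Intent)
    }


# Made-up example: the human-extracted reference for one request.
reference = Intent(
    document_type="nda",
    parties=frozenset({"Acme Ltd", "Vendor GmbH"}),
    jurisdiction="England and Wales",
    key_terms=frozenset({"2-year term", "mutual"}),
    risk_posture="conservative",
)
# Made-up pipeline output with a deliberately wrong jurisdiction.
extracted = Intent(
    document_type="nda",
    parties=frozenset({"Acme Ltd", "Vendor GmbH"}),
    jurisdiction="Germany",
    key_terms=frozenset({"2-year term", "mutual"}),
    risk_posture="conservative",
)

matches = intent_field_matches(extracted, reference)
accuracy = sum(matches.values()) / len(matches)  # fraction of fields that match
```

Exact matching is the simplest possible criterion. A real rubric would likely need looser comparison for free-text fields such as key terms.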
Rubrics: what "good" means, written down
For each stage, we maintain a rubric that defines what a correct output looks like. The rubric is not a vague checklist. It is a structured document with graded criteria, each scored on a defined scale, with explicit examples of what a 1, a 3, and a 5 look like.
For a clause like governing law, the rubric asks questions such as: Does the clause name a real jurisdiction? Is the jurisdiction appropriate given the parties' locations? Is the "without regard to conflict of laws principles" carve-out present? Is the venue clause consistent with the choice-of-law clause? Each criterion has a weight, and the clause-level score is a weighted average.
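The weighted average can be made concrete with a short sketch. The criterion names, weights, and scores below are invented for illustration. Only the weighted-average rule itself comes from the description above.

```python
def clause_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted average of criterion scores (each on the 1-5 rubric scale)."""
    if scores.keys() != weights.keys():
        raise ValueError("every criterion needs exactly one score and one weight")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical criteria and weights for a governing-law clause.
weights = {
    "names_real_jurisdiction": 4.0,
    "jurisdiction_fits_parties": 3.0,
    "conflict_of_laws_carve_out": 1.0,
    "venue_consistent": 2.0,
}
# Hypothetical scores for one generated clause.
scores = {
    "names_real_jurisdiction": 5,
    "jurisdiction_fits_parties": 4,
    "conflict_of_laws_carve_out": 1,
    "venue_consistent": 5,
}

governing_law_score = clause_score(scores, weights)  # (20 + 12 + 1 + 10) / 10
```

Dividing by the total weight means the weights do not have to sum to 1, so a criterion can be added to the rubric without rescaling the others.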
Rubrics are versioned. When we update a rubric, we rerun the entire evaluation suite against the new version to establish a fresh baseline, so that scores are only ever compared under the same rubric version. Rubric changes are reviewed by our legal team the same way code changes are reviewed by engineers.
The golden set
The center of our evaluation is a curated collection of inputs and expected outputs we call the golden set. Each entry is a realistic user request — an NDA for a vendor engagement, a freelance contract with milestone billing, a services agreement with a European counterparty — paired with a reference document that our legal team has reviewed and approved.
The golden set is the hardest part of an evaluation system to build, and the most valuable. It is the ground truth. When we make a change to the pipeline, we regenerate outputs for every entry in the golden set, score each one against the rubric, and compare the new score distribution against the baseline. If the average score drops, or if any individual entry drops by more than a threshold, the change is blocked until the regression is understood.
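The regenerate-and-score loop might look roughly like the sketch below. The function names, the shape of the golden set (entry id mapped to a user request), and the stand-in `fake_generate` and `fake_score` callables are assumptions for illustration. In practice the scorer would apply the rubric against the approved reference document.

```python
from typing import Callable


def score_golden_set(
    golden_set: dict[str, str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> dict[str, float]:
    """Regenerate and score every golden-set entry.

    golden_set maps entry id -> user request; score receives (entry id, generated text).
    """
    return {
        entry_id: score(entry_id, generate(request))
        for entry_id, request in golden_set.items()
    }


def score_deltas(baseline: dict[str, float], current: dict[str, float]) -> dict[str, float]:
    """Per-entry change versus baseline (negative means the entry got worse)."""
    return {entry_id: current[entry_id] - baseline[entry_id] for entry_id in baseline}


# Toy data and stand-in functions, purely for demonstration.
golden_set = {
    "nda-vendor": "NDA for a vendor engagement",
    "freelance-milestones": "Freelance contract with milestone billing",
}
baseline = {"nda-vendor": 4.5, "freelance-milestones": 4.0}


def fake_generate(request: str) -> str:
    return f"DRAFT: {request}"


def fake_score(entry_id: str, text: str) -> float:
    return 4.5 if "NDA" in text else 3.0


current = score_golden_set(golden_set, fake_generate, fake_score)
deltas = score_deltas(baseline, current)
```

Passing the generator and scorer in as callables keeps the loop identical whether the change under test is a prompt, a model, or a template.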
The golden set grows over time. Every bug report, every user-reported issue, and every interesting edge case we encounter in production becomes a new entry. This is how a quality system compounds: what the pipeline got wrong once, it cannot get wrong again silently.
The LLM-as-judge layer
Some rubric criteria can be checked with deterministic rules — whether a defined term appears in the definitions section, whether cross-references resolve, whether required clauses are present. Those checks run first because they are fast and cheap.
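Two of these rule-based checks can be sketched with regular expressions. The document conventions assumed here (headings written as "<number>. Title" at the start of a line, references written as "Section <number>") and the function names are hypothetical. A real contract format would need its own patterns.

```python
import re


def unresolved_cross_references(document: str) -> list[str]:
    """Return 'Section N' references whose number has no matching heading."""
    headings = set(re.findall(r"^(\d+)\.\s", document, flags=re.MULTILINE))
    references = re.findall(r"\bSection (\d+)\b", document)
    return sorted({ref for ref in references if ref not in headings}, key=int)


def missing_clauses(document: str, required_titles: list[str]) -> list[str]:
    """Return required clause titles that do not appear as a heading."""
    titles = {
        title.strip().lower()
        for title in re.findall(r"^\d+\.\s+(.+)$", document, flags=re.MULTILINE)
    }
    return [t for t in required_titles if t.lower() not in titles]


# Toy document with one dangling reference and one missing required clause.
sample = """1. Definitions
"Confidential Information" has the meaning given in Section 2.
2. Confidentiality
Each party shall protect Confidential Information as set out in Section 4.
3. Governing Law
This Agreement is governed by the laws of England and Wales.
"""

dangling = unresolved_cross_references(sample)
missing = missing_clauses(sample, ["Governing Law", "Limitation of Liability"])
```

Checks like these give the same answer on every run, which makes them a cheap first filter before any model-based scoring.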
Other criteria require judgment: whether a clause reads naturally, whether an explanation is appropriate for the audience, whether the tone is conservative enough for a legal document. For those, we use an LLM-as-judge: a separate model, with its own prompt, that scores the output against the rubric. The judge model is not the same model we use for generation, and its prompts are maintained and versioned independently.
We do not take the judge's word as final. Judge scores are calibrated against human scores on a sample of the golden set each month. When the judge drifts from human evaluators, we investigate — usually the judge prompt needs updating, sometimes the rubric itself does.
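One simple way to quantify drift is to compare judge and human scores on the same sampled items. The statistics chosen here (mean absolute error, signed bias, exact-agreement rate), the function name, and the sample scores are assumptions for illustration. The text above does not say which agreement measure is used.

```python
from statistics import mean


def judge_drift(human: list[int], judge: list[int]) -> dict[str, float]:
    """Summarise disagreement between judge and human scores on the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [j - h for h, j in zip(human, judge)]
    return {
        "mean_abs_error": mean([abs(d) for d in diffs]),
        "bias": mean(diffs),  # positive means the judge scores higher than humans
        "exact_agreement": sum(d == 0 for d in diffs) / len(diffs),
    }


# Made-up scores on the 1-5 rubric scale for five sampled entries.
human_scores = [5, 4, 3, 4, 2]
judge_scores = [5, 5, 3, 4, 4]

drift = judge_drift(human_scores, judge_scores)
```

Tracking bias separately from absolute error helps distinguish a judge that is consistently lenient, which often points at the judge prompt, from one that is merely noisy.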
Regression gates
All of this would be decoration if it did not block bad changes. Every significant change to the contract pipeline — a new prompt, a new model, a new template, a retrieval index update — runs through the full evaluation suite before it is merged. The gate is strict:
- No individual golden-set entry may drop more than a set threshold from baseline.
- The aggregate score, weighted by clause importance, must not regress.
- Critical-clause checks (indemnity, limitation of liability, governing law, execution blocks) must remain at 100% pass.
- Jurisdiction-specific mandatory-clause checks must remain at 100% pass.
A change that fails the gate does not ship. It goes back to the engineer or prompt author with a diff of which entries regressed and why. This is how we keep confidence in the system as it evolves — not by hoping the changes are good, but by measuring.
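The four gate rules can be sketched as a single function that returns the reasons a change fails. The `EvalRun` shape, the check names, and the sample numbers are hypothetical, and the per-entry threshold is left as a parameter because the text above does not state its value.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    """Scores from one evaluation run (hypothetical shape)."""
    entry_scores: dict[str, float]        # golden-set entry id -> rubric score
    aggregate_score: float                # importance-weighted aggregate
    critical_checks: dict[str, bool]      # critical-clause check -> passed
    jurisdiction_checks: dict[str, bool]  # mandatory-clause check -> passed


def gate_failures(baseline: EvalRun, candidate: EvalRun, max_entry_drop: float) -> list[str]:
    """Return human-readable reasons the candidate fails the gate; empty means pass."""
    failures = []
    for entry_id, base in sorted(baseline.entry_scores.items()):
        new = candidate.entry_scores.get(entry_id)
        if new is None:
            failures.append(f"{entry_id}: missing from candidate run")
        elif base - new > max_entry_drop:
            failures.append(f"{entry_id}: dropped {base - new:.2f} (limit {max_entry_drop:.2f})")
    if candidate.aggregate_score < baseline.aggregate_score:
        failures.append("aggregate score regressed")
    for name, passed in sorted(candidate.critical_checks.items()):
        if not passed:
            failures.append(f"critical clause check failed: {name}")
    for name, passed in sorted(candidate.jurisdiction_checks.items()):
        if not passed:
            failures.append(f"jurisdiction check failed: {name}")
    return failures


# Made-up runs: scores improve, but one critical-clause check fails.
all_pass = {"indemnity": True, "limitation_of_liability": True,
            "governing_law": True, "execution_block": True}
baseline_run = EvalRun({"nda-vendor": 4.5, "services-eu": 4.0}, 4.3, all_pass, {"eu-mandatory": True})
candidate_run = EvalRun(
    {"nda-vendor": 4.6, "services-eu": 4.0},
    4.4,
    {**all_pass, "limitation_of_liability": False},
    {"eu-mandatory": True},
)

failures = gate_failures(baseline_run, candidate_run, max_entry_drop=0.25)
ships = not failures
```

Returning a list of reasons rather than a single boolean gives the author the kind of diff described above: which entries or checks regressed, not just that the gate closed.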
What we do not claim
We do not claim that our pipeline replaces a lawyer. We claim that, for the kinds of contracts we generate, we have a measurable, repeatable, and conservative understanding of how well the pipeline performs — and that our system is designed to refuse to ship regressions. When a contract needs true legal judgment, we route it to human review. The evaluation system exists to make sure the routing decision itself is based on evidence, not hope.
If you want a longer read on the trust model behind this — how we handle audit logs, provenance, and the boundary between AI proposal and human approval — see our trust page.
This article is for general information only and is not legal advice.