Contracts.io · Quality methodology

How we measure whether our contracts are any good.

Most legal-AI tools ask you to trust the output. We would rather show our work. This page describes, in plain terms, the pipeline our generator runs and the eval harness we grade it with — the same rubric our lawyers maintain and the same numbers our engineers see on every build.

Latest verified run · T-01

4.5 / 5.00

Mutual NDA · US · Delaware

verified 2026-04-14 · $0.034 / run · ~33s latency

The short answer

If you only read one paragraph

Every contract our AI generates passes through a seven-stage agentic pipeline. At stage six the draft is scored against a rubric specific to its contract type and jurisdiction. These are the same rubrics our lawyers write and maintain, codified as deterministic checks a machine can evaluate. Any score below 4.0 out of 5.0, or any missing required clause, triggers a repair pass, capped at one iteration to bound cost. Separately, the same rubric suite runs as an eval harness against a curated golden set on every model or prompt change, and any regression of more than 0.3 points fails the build. The result is that the generator is not graded by vibes. It is graded by a ruler, and the ruler is public.

The pipeline

Seven stages, one expensive model, zero guessing

Each stage produces a typed output the next stage can reason about. Cheap classifiers run first; the expensive model runs only when the inputs are ready.

01
Classify intent
Haiku
The user prompt is parsed into a typed IntentSpec: contract type, jurisdiction, parties, purpose, term, and any special clauses requested. This is a cheap classifier that converts free text into structured input the rest of the pipeline can reason about.
02
Gate · missing info
Deterministic
Before any model writes a word, the IntentSpec is checked against the minimum-viable fields for that contract type. Anything missing returns a clarifying question to the user instead of a guess. The generator prefers asking over acting.
03
Select template
Retrieval
A lawyer-vetted template is looked up from the template library, keyed on (contract type × jurisdiction). If no jurisdiction-specific template exists, the generator falls back to a generic one and records the fallback in provenance.
04
Retrieve clauses
Retrieval + vector
Jurisdiction-specific clauses are pulled from the clause library: governing law, dispute resolution, data-protection overlays (GDPR when an EU party is present), statutory carveouts. Retrieval is scoped workspace-first, then organization, then global.
05
Draft
Sonnet
The template is populated with the IntentSpec and the retrieved clauses. This is the only stage that runs the expensive model on every request (stage 07 calls it again only when a repair is needed), and it produces a structured Working Representation: blocks with stable UUIDs rather than raw prose.
06
Critique
Haiku
The draft is scored against the rubric for that (contract type × jurisdiction). Required clauses present? Term and survival consistent? Enforceability red flags? Output is a structured CritiqueReport with a composite score and per-clause findings.
07
Repair
Sonnet
If the critique score is below threshold (or any required clause is missing), the generator runs one — and only one — repair pass against the findings. A second failure surfaces to the user as a warning instead of looping forever. Repair is hard-capped at one iteration to bound cost.
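
The control flow of stages 02, 06, and 07 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production code: the field names in `REQUIRED_FIELDS`, the `draft`, `critique`, and `repair` callables, and the `PipelineResult` shape are all hypothetical. Only the 4.0 threshold and the single repair pass come from the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

THRESHOLD = 4.0          # from the text: scores below 4.0 trigger repair
MAX_REPAIR_PASSES = 1    # from the text: repair is hard-capped at one pass

# Hypothetical minimum-viable fields per contract type (illustrative only).
REQUIRED_FIELDS = {"mutual_nda": ["jurisdiction", "parties", "purpose", "term"]}


@dataclass
class Critique:
    score: float
    missing_required: list = field(default_factory=list)

    def passes(self) -> bool:
        return self.score >= THRESHOLD and not self.missing_required


@dataclass
class PipelineResult:
    question: Optional[str] = None   # set when the gate asks instead of drafting
    draft: Optional[str] = None
    warning: Optional[str] = None    # set when the draft still fails after repair


def missing_fields(intent: dict) -> list:
    """Stage 02: deterministic check, no model involved."""
    required = REQUIRED_FIELDS.get(intent.get("contract_type"), [])
    return [name for name in required if not intent.get(name)]


def run(
    intent: dict,
    draft: Callable[[dict], str],
    critique: Callable[[str], Critique],
    repair: Callable[[str, Critique], str],
) -> PipelineResult:
    missing = missing_fields(intent)
    if missing:
        # Ask instead of guessing; no model has been called yet.
        return PipelineResult(question=f"Please provide: {', '.join(missing)}")

    text = draft(intent)
    report = critique(text)
    for _ in range(MAX_REPAIR_PASSES):
        if report.passes():
            break
        text = repair(text, report)
        report = critique(text)

    # A draft that still fails is returned with a warning rather than retried.
    warning = None if report.passes() else f"Draft scored {report.score:.1f} after repair"
    return PipelineResult(draft=text, warning=warning)
```

Passing the model calls in as plain callables keeps the gate and the repair cap testable without any model in the loop.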

The eval harness

Five layers, most of them free

No single metric catches everything. We stack five cheap signals instead of over-investing in one expensive one.

Deterministic rubric

free

Per (contract type × jurisdiction), a checklist of required clauses evaluated pass/fail by deterministic checks. No model in the loop. Catches roughly eighty percent of failure modes at zero marginal cost and is the backbone of the whole stack.

LLM-as-judge

~$0.01 / eval

Haiku scores each draft on five axes — completeness, legal soundness, clarity, jurisdiction fit, enforceability red flags — and returns a structured report. Runs against every prompt in the golden set on every change.

Golden set diff

free

A curated bank of attorney-reviewed reference prompts, one per representative shape. Drafts are compared against gold on structural properties — required clauses present, jurisdictional markers correct — rather than exact text.
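
A structural comparison of this kind might look like the sketch below. The `Structure` type and its two fields are assumptions made for illustration; how clauses are actually detected in a draft is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Structure:
    clauses: frozenset        # names of clauses detected in the document
    governing_law: str        # jurisdictional marker, e.g. "US-DE"


def structural_diff(gold: Structure, draft: Structure) -> list:
    """List structural mismatches; an empty list means the draft matches gold."""
    problems = [f"missing clause: {name}" for name in sorted(gold.clauses - draft.clauses)]
    if draft.governing_law != gold.governing_law:
        problems.append(
            f"governing law: expected {gold.governing_law}, got {draft.governing_law}"
        )
    return problems
```

Extra clauses in the draft are not flagged in this sketch, since the comparison is on required structure rather than exact text.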

Adversarial prompts

~$0.02 / prompt

Deliberately ambiguous cases: missing party info, conflicting jurisdictions, contradictory terms. The test is whether the generator asks a clarifying question or guesses. The rule of the house is that it should ask. The eval enforces it.
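
One way to score this layer, sketched under assumptions: the `Outcome` type and the `generate` callable are hypothetical stand-ins for whatever the generator actually returns, and the pass criterion is simply "did it ask?".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Outcome:
    kind: str      # "question" when the generator asks, "draft" when it writes
    text: str


def adversarial_pass_rate(prompts: list, generate: Callable[[str], Outcome]) -> float:
    """Fraction of ambiguous prompts answered with a clarifying question."""
    if not prompts:
        raise ValueError("need at least one adversarial prompt")
    asked = sum(1 for p in prompts if generate(p).kind == "question")
    return asked / len(prompts)
```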

Regression gate

free

On every model or prompt change the full suite runs and scores are compared against the last stored baseline. Any regression greater than 0.3 points fails the build. Improvements ship only if they do not quietly break anything.
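
The gate itself is simple arithmetic. The sketch below assumes scores are stored per golden-set case and compared case by case; the text does not say whether the real comparison is per case or on an aggregate, and the function names and the handling of missing cases are hypothetical.

```python
import sys

MAX_REGRESSION = 0.3  # from the text: a drop of more than 0.3 points fails the build


def regressions(baseline: dict, current: dict, tolerance: float = MAX_REGRESSION) -> dict:
    """Map each regressed case id to its score drop (baseline minus current)."""
    drops = {}
    for case_id, old_score in baseline.items():
        # Assumption: a case missing from the current run counts as a score of 0.0.
        drop = old_score - current.get(case_id, 0.0)
        # Small epsilon so float noise at a drop of exactly 0.3 does not trip the gate.
        if drop > tolerance + 1e-9:
            drops[case_id] = round(drop, 2)
    return drops


def gate(baseline: dict, current: dict) -> int:
    """Return a process exit code: 0 to pass the build, 1 to fail it."""
    failed = regressions(baseline, current)
    for case_id, drop in sorted(failed.items()):
        print(f"REGRESSION {case_id}: -{drop}", file=sys.stderr)
    return 1 if failed else 0
```

In a CI job, the return value of `gate` would be passed to `sys.exit` so that a regression fails the build.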

Show your work

What the rubric actually checks

One rubric, read in full. Everything below is the literal checklist our critique stage runs against a Mutual NDA under Delaware law. Other rubrics follow the same shape.

Mutual NDA

United States · Delaware

7 required · 2 conditional

01
Parties block with addresses
required
Both parties named with full legal name and principal address.
02
Confidential Information definition
required
Non-exhaustive list plus a catch-all "would be understood as confidential" provision.
03
Obligations of receiving party
required
Hold in confidence, do not disclose, do not use beyond purpose, explicit standard of care.
04
Standard carveouts
required
Public knowledge, prior knowledge, independent development, third-party disclosure, legal compulsion.
05
Term and survival
required
Term is explicit and the survival period does not silently extend confidentiality past the requested term.
06
Governing law clause
required
Governing law matches the jurisdiction in the IntentSpec.
07
Remedies and equitable relief
required
Injunction or specific-performance language present, because damages alone rarely cure a breach of confidentiality.
08
Counterparts clause
conditional
Permits counterpart and electronic execution.
09
GDPR overlay (when an EU party is present)
conditional
Personal-data handling declared under GDPR, controller versus processor role noted, and a Schrems II-compliant transfer mechanism referenced when personal data is transferred to the US.
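
A checklist like the one above can be expressed as data plus pattern checks. The sketch below is illustrative only: the `Check` type and `evaluate` function are hypothetical, the regex patterns are invented stand-ins rather than the real rubric patterns, and only a subset of the nine items is shown.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    pattern: str          # hypothetical regex, not the production pattern
    required: bool = True


# A subset of the Mutual NDA checklist, with illustrative patterns.
MUTUAL_NDA_US_DE = [
    Check("Confidential Information definition", r"confidential information.{0,40}means"),
    Check("Standard carveouts", r"independently developed"),
    Check("Governing law clause", r"laws of the state of delaware"),
    Check("Remedies and equitable relief", r"injunctive relief|specific performance"),
    Check("Counterparts clause", r"counterparts", required=False),
]


def evaluate(draft: str, rubric: list) -> dict:
    """Return pass/fail per check plus the names of missing required clauses."""
    results = {
        check.name: re.search(check.pattern, draft, re.IGNORECASE | re.DOTALL) is not None
        for check in rubric
    }
    missing = [c.name for c in rubric if c.required and not results[c.name]]
    return {"results": results, "missing_required": missing}
```

Surface-string checks of this kind are cheap and repeatable, but they only confirm that expected wording is present, not that the clause means what it should.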

Current coverage

Where the rubrics are today

A filled cell is a shipped, running rubric. A half-filled cell is a rubric under review. A blank cell is an honest gap — we will not pretend to score what we have not measured.

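[Coverage matrix. Rows (contract type): Mutual NDA, MSA, Freelance SOW, Employment, Service agreement, SaaS agreement. Columns (jurisdiction): US · DE, US · CA, US · NY, UK, EU · DE, EU · FR, ES, AU, CA, IN. Legend: full, partial, not yet. Cell states are rendered graphically and are not reproduced in this text version.]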

Known limitations

What this does not do

Rubric coverage is expanding. Not every (contract type × jurisdiction) pair is scored yet. The matrix above is the honest map — if a pair is blank, it means we have not committed to a rubric we would defend in front of a lawyer.
The deterministic layer is regex-level today. A structural LLM-judge replacement that reasons about clause semantics rather than surface strings is under active development, and will run alongside the regex checks before it replaces them.
Repair is capped at one iteration. A second failure surfaces to the user as a warning instead of looping. This bounds cost and prevents the generator from quietly masking a problem by trying again until the judge stops complaining.
This is not legal advice. A high-score draft is a better starting point than a blank page, not a substitute for a lawyer on anything high-stakes. For material decisions, route the draft through our attorney review marketplace or your own counsel before signing.

Why publish this

Trust is the product

Legal work does not tolerate black boxes. A contract either holds up or it does not, and the reader of that contract — a judge, a counterparty, a future version of you — does not care how clever the generator was. What they care about is whether the clauses are there, whether they say what they should, and whether someone is willing to put their name on the method. We publish this page so that the answer to the last question is yes.

For the long-form engineering write-up of the pipeline, including the trade-offs we considered and the ones we rejected, read the companion post from our engineering team.

Read: How we measure AI contract quality →

Try it against something real.

The methodology above only matters if the output holds up. Draft a contract, run the rubric, read the critique. That is the product.


Published 2026-04-14 · Contracts.io Engineering
