How we measure whether our contracts are any good.
Most legal-AI tools ask you to trust the output. We would rather show our work. This page describes, in plain terms, the pipeline our generator runs and the eval harness we grade it with — the same rubric our lawyers maintain and the same numbers our engineers see on every build.
If you only read one paragraph
Every contract our AI generates passes through a seven-stage agentic pipeline. At stage six the draft is scored against a rubric specific to its contract type and jurisdiction. These are the same rubrics our lawyers write and maintain, codified as deterministic checks a machine can evaluate. A score below 4.0 out of 5.0, or a missing required clause, triggers a repair pass, and repair is capped at one iteration to bound cost. Separately, the same rubric suite runs as an eval harness against a curated golden set on every model or prompt change, and any regression of more than 0.3 points fails the build. The generator is not graded by vibes. It is graded by a ruler, and the ruler is public.
Seven stages, one expensive model, zero guessing
Each stage produces a typed output the next stage can reason about. Cheap classifiers run first; the expensive model runs only when the inputs are ready.
- 01
Classify intent
Haiku · The user prompt is parsed into a typed IntentSpec: contract type, jurisdiction, parties, purpose, term, and any special clauses requested. This cheap classifier converts free text into structured input the rest of the pipeline can reason about.
- 02
Gate · missing info
Deterministic · Before any model writes a word, the IntentSpec is checked against the minimum-viable fields for that contract type. Anything missing returns a clarifying question to the user instead of a guess. The generator prefers asking over acting.
- 03
Select template
Retrieval · A lawyer-vetted template is looked up from the template library, keyed on (contract type × jurisdiction). If no jurisdiction-specific template exists, the generator falls back to a generic one and records the fallback in provenance.
- 04
Retrieve clauses
Retrieval + vector · Jurisdiction-specific clauses are pulled from the clause library: governing law, dispute resolution, data-protection overlays (GDPR when an EU party is present), statutory carveouts. Retrieval is scoped workspace-first, then organization, then global.
- 05
Draft
Sonnet · The template is populated with the IntentSpec and the retrieved clauses. Apart from the optional repair pass at stage seven, this is the only stage that runs the expensive model, and it produces a structured Working Representation: blocks with stable UUIDs rather than raw prose.
- 06
Critique
Haiku · The draft is scored against the rubric for that (contract type × jurisdiction). Required clauses present? Term and survival consistent? Enforceability red flags? Output is a structured CritiqueReport with a composite score and per-clause findings.
- 07
Repair
Sonnet · If the critique score is below threshold (or any required clause is missing), the generator runs one, and only one, repair pass against the findings. A second failure surfaces to the user as a warning instead of looping forever. Repair is hard-capped at one iteration to bound cost.
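The gate at stage two and the repair cap at stage seven can be sketched roughly as follows. This is a minimal illustration, not our production code: the field list in `REQUIRED_FIELDS`, the shape of `IntentSpec` and `CritiqueReport`, and the `critique` and `repair` callables are all assumptions made for the example. Only the 4.0 threshold and the one-iteration cap come from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical minimum-viable fields per contract type (illustrative only).
REQUIRED_FIELDS = {"mutual_nda": ("jurisdiction", "parties", "purpose", "term")}

SCORE_THRESHOLD = 4.0  # from the text: a score below 4.0 of 5.0 triggers repair
MAX_REPAIRS = 1        # from the text: repair is hard-capped at one iteration


@dataclass
class IntentSpec:
    contract_type: str
    jurisdiction: Optional[str] = None
    parties: tuple = ()
    purpose: Optional[str] = None
    term: Optional[str] = None


@dataclass
class CritiqueReport:
    score: float
    missing_required: list = field(default_factory=list)

    def passes(self) -> bool:
        # A draft passes only if it clears the threshold and misses no required clause.
        return self.score >= SCORE_THRESHOLD and not self.missing_required


def missing_fields(spec: IntentSpec) -> list:
    """Stage 2: names of required fields that are empty. No model is involved."""
    required = REQUIRED_FIELDS.get(spec.contract_type, ())
    return [name for name in required if not getattr(spec, name)]


def critique_and_repair(draft, critique, repair):
    """Stages 6 and 7: critique, then at most MAX_REPAIRS repair passes.

    Returns (draft, report, warning). warning is True when the draft
    still fails after the repair cap has been reached.
    """
    report = critique(draft)
    repairs = 0
    while not report.passes() and repairs < MAX_REPAIRS:
        draft = repair(draft, report)
        report = critique(draft)
        repairs += 1
    return draft, report, not report.passes()
```

In this sketch a non-empty result from `missing_fields` would be turned into a clarifying question rather than passed downstream, and a `warning` of `True` is what would surface to the user after a failed repair.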
Five layers, most of them free
No single metric catches everything. We stack five cheap signals instead of over-investing in one expensive one.
Deterministic rubric
Per (contract type × jurisdiction), a checklist of required clauses evaluated pass/fail by deterministic checks. No model in the loop. Catches roughly eighty percent of failure modes at zero marginal cost and is the backbone of the whole stack.
LLM-as-judge
Haiku scores each draft on five axes — completeness, legal soundness, clarity, jurisdiction fit, enforceability red flags — and returns a structured report. Runs against every prompt in the golden set on every change.
Golden set diff
A curated bank of attorney-reviewed reference prompts, one per representative shape. Drafts are compared against gold on structural properties — required clauses present, jurisdictional markers correct — rather than exact text.
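A structural comparison of this kind might look like the sketch below. The `StructuralProfile` type, its fields, and the clause identifiers are hypothetical names chosen for illustration. The point is only that the diff looks at which clauses and jurisdictional markers are present, not at the wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralProfile:
    clause_ids: frozenset  # e.g. {"parties", "governing_law"} (hypothetical ids)
    governing_law: str     # jurisdictional marker extracted from the draft


def structural_diff(draft: StructuralProfile, gold: StructuralProfile) -> dict:
    """Compare a draft with its gold reference on structure, not exact text."""
    return {
        "missing_clauses": sorted(gold.clause_ids - draft.clause_ids),
        "extra_clauses": sorted(draft.clause_ids - gold.clause_ids),
        "jurisdiction_mismatch": draft.governing_law != gold.governing_law,
    }
```

Two drafts with entirely different wording produce an empty diff as long as they contain the same clauses and name the same governing law.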
Adversarial prompts
Deliberately ambiguous cases: missing party info, conflicting jurisdictions, contradictory terms. The test is whether the generator asks a clarifying question or guesses. The rule of the house is that it should ask. The eval enforces it.
Regression gate
On every model or prompt change the full suite runs and scores are compared against the last stored baseline. Any regression greater than 0.3 points fails the build. Improvements ship only if they do not quietly break anything.
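The gate itself reduces to a comparison against stored scores. The sketch below assumes scores are kept per golden-set case in a plain dictionary and that the 0.3 tolerance is applied case by case. Both are assumptions for illustration, since the same check could equally run on an aggregate score.

```python
MAX_REGRESSION = 0.3  # from the text: a drop greater than 0.3 points fails the build


def regressions(baseline: dict, current: dict, tolerance: float = MAX_REGRESSION) -> dict:
    """Return the cases whose score dropped by more than `tolerance`.

    Keys are case ids (hypothetical), values are the size of the drop.
    A case missing from `current` is reported with a value of None.
    """
    failures = {}
    for case_id, old_score in baseline.items():
        new_score = current.get(case_id)
        if new_score is None:
            failures[case_id] = None
            continue
        # Round to avoid float noise turning an exact 0.3 drop into a failure.
        drop = round(old_score - new_score, 6)
        if drop > tolerance:
            failures[case_id] = drop
    return failures


def build_passes(baseline: dict, current: dict) -> bool:
    return not regressions(baseline, current)
```

A drop of exactly 0.3 passes in this sketch, because the rule is stated as "greater than 0.3".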
What the rubric actually checks
One rubric, read in full. Everything below is the literal checklist our critique stage runs against a Mutual NDA under Delaware law. Other rubrics follow the same shape.
- 01 · Parties block with addresses (required): Both parties named with full legal name and principal address.
- 02 · Confidential Information definition (required): Non-exhaustive list plus a catch-all "would be understood as confidential" provision.
- 03 · Obligations of receiving party (required): Hold in confidence, do not disclose, do not use beyond purpose, explicit standard of care.
- 04 · Standard carveouts (required): Public knowledge, prior knowledge, independent development, third-party disclosure, legal compulsion.
- 05 · Term and survival (required): Term is explicit and the survival period does not silently extend confidentiality past the requested term.
- 06 · Governing law clause (required): Governing law matches the jurisdiction in the IntentSpec.
- 07 · Remedies and equitable relief (required): Injunction or specific-performance language present, because damages alone rarely cure a breach of confidentiality.
- 08 · Counterparts clause (conditional): Permits counterpart and electronic execution.
- 09 · GDPR overlay, when an EU party is present (conditional): Personal-data handling declared under GDPR, controller versus processor role noted, Schrems II transfer mechanism referenced when data flows to the US.
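A checklist like this can be held as plain data and evaluated without a model. The sketch below encodes three of the nine items with regex checks. The patterns, the item ids, and the `eu_party` flag are illustrative assumptions and are far cruder than anything we would rely on. What the sketch shows is the shape: each item is either required or conditional, and a conditional item is only evaluated when its condition holds.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RubricItem:
    item_id: str
    pattern: str                                      # illustrative regex only
    required: bool = True
    applies: Optional[Callable[[dict], bool]] = None  # condition for conditional items


# Hypothetical excerpt of a Mutual NDA rubric under Delaware law.
MUTUAL_NDA_US_DE = (
    RubricItem("governing_law", r"governed by the laws of the State of Delaware"),
    RubricItem("equitable_relief", r"injunctive relief|specific performance"),
    RubricItem(
        "gdpr_overlay",
        r"GDPR|General Data Protection Regulation",
        required=False,
        applies=lambda intent: bool(intent.get("eu_party")),
    ),
)


def evaluate(text: str, intent: dict, rubric) -> dict:
    """Return 'pass', 'fail', or 'n/a' per rubric item. No model in the loop."""
    results = {}
    for item in rubric:
        if item.applies is not None and not item.applies(intent):
            results[item.item_id] = "n/a"
            continue
        found = re.search(item.pattern, text, flags=re.IGNORECASE) is not None
        results[item.item_id] = "pass" if found else "fail"
    return results


def missing_required(results: dict, rubric) -> list:
    """Ids of required items that failed."""
    return [i.item_id for i in rubric if i.required and results[i.item_id] == "fail"]
```

Surface-string checks of this kind are cheap and predictable, and they cannot tell whether a clause means what it should.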
Where the rubrics are today
A filled cell is a shipped, running rubric. A half-filled cell is a rubric under review. A blank cell is an honest gap — we will not pretend to score what we have not measured.
[Coverage matrix. Rows (contract types): Mutual NDA, MSA, Freelance SOW, Employment, Service agreement, SaaS agreement. Columns (jurisdictions): US · DE, US · CA, US · NY, UK, EU · DE, EU · FR, ES, AU, CA, IN. The filled, half-filled, and blank cell markers are graphical and are not reproduced in this text version.]
What this does not do
Rubric coverage is expanding. Not every (contract type × jurisdiction) pair is scored yet. The matrix above is the honest map — if a pair is blank, it means we have not committed to a rubric we would defend in front of a lawyer.
The deterministic layer is regex-level today. A structural LLM-judge replacement that reasons about clause semantics rather than surface strings is under active development, and will run alongside the regex checks before it replaces them.
Repair is capped at one iteration. A second failure surfaces to the user as a warning instead of looping. This bounds cost and prevents the generator from quietly masking a problem by trying again until the judge stops complaining.
This is not legal advice. A high-score draft is a better starting point than a blank page, not a substitute for a lawyer on anything high-stakes. For material decisions, route the draft through our attorney review marketplace or your own counsel before signing.
Trust is the product
Legal work does not tolerate black boxes. A contract either holds up or it does not, and the reader of that contract — a judge, a counterparty, a future version of you — does not care how clever the generator was. What they care about is whether the clauses are there, whether they say what they should, and whether someone is willing to put their name on the method. We publish this page so that the answer to the last question is yes.
For the long-form engineering write-up of the pipeline, including the trade-offs we considered and the ones we rejected, read the companion post from our engineering team.
Try it against something real.
The methodology above only matters if the output holds up. Draft a contract, run the rubric, read the critique. That is the product.