AI and Formal Verification

2026-04-04

I have been trying to get LLMs to write formal verification proofs for about two years, most recently as part of my work proving C code from bitcoin-core/secp256k1 correct. Every attempt failed: different models through Copilot Chat would produce plausible-looking tactic sequences that did not type-check, hallucinated lemma names, or got stuck in loops. I kept trying because the idea is so compelling: as others have predicted, formal verification and AI form a natural synergy. Proofs are machine-checkable, so an LLM can try things and get immediate feedback on whether they work.

It finally worked.

What changed

The key was the setup. Anthropic’s Opus 4.6 via Claude Code, equipped with rocq-mcp and a sandboxed Python runtime, was the first combination that could prove lemmas fully autonomously. rocq-mcp exposes Rocq’s proof state through the Model Context Protocol, letting the agent try a tactic, inspect the resulting goal, search for lemmas, and retry iteratively. This tight feedback loop is what makes it work: the model does not have to get it right on the first try.
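In outline, the loop looks something like the sketch below. It is deliberately a toy: `try_tactic` is a stand-in for the real rocq-mcp tools (whose API I am not reproducing here), and the "checker" only knows one hard-coded proof. But the shape of the interaction, propose a tactic, check it, inspect the new goal, retry on failure, is the point.

```python
# Illustrative sketch of the tactic feedback loop. NOT the real rocq-mcp
# API: try_tactic stands in for the proof-state tools the agent actually
# calls, and this toy checker only knows a single scripted proof.

def try_tactic(goal, tactic):
    """Toy stand-in for Rocq: accept one scripted tactic per goal."""
    script = {
        "(a + b * (x - B)) mod B = (a + b * x) mod B":
            ("replace (a + b * (x - B)) with (a + b * x + (-b) * B) by ring",
             "(a + b * x + (-b) * B) mod B = (a + b * x) mod B"),
        "(a + b * x + (-b) * B) mod B = (a + b * x) mod B":
            ("rewrite Z_mod_plus_full",
             "(a + b * x) mod B = (a + b * x) mod B"),
        "(a + b * x) mod B = (a + b * x) mod B":
            ("reflexivity", None),  # None means no goals remain
    }
    expected, next_goal = script[goal]
    if tactic != expected:
        return {"ok": False, "error": f"tactic failed on: {goal}"}
    return {"ok": True, "goal": next_goal}

def prove(goal, propose):
    """Try candidate tactics until the goal is closed, keeping what works."""
    proof = []
    while goal is not None:
        for tactic in propose(goal):           # model proposes candidates
            result = try_tactic(goal, tactic)  # immediate, checked feedback
            if result["ok"]:
                proof.append(tactic)
                goal = result["goal"]
                break
        else:
            raise RuntimeError(f"stuck on: {goal}")
    return proof
```

In the real setup, `propose` is the model itself and the error messages from failed tactics feed back into its context, which is exactly what Copilot's one-shot completions never gave it.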

The same model through Copilot performed significantly worse. I can only speculate, but I suspect the difference lies in Claude Code's context management.

While writing this post I came across a paper released two weeks ago: Baudart et al. (2026) used a similar setup with the same rocq-mcp tooling to autonomously prove 10 of 12 problems from the 2025 Putnam Mathematical Competition in Rocq. These are genuinely hard competition problems, far beyond what I needed. Their system used Claude Opus 4.6 with 141 subagents over 17.7 hours of compute, consuming roughly 1.9 billion tokens. This demonstrates that the MCP-based feedback loop scales to much harder problems when given enough resources.

The lemmas it proved for my project are not difficult. They are straightforward modular arithmetic facts, each needing only a handful of tactics:

(** General modular folding: subtracting multiples of [B] inside 
    a product does not change the residue. *)
Lemma Zmod_mul_sub_base : forall a b x B : Z,
  (a + b * (x - B)) mod B = (a + b * x) mod B.
Proof.
  intros a b x B.
  replace (a + b * (x - B)) with (a + b * x + (-b) * B) by ring.
  rewrite Z_mod_plus_full.
  reflexivity.
Qed.

(** Adding [y mod B] and [z mod B] instead of [y] and [z] does not
    change the result modulo [B]. *)
Lemma Zmod_add_mod_idemp : forall B x y z,
  B > 0 ->
  (x + y mod B + z mod B) mod B = (x + y + z) mod B.
Proof.
  intros B x y z HB.
  rewrite (Z_div_mod_eq_full y B) at 2.
  rewrite (Z_div_mod_eq_full z B) at 2.
  replace (x + (B * (y / B) + y mod B) + (B * (z / B) + z mod B))
    with (x + y mod B + z mod B + (y / B + z / B) * B) by ring.
  rewrite Z_mod_plus_full.
  reflexivity.
Qed.

The value is not that these lemmas are hard, but that the agent proved them without any human guidance. Opus performed well on these pure Z (integer arithmetic) lemmas but still poorly on VST proofs, which is to be expected given the limited training data for VST-specific tactics. It would be exciting to see whether this approach can be extended to formal verification of C programs as well.

Proof cleanup and practical tips

Beyond proving new lemmas, LLMs also proved useful for proof cleanup. My workflow tends to be messy: I push through a large proof with copy-pasted sections, inconsistent naming, and poor structure until I reach Qed, then clean up afterwards. LLMs are well suited to this. They can identify repeating patterns in a thousand-line proof script, extract helper lemmas, and apply consistent naming.
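As a contrived illustration of the kind of extraction this enables (the lemma name, statement, and hypothesis name here are invented for the example, not taken from my actual scripts):

```coq
(* Instead of repeating an assertion like this inline at every call site... *)
assert (Hfold : (a + k * B) mod B = a mod B)
  by (rewrite Z_mod_plus_full; reflexivity).

(* ...the pattern can be pulled out once as a named helper lemma: *)
Lemma mod_drop_multiple : forall a k B : Z,
  (a + k * B) mod B = a mod B.
Proof. intros; apply Z_mod_plus_full. Qed.
```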

By default, the proofs it writes also tend to use unnecessarily complex syntax, which I mitigated reasonably well by adding a STYLE.md file with preferred conventions.
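A few rules in the spirit of what worked for me (these lines are reconstructed examples of the kind of guidance that helps, not the literal contents of my file):

```markdown
# Proof style

- Prefer `reflexivity` over `auto` when the goal is a trivial equality.
- Prefer `rewrite ... by tac` over a separate `assert`/`rewrite` pair.
- Structure subgoals with bullets (`-`, `+`, `*`), one bullet per goal.
- Name hypotheses explicitly (`intros a b HB`); never rely on generated names.
```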

A modular setup with only one or a few proofs per file not only improves recompile times but also helps Claude manage context. rocq-mcp with coq-lsp does not work reliably on large proof files, so I instructed agents to use scratch files when necessary.