FOL Chain-of-Thought Verification — a pipeline that verifies logical entailment by translating first-order logic into TPTP format and running an automated theorem prover. Supports gold FOL annotations, LLM-generated FOL, and chain-of-thought verification with per-step proving.
```sh
uv sync
brew install eprover
```

Fetch the FOLIO dataset:

```sh
mkdir -p data
git clone https://github.com/Yale-LILY/FOLIO data/folio
uv run python scripts/download_folio_v2.py  # fetches v2 from HuggingFace
```

For the LLM modes (`llm`, `cot`), configure AWS credentials for Bedrock access (e.g. `aws sso login` or `~/.aws/credentials`).
Run against gold FOL annotations from the FOLIO dataset:

```sh
uv run python -m cotlog.eval --mode gold
```

Have an LLM translate natural language to FOL, then verify via E-prover:

```sh
uv run python -m cotlog.eval --mode llm --limit 10 -v
```

Run CoT verification over FOLIO examples and report verification statistics (not accuracy — CoT checks internal reasoning consistency, not agreement with gold labels):

```sh
uv run python -m cotlog.eval --mode cot --limit 10 -v
```

Verify any natural language argument, independent of the FOLIO dataset:
```sh
# From CLI flags
uv run python -m cotlog.cot \
  --premise "All humans are mortal." \
  --premise "Socrates is human." \
  --conclusion "Socrates is mortal."

# From a JSON file
uv run python -m cotlog.cot input.json

# Machine-readable output
uv run python -m cotlog.cot --json \
  --premise "All humans are mortal." \
  --premise "Socrates is human." \
  --conclusion "Socrates is mortal."
```

The JSON file format is `{"premises": ["..."], "conclusion": "..."}`. Use `--max-retries N` to control feedback loop iterations, `--model` to pick a model, and `-v` for verbose step reasoning.
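For example, an `input.json` matching the Socrates argument above would look like:

```json
{
  "premises": ["All humans are mortal.", "Socrates is human."],
  "conclusion": "Socrates is mortal."
}
```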
The refinement loop uses formalization as a diagnostic tool to surface ambiguities in natural language, then refines the premises to remove them. See REFINE.md for the full proposal.
Run on the included examples:
```sh
uv run python -m cotlog --mode refine --data examples/refine-examples.jsonl -v
```

Or run the standalone demo:

```sh
uv run python examples/refine_demo.py
```

To add new examples, append to `examples/refine-examples.jsonl` — same format as FOLIO (the `premises-FOL` and `conclusion-FOL` fields can be empty since the loop generates its own):

```json
{"premises": ["...", "..."], "premises-FOL": [], "conclusion": "...", "conclusion-FOL": "", "label": "True"}
```

Refine-specific options:
```
--max-iterations N       Max refinement iterations (default: 3)
--stability-n N          Independent formalizations per stability check (default: 5)
--stability-threshold F  Structural agreement threshold to stop (default: 0.9)
```
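The stability check can be pictured as sampling N independent formalizations and measuring how many share the same structural skeleton; a minimal sketch under that assumption (`structural_key` and `stability` are illustrative helpers, not the project's API):

```python
from collections import Counter
import re

def structural_key(fol: str) -> str:
    """Collapse a FOL string to its structural skeleton by renaming
    every identifier to a placeholder, so Human(x) and Person(y) match."""
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", "_", fol.replace(" ", ""))

def stability(formalizations: list[str], threshold: float = 0.9):
    """Return (agreement fraction of the most common structure, stop?)."""
    counts = Counter(structural_key(f) for f in formalizations)
    agreement = counts.most_common(1)[0][1] / len(formalizations)
    return agreement, agreement >= threshold

samples = [
    "∀x (Human(x) → Mortal(x))",
    "∀x (Human(x) → Mortal(x))",
    "∀y (Person(y) → Mortal(y))",  # same structure, different names
    "∃x (Human(x) ∧ Mortal(x))",   # different structure
]
score, stable = stability(samples)  # 3 of 4 agree -> (0.75, False)
```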
Claimcheck verifies whether FOL formalizations faithfully capture the original NL by informalizing the FOL back to English and comparing. See CLAIMCHECK.md for details and results.
```sh
# Check LLM-generated FOL against original NL
uv run python -m cotlog.claimcheck --mode llm --data examples/refine-examples.jsonl -v

# Check FOLIO v1 gold FOL annotations against original NL
uv run python -m cotlog.claimcheck --mode gold --limit 20 -v

# Check FOLIO v2 gold FOL annotations
uv run python -m cotlog.claimcheck --mode gold --dataset v2 -v
```

The `--dataset` flag selects between FOLIO versions: `v1` (original, 204 examples) or `v2` (revised, 203 examples). Defaults to `v1` if omitted.
```
--mode MODE       gold (default), llm, cot, or refine
--model NAME      LLM model: sonnet (default), haiku, opus, or a full model ARN
--limit N         Evaluate only the first N examples
--cpu-limit N     E-prover CPU time limit per problem (default: 30s)
--output-dir DIR  Directory for result files (default: results/)
-v, --verbose     Print per-example results
```
Each run writes two files to `results/` (or the directory specified by `--output-dir`):

- `results/{mode}_{YYYYMMDD_HHMMSS}.jsonl` — one JSON object per example
- `results/{mode}_{YYYYMMDD_HHMMSS}.txt` — human-readable summary report
The JSONL file contains per-example records with timing, error info, and mode-specific fields. Inspect with jq:
```sh
# Gold/LLM: find mismatches
cat results/llm_*.jsonl | jq 'select(.correct == false)'

# CoT: find examples with unverified steps
cat results/cot_*.jsonl | jq 'select(.all_steps_verified == false) | {index, conclusion, steps: [.steps[] | select(.verified == false) | {step_num, error, fol_str}]}'
```

**FOL Parser** (`src/cotlog/fol_parser.py`) — tokenizes and parses FOLIO's Unicode FOL notation (∀ ∃ → ∧ ∨ ¬ ⊕ ↔) into an AST via recursive descent.
**TPTP Generator** (`src/cotlog/tptp.py`) — renders the AST to TPTP FOF syntax for E-prover (`∀`→`!`, `∃`→`?`, `→`→`=>`, etc.).
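As an illustration of the operator mapping, here is a simplified string-level sketch (not the project's actual renderer, which works over the AST; this version only handles single-letter variables):

```python
import re

# Unicode connective -> TPTP FOF operator
CONNECTIVES = {"→": "=>", "↔": "<=>", "⊕": "<~>", "∧": "&", "∨": "|", "¬": "~"}

def to_tptp_fof(name: str, role: str, fol: str) -> str:
    """Render a simple Unicode FOL formula as a TPTP FOF line."""
    s = fol
    # Quantifiers: ∀x -> ![X]:, ∃x -> ?[X]:
    s = re.sub(r"∀(\w+)\s*", lambda m: f"![{m.group(1).upper()}]: ", s)
    s = re.sub(r"∃(\w+)\s*", lambda m: f"?[{m.group(1).upper()}]: ", s)
    for uni, tptp in CONNECTIVES.items():
        s = s.replace(uni, tptp)
    # TPTP convention: predicates start lowercase, variables uppercase
    s = re.sub(r"\b([A-Z]\w*)(?=\()", lambda m: m.group(1).lower(), s)
    s = re.sub(r"\b([a-z])\b", lambda m: m.group(1).upper(), s)
    return f"fof({name}, {role}, {s})."

print(to_tptp_fof("p1", "axiom", "∀x (Human(x) → Mortal(x))"))
# fof(p1, axiom, ![X]: (human(X) => mortal(X))).
```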
**Prover** (`src/cotlog/prover.py`) — runs E-prover and implements three-way entailment checking:

- True: the conjecture as-is is a Theorem
- False: the negated conjecture is a Theorem
- Uncertain: neither is provable
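The three-way check boils down to two E-prover runs, one on the conjecture and one on its negation; a minimal sketch of the decision logic (the function name and boolean inputs are illustrative, not the module's API):

```python
def classify(conjecture_is_theorem: bool, negation_is_theorem: bool) -> str:
    """Combine two prover runs into a three-way entailment verdict."""
    if conjecture_is_theorem:
        return "True"       # premises entail the conclusion
    if negation_is_theorem:
        return "False"      # premises entail the conclusion's negation
    return "Uncertain"      # neither provable within the time limit
```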
**LLM Client** (`src/cotlog/llm.py`) — Claude via Bedrock. Configured through the `AWS_REGION` and `CLAUDE_MODEL` env vars.

**FOL Generation** (`src/cotlog/fol_gen.py`) — prompts the LLM with few-shot examples to translate NL premises/conclusion into FOL, then feeds the result through the standard prover pipeline.
**CoT Verification** (`src/cotlog/cot_verify.py`) — multi-turn pipeline: the LLM formalizes all premises and the conclusion into FOL using its own consistent vocabulary, then reasons step-by-step with a FOL formula per step. Each step is verified against the LLM's formalized premises plus previously verified steps, and the conclusion is verified against the full accumulated knowledge. When steps fail, prover errors are fed back to the LLM for revision (up to 2 retries by default). No gold FOL is needed.
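The accumulate-and-retry loop can be sketched as follows, with `prove` and `revise` as stand-ins for the E-prover call and the LLM feedback turn (both hypothetical, not the actual module API):

```python
def verify_chain(premises, steps, prove, revise, max_retries=2):
    """Verify each CoT step against the premises plus all
    previously verified steps, retrying with feedback on failure."""
    verified = []
    for step in steps:
        axioms = premises + verified
        for attempt in range(max_retries + 1):
            result = prove(axioms, step)     # e.g. "Theorem" or an error string
            if result == "Theorem":
                verified.append(step)
                break
            if attempt < max_retries:
                step = revise(step, result)  # feed the prover error back to the LLM
    return verified

# Toy run with a prover stub that accepts everything:
chain = verify_chain(
    ["![X]: (human(X) => mortal(X))", "human(socrates)"],
    ["mortal(socrates)"],
    prove=lambda axioms, conjecture: "Theorem",
    revise=lambda step, err: step,
)
```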
**Claimcheck** (`src/cotlog/claimcheck.py`) — round-trip faithfulness verification: formalizes NL to FOL (or uses gold annotations), informalizes the FOL back to English without seeing the original, then compares the two English texts to find discrepancies. See CLAIMCHECK.md.

**Refinement Loop** (`src/cotlog/refine.py`) — experimental iterative refinement of NL premises via formalization. See REFINE.md for the proposal and lessons learned.
```sh
uv run python -m pytest tests/
```

Gold FOL on FOLIO validation (204 examples): 91.2% accuracy (186/204 correct, 8 parse errors from malformed data).
Running claimcheck on FOLIO's gold FOL annotations found formalization errors in both v1 and v2 of the dataset. Ten error patterns are documented in FOLIO_REPORT.md, with machine-readable evidence in results/FOLIO_EVIDENCE.jsonl and a human-reviewed checklist in folio_review.txt.
To reproduce:
```sh
# Generate evidence from v1 and v2 datasets
uv run python scripts/generate_evidence.py

# Mechanical verification (checks NL/FOL against actual datasets)
uv run python scripts/verify_evidence.py

# Human review checklist
uv run python scripts/verify_evidence.py --human
```

Known limitations:

- No equality support: the FOL parser doesn't handle `=` or `≠`. LLM-generated steps using equality will fail to parse.
- Uncertain conclusions can't be "proved": when the correct answer is Uncertain, the prover returns Uncertain by design (neither the conjecture nor its negation is provable). These aren't verification failures — they're correct, but they don't count as "fully verified" in the current reporting.
- LLM may assume unstated facts: steps like `Employee(james)` that assert facts not in the premises are legitimately unverifiable by the prover.
See DESIGN.md for full architectural details.