Author: Bjorn Melin
Local-first tool that ingests OpenAI Codex CLI histories (~/.codex/**), deduplicates and clusters related prompts, and auto-synthesizes atomic and workflow prompts via the OpenAI Responses API with Structured Outputs. Stores everything in one SQLite database with FTS5 keyword search and sqlite-vec semantic search. Ships a minimal Streamlit UI and a Typer CLI. Optional systemd timer for daily automation on WSL2.
- Codex CLI: https://github.com/openai/codex • Product page: https://developers.openai.com/codex/cli/
- Responses API: https://platform.openai.com/docs/api-reference/responses • Structured Outputs: https://platform.openai.com/docs/guides/structured-outputs
- Embeddings: https://platform.openai.com/docs/guides/embeddings • text-embedding-3-small: https://openai.com/index/new-embedding-models-and-api-updates/
- SQLite FTS5: https://www.sqlite.org/fts5.html • sqlite-vec: https://alexgarcia.xyz/sqlite-vec/python.html
- Streamlit: https://docs.streamlit.io/develop/api-reference • Typer: https://typer.tiangolo.com/ • RapidFuzz: https://rapidfuzz.github.io/RapidFuzz/
- WSL2 systemd: https://learn.microsoft.com/windows/wsl/systemd • systemd timers: https://www.freedesktop.org/software/systemd/man/systemd.timer.html
- Codex Prompt Refinery
- Ingest tolerant JSON/JSONL from
~/.codex/**; pairsuser → assistantwithin session; stores raw JSON for provenance. - Canonicalize & dedupe with SHA-256 and RapidFuzz near-dup gate.
- Hybrid search: FTS5 for keywords + sqlite-vec for vector KNN.
- Cluster related prompts per day by cosine ≥ 0.86.
- Synthesize atomic/workflow prompts using OpenAI Responses API + Structured Outputs.
- UI to browse, specialize, copy, and download prompts.
- Server-side filtering (SQLite): date, roles, sessions, kind (optimized).
- Saved views persisted in SQLite with shareable URLs (
?view=<id>). - Table and Charts tabs; optional AgGrid via
ENABLE_AGGRID=1with automatic fallback. - CLI via Typer.
- Optional daily automation with systemd on WSL2.
- Local only. Single SQLite file (
~/.pdr.sqliteby default). - Libraries first: SQLite FTS5 and sqlite-vec for search; Streamlit for UI; Typer for CLI.
- No servers. No external vector DBs or agents.
flowchart TD
A[Codex JSON/JSONL]
B[SQLite: prompt_raw, FTS5, vec0]
C[Clusters]
D[prompt_optimized, links, embeddings]
E[UI / CLI]
A -->|ingest| B
B -->|cluster| C
C -->|synthesize via Responses API| D
D -->|search| E
CREATE TABLE prompt_raw (
id INTEGER PRIMARY KEY,
source_path TEXT NOT NULL,
session_id TEXT,
conversation_id TEXT,
role TEXT CHECK(role IN ('user','assistant','system')),
ts TEXT,
text TEXT NOT NULL,
canonical_hash TEXT NOT NULL,
raw_json TEXT
);
CREATE TABLE prompt_optimized (
id INTEGER PRIMARY KEY,
kind TEXT CHECK(kind IN ('atomic','workflow')) NOT NULL,
title TEXT,
text_md TEXT NOT NULL,
variables_json TEXT,
io_contract_json TEXT,
rationale TEXT,
created_at TEXT,
cluster_hint TEXT,
gpt_meta_json TEXT
);
CREATE TABLE prompt_link (
optimized_id INTEGER,
raw_id INTEGER,
PRIMARY KEY (optimized_id, raw_id)
);
CREATE VIRTUAL TABLE prompt_raw_fts USING fts5(text);
CREATE VIRTUAL TABLE prompt_opt_fts USING fts5(text_md);
CREATE VIRTUAL TABLE embeddings USING vec0(
embedding FLOAT[1536],
kind TEXT CHECK(kind IN ('raw','optimized')),
item_id INTEGER
);- Python 3.11+
- OpenAI API key (for embeddings + synthesis):
OPENAI_API_KEY - sqlite-vec Python package is auto-loaded; vector search degrades gracefully if unavailable
- WSL2 Ubuntu recommended; zsh or bash
# create venv
uv venv && source .venv/bin/activate
# install (dev + optional UI components)
uv pip install -e ".[dev]"
uv pip install st-copy-to-clipboard streamlit-ace # optional-
Database: environment variable
PDR_DB(default~/.pdr.sqlite). -
Codex CLI histories: defaults to glob
~/.codex/**(JSON/JSONL).- Codex repo: https://github.com/openai/codex • Docs: https://developers.openai.com/codex/cli/
- Config path (varies):
~/.codex/config.tomlor~/.codex/config.jsondepending on version.
-
Embeddings: model
text-embedding-3-small(1536-D). Optional--dimsoverride. -
Synthesis: default model
gpt-5-mini; use--model gpt-5for hard clusters.
# 1) Ingest recent Codex logs
pdr ingest --since 1
# 2) Synthesize prompts for today
pdr synthesize --date "$(date -I)" --model gpt-5-mini
# 3) Launch UI
pdr ui --port 8501# Ingest
pdr ingest --date 2025-09-07
pdr ingest --since 3 --path "~/.codex/**" --path "~/Downloads/codex-exports/*.jsonl"
# Synthesize
pdr synthesize --date 2025-09-07 --model gpt-5
pdr synthesize --date 2025-09-07 --dims 1536
# UI
PDR_DB=~/.pdr.sqlite pdr ui --port 8501- Single page with search input and a sidebar for Filters and Views.
- Tabs: Optimized, Raw, Table, Charts.
- Cards show code-fenced markdown, “Specialize” text area, live preview, Copy, and Download.
- Streamlit docs: https://docs.streamlit.io/develop/api-reference
- Optional components:
- Copy button: https://github.com/mmz-001/st-copy-to-clipboard
- Editor (Ace): https://github.com/okld/streamlit-ace
- Editor (react-based): https://github.com/bouzidanas/streamlit-code-editor
- Default uses native Streamlit rendering; no extra deps.
- Optional AgGrid table: set
ENABLE_AGGRID=1and install:
uv pip install streamlit-aggrid pandas
ENABLE_AGGRID=1 pdr ui --port 8501If AgGrid is missing or errors, the UI automatically falls back to the native table.
- Minimal counts derived from the same filtered results.
- Filters are applied server-side in SQLite for performance and correctness.
- Facets: date range (default last 30 days), roles, sessions, kind (optimized).
- Saved views are persisted in the
ui_viewtable (versioned JSON); views can be shared via URLs like?view=<id>. - Optional
PUBLIC_BASE_URLcan be set to show absolute share links in the sidebar.
- Keyword:
prompt_raw_ftsandprompt_opt_ftsvia FTS5, with filter predicates joined to source tables. - Semantic: sqlite-vec
vec0KNN using query embedding, then post-filtered with the same predicates. - Combined and de-duplicated with simple scoring.
- FTS5: https://www.sqlite.org/fts5.html • sqlite-vec: https://alexgarcia.xyz/sqlite-vec/python.html
- Sends 3–10 related
userprompts plus selectedassistantsnippets to the Responses API using Structured Outputs with a JSON Schema. - Validates response before insert; embeds optimized prompts for semantic search.
- Responses API: https://platform.openai.com/docs/api-reference/responses
- Structured Outputs: https://platform.openai.com/docs/guides/structured-outputs
Enable systemd in WSL:
-
Edit
/etc/wsl.conf:[boot] systemd=true
-
Run
wsl.exe --shutdownfrom Windows PowerShell. -
User services: copy systemd unit files to
~/.config/systemd/user/, then:
systemctl --user daemon-reload
systemctl --user enable --now codex-prompt-refinery.timer
systemctl --user list-timers | grep codex-prompt-refinery- WSL systemd: https://learn.microsoft.com/windows/wsl/systemd
- systemd timers: https://www.freedesktop.org/software/systemd/man/systemd.timer.html
- ruff:
ruff format . && ruff check . --fix - pylint:
pylint --fail-under=9.5 src/pdr tests - pytest + coverage (≥ 80%):
make dev && make test - Synthesis JSON must match the bundled schema (see PRD.md).
See docs/developers/testing.md for a detailed guide.
Quick run:
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
make test # runs pytest with coverage and fail-under 80- No vector table: Ensure
sqlite-vecis importable in Python; otherwise vector search is skipped and FTS still works. - Codex paths: History locations vary across versions; pass explicit
--pathglobs if needed. - API errors: Check
OPENAI_API_KEY. Reduce--modelor cluster size. - Permission errors on WSL2 timers: Confirm systemd is enabled and user units are in
~/.config/systemd/user/. - Share URL not absolute: Set
PUBLIC_BASE_URLto your UI base (e.g.,https://host/app). - AgGrid missing or errors: Unset
ENABLE_AGGRIDor installstreamlit-aggrid+pandas; fallback is automatic.
- Redacts common secrets (
sk-..., AWSAKIA..., long numeric tokens) before storage and API calls. - All data stays local. No network calls except OpenAI API during embeddings/synthesis.
- Open issues and PRs are welcome.
- Run
make lintandmake testbefore submitting. - Keep changes KISS/DRY/YAGNI. Add references to official docs where relevant.
- Roadmap: Import/export
.mdbundles; richer clustering controls; per-project profiles. - Non-Goals: Multi-user SaaS, external vector DBs, background daemons, auth/RBAC.
@software{codex_prompt_refinery_2025,
title = {codex-prompt-refinery},
author = {Melin, Bjorn},
year = {2025},
note = {Local-first Codex CLI prompt refinery with SQLite FTS5 and sqlite-vec},
version = {0.1.1}
}- Codex CLI repo: https://github.com/openai/codex and product page: https://developers.openai.com/codex/cli/
- OpenAI Responses API: https://platform.openai.com/docs/api-reference/responses
- Structured Outputs: https://platform.openai.com/docs/guides/structured-outputs
- Embeddings: https://platform.openai.com/docs/guides/embeddings
- sqlite-vec docs: https://alexgarcia.xyz/sqlite-vec/python.html • repo: https://github.com/asg017/sqlite-vec
- SQLite FTS5: https://www.sqlite.org/fts5.html
- Streamlit: https://docs.streamlit.io/develop/api-reference
- Typer: https://typer.tiangolo.com/
- RapidFuzz: https://rapidfuzz.github.io/RapidFuzz/
- WSL systemd: https://learn.microsoft.com/windows/wsl/systemd
- systemd timers: https://www.freedesktop.org/software/systemd/man/systemd.timer.html
Also see the project PRD: PRD.md and example prompts under examples/.
Released under the MIT License. See LICENSE.