DocsFlow API

DocsFlow API is a FastAPI-based backend service for uploading, storing and processing documents.

Current MVP scope includes user authentication, document upload, local file storage, asynchronous processing with Celery, processing job tracking, and local text extraction for supported document types.

Tech Stack

Category	Technologies
Language	Python 3.12
Framework	FastAPI
Database	PostgreSQL, SQLAlchemy 2, Alembic
Queue	Celery, Redis
PDF Extraction	PyMuPDF
OCR	Tesseract
Testing	Pytest
Infrastructure	Docker Compose v2

Current Features

Authentication

User registration with email and password
Password hashing with bcrypt
JWT access token authentication
Current user endpoint

Document Upload

Supported file types:

PDF: application/pdf
JPEG: image/jpeg
PNG: image/png

Upload validation includes:

MIME type validation
File signature validation
Empty file rejection
Maximum file size limit
Per-user upload rate limiting
Duplicate document detection by SHA-256 checksum

Processing Modes

Each uploaded document has one processing mode:

Mode	Description
`standard`	Default processing
`confidential`	Guarantees no external AI or third-party API is used

In the current MVP, both modes use local processing only.

Asynchronous Processing

After upload, the API creates a ProcessingJob and sends it to Celery.

Processing job statuses: pending / running / completed / failed

Document statuses: uploaded / processing / completed / failed

Processing includes:

Retry support
Retry delay
Soft and hard task time limits
Error message persistence
Manual reprocess endpoint for failed documents

Local Text Extraction

File type	Method
PDF	Text extracted from PDF text layer via PyMuPDF
JPG / PNG	Text extracted locally via Tesseract OCR

Extracted text is stored in documents.raw_text.

Limitation: scanned PDFs without a text layer are not OCR-processed yet. OCR fallback for scanned PDF pages should be implemented as a separate improvement.

Project Structure

app/
  api/
    v1/
      routes_auth.py
      routes_documents.py
      routes_users.py
  core/
    config.py
  db/
    session.py
    base.py
  models/
    user.py
    document.py
    processing_job.py
  schemas/
  services/
    uploads.py
    storage.py
    processing_jobs.py
    text_extraction.py
  tasks/
    documents.py
  worker.py
  main.py

alembic/
tests/
docker-compose.yml
Dockerfile

Local Setup

Create a local environment file:

cp .env.example .env

Start the stack:

docker compose up --build

Resource	URL
API	http://localhost:8000
Interactive docs	http://localhost:8000/docs

Health check:

curl http://localhost:8000/health
# {"status":"ok"}

Database Migrations

Automatic Migrations on Startup

The API container runs database migrations automatically on startup.

On every API start, the entrypoint script executes:

alembic upgrade head

If the database is empty, all migrations are applied before the FastAPI server starts. If the database is already up to date, Alembic exits without changes.

The migration startup script is located at:

scripts/start-api.sh

The API service uses this script as its startup command:

command: sh scripts/start-api.sh

Only the API container runs migrations. The Celery worker does not run migrations to avoid concurrent migration execution.

Manual Commands

# Check current migration
docker compose exec api alembic current

# Inspect tables
docker compose exec db psql -U docsflow -d docsflow -c "\dt"

For a clean local start:

docker compose down -v
docker compose up --build

After startup, expected tables:

users
documents
processing_jobs
openai_usage_logs
alembic_version

Authentication Flow

Register a user:

curl -X POST http://localhost:8000/api/v1/users/register \
  -H "Content-Type: application/json" \
  -d '{
    "email": "test@example.com",
    "password": "strong-password"
  }'

Login and store the access token:

TOKEN=$(curl -s -X POST http://localhost:8000/api/v1/auth/login \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "username=test@example.com" \
  -d "password=strong-password" | jq -r ".access_token")

Check current user:

curl http://localhost:8000/api/v1/users/me \
  -H "Authorization: Bearer $TOKEN"

Upload a Document

Standard mode:

curl -X POST http://localhost:8000/api/v1/documents/upload \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@/path/to/document.pdf;type=application/pdf" \
  -F "confidential=false"

Confidential mode:

curl -X POST http://localhost:8000/api/v1/documents/upload \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@/path/to/document.pdf;type=application/pdf" \
  -F "confidential=true"

Example response:

{
  "id": 1,
  "original_filename": "document.pdf",
  "status": "uploaded",
  "processing_mode": "standard",
  "content_type": "application/pdf",
  "file_size_bytes": 24943,
  "checksum_sha256": "...",
  "created_at": "...",
  "updated_at": "..."
}

Documents API

# List documents owned by the current user
curl http://localhost:8000/api/v1/documents \
  -H "Authorization: Bearer $TOKEN"

# Get one document
curl http://localhost:8000/api/v1/documents/1 \
  -H "Authorization: Bearer $TOKEN"

# List processing jobs for a document
curl http://localhost:8000/api/v1/documents/1/jobs \
  -H "Authorization: Bearer $TOKEN"

# Reprocess a failed document
curl -X POST http://localhost:8000/api/v1/documents/1/reprocess \
  -H "Authorization: Bearer $TOKEN"

Reprocessing is allowed only for documents with status failed.

Check Extracted Text

The public document response does not expose raw_text.

For local development, inspect extracted text directly in PostgreSQL:

# Preview raw text
docker compose exec db psql -U docsflow -d docsflow \
  -P pager=off \
  -c "SELECT id, status, processing_mode, left(raw_text, 700) AS raw_text_preview FROM documents ORDER BY id DESC LIMIT 5;"

# Check text length
docker compose exec db psql -U docsflow -d docsflow \
  -P pager=off \
  -c "SELECT id, original_filename, status, processing_mode, length(raw_text) AS raw_text_length FROM documents ORDER BY id DESC LIMIT 5;"

Testing

# Run the full test suite
docker compose run --rm api pytest

# Run specific test groups
docker compose run --rm api pytest tests/test_documents_upload.py
docker compose run --rm api pytest tests/test_processing_jobs.py
docker compose run --rm api pytest tests/test_text_extraction.py

Useful Development Commands

# View API logs
docker compose logs -f api

# View Celery worker logs
docker compose logs -f celery_worker

# Open a database shell
docker compose exec db psql -U docsflow -d docsflow

# Stop the stack
docker compose down

# Stop the stack and remove volumes
docker compose down -v

Environment Variables

Main settings are configured through .env.

Variable	Default	Description
`APP_ENV`	`local`	Environment name
`APP_DEBUG`	`true`	Debug mode
`APP_SECRET_KEY`	—	Change in production
`ACCESS_TOKEN_EXPIRE_MINUTES`	`30`	JWT expiry
`DATABASE_URL`	`postgresql+psycopg://...`	PostgreSQL connection
`CORS_ORIGINS`	`["http://localhost:8000"]`	Allowed origins
`UPLOAD_MAX_FILE_SIZE_MB`	`10`	Max upload size
`UPLOAD_RATE_LIMIT_REQUESTS`	`10`	Rate limit count
`UPLOAD_RATE_LIMIT_WINDOW_SECONDS`	`60`	Rate limit window
`LOCAL_STORAGE_PATH`	`storage`	Local file storage path
`LOCAL_OCR_LANGUAGES`	`eng+deu`	Tesseract languages
`CELERY_BROKER_URL`	`redis://redis:6379/0`	Celery broker
`CELERY_RESULT_BACKEND`	`redis://redis:6379/1`	Celery backend
`CELERY_TASK_ALWAYS_EAGER`	`false`	Run tasks synchronously
`DOCUMENT_PROCESSING_SOFT_TIME_LIMIT_SECONDS`	`60`	Soft task limit
`DOCUMENT_PROCESSING_HARD_TIME_LIMIT_SECONDS`	`90`	Hard task limit
`DOCUMENT_PROCESSING_MAX_RETRIES`	`3`	Max retry attempts
`DOCUMENT_PROCESSING_RETRY_DELAY_SECONDS`	`10`	Delay between retries

AI Extraction for Standard Documents

DocsFlow supports AI-based structured extraction for documents uploaded in standard mode.

Processing flow for standard documents:

upload
→ local text extraction
→ AI document classification
→ AI structured JSON extraction
→ Pydantic validation
→ save extracted JSON
→ save OpenAI usage log
→ mark document as completed

AI processing is executed only after local text extraction has completed successfully.

Supported AI Processing Scope

Document type classification
Structured data extraction to JSON
Pydantic validation of the AI response
OpenAI token usage logging

Storage fields:

Field	Table
Extracted JSON	`documents.ai_extracted_data`
Document type	`documents.document_type`
Model used	`documents.ai_extraction_model`
Completion timestamp	`documents.ai_extraction_completed_at`
Token usage	`openai_usage_logs`

Processing Modes

AI extraction is only available for documents uploaded with confidential=false.

Documents uploaded with confidential=true are processed locally only and never sent to OpenAI. This is an intentional security boundary.

Required Environment Variables

OPENAI_API_KEY=your-openai-api-key
OPENAI_MODEL=gpt-4o-mini
OPENAI_REQUEST_TIMEOUT_SECONDS=45
OPENAI_MAX_INPUT_CHARS=12000

If OPENAI_API_KEY is missing, standard document processing will fail during the AI extraction step. Confidential documents do not require an OpenAI API key.

Example: Upload a Standard Document

curl -X POST http://localhost:8000/api/v1/documents/upload \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@/path/to/document.pdf;type=application/pdf" \
  -F "confidential=false"

After processing, the document response includes AI extraction fields:

{
  "id": 1,
  "status": "completed",
  "processing_mode": "standard",
  "document_type": "invoice",
  "ai_extracted_data": {
    "document_type": "invoice",
    "summary": "Invoice for services.",
    "sender": "Example Company",
    "recipient": "Customer Name",
    "document_date": "2026-01-15",
    "due_date": "2026-02-15",
    "total_amount": 950.0,
    "currency": "EUR",
    "invoice_number": "INV-001",
    "reference_number": null,
    "requires_action": true,
    "action_deadline": "2026-02-15",
    "confidence_score": 0.95,
    "notes": null
  },
  "ai_extraction_model": "gpt-4o-mini",
  "ai_extraction_completed_at": "2026-06-03T17:47:42.746233Z"
}

Check OpenAI Usage Logs

docker compose exec db psql -U docsflow -d docsflow \
  -P pager=off \
  -c "SELECT document_id, operation, model, input_tokens, output_tokens, total_tokens FROM openai_usage_logs ORDER BY id DESC LIMIT 5;"

Example result:

 document_id |       operation        |    model    | input_tokens | output_tokens | total_tokens
-------------+------------------------+-------------+--------------+---------------+--------------
           1 | document_ai_extraction | gpt-4o-mini |          844 |           100 |          944

Check Extracted AI Data

docker compose exec db psql -U docsflow -d docsflow \
  -P pager=off \
  -c "SELECT id, document_type, ai_extracted_data FROM documents ORDER BY id DESC LIMIT 1;"

Current Limitations

raw_text is stored internally but not exposed through the public API
Scanned PDF OCR fallback is not implemented yet
Uploaded files are stored on the local filesystem
Upload rate limiting is in-memory and not shared between multiple API instances
AI extraction is available only for standard mode — confidential documents are never sent to OpenAI
AI extraction depends on successful local text extraction
Scanned PDFs without a text layer require OCR fallback before AI extraction can work well
Extracted JSON schema is generic and will be refined in later MVP steps
Semantic search and document Q&A are planned for later MVP stages
standard-mode AI extraction requires OPENAI_API_KEY
scanned PDF files without a text layer may produce empty raw_text until OCR fallback is implemented

Contacts

Author: Maksym Petrykin

Email: m.petrykin@gmx.de

Telegram: @max_p95

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
alembic		alembic
app		app
data		data
scripts		scripts
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
alembic.ini		alembic.ini
dev_checklist.md		dev_checklist.md
docker-compose.yml		docker-compose.yml
notes.md		notes.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DocsFlow API

Tech Stack

Current Features

Authentication

Document Upload

Processing Modes

Asynchronous Processing

Local Text Extraction

Project Structure

Local Setup

Database Migrations

Automatic Migrations on Startup

Manual Commands

Authentication Flow

Upload a Document

Documents API

Check Extracted Text

Testing

Useful Development Commands

Environment Variables

AI Extraction for Standard Documents

Supported AI Processing Scope

Processing Modes

Required Environment Variables

Example: Upload a Standard Document

Check OpenAI Usage Logs

Check Extracted AI Data

Current Limitations

Contacts

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DocsFlow API

Tech Stack

Current Features

Authentication

Document Upload

Processing Modes

Asynchronous Processing

Local Text Extraction

Project Structure

Local Setup

Database Migrations

Automatic Migrations on Startup

Manual Commands

Authentication Flow

Upload a Document

Documents API

Check Extracted Text

Testing

Useful Development Commands

Environment Variables

AI Extraction for Standard Documents

Supported AI Processing Scope

Processing Modes

Required Environment Variables

Example: Upload a Standard Document

Check OpenAI Usage Logs

Check Extracted AI Data

Current Limitations

Contacts

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages