Commit 51c6e2f

Add a Docker Compose Document Engine example for large documents (#190)
1 parent bee24ab commit 51c6e2f

9 files changed

Lines changed: 604 additions & 1 deletion

README.md

Lines changed: 1 addition & 0 deletions
@@ -84,6 +84,7 @@ Clone and run these complete examples on your machine:
 
 ### Document Engine
 Server-side PDF processing and manipulation:
+- [Docker Compose Deployment](./document-engine/de-docker-compose)
 - [AWS ECS Deployment with Terraform](./document-engine/de-aws-ecs-terraform)
 - [Orbstack Kubernetes Deployment](./document-engine/de-orbstack-kubernetes)
 

document-engine/README.md

Lines changed: 3 additions & 1 deletion
@@ -4,9 +4,11 @@
 
 ## Available examples
 
+### [de-docker-compose](de-docker-compose)
+Demonstrates minimal installation of Document Engine on a local Docker Compose stack using PostgreSQL, MinIO, and Caddy.
+
 ### [de-aws-ecs-terraform](de-aws-ecs-terraform)
 Demonstrates minimal installation of Document Engine on AWS using Terraform and Elastic Container Service (ECS).
 
 ### [de-orbstack-kubernetes](de-orbstack-kubernetes)
 Demonstrates minimal installation of Document Engine on a local Kubernetes cluster using OrbStack.
-
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+# Reverse proxy for local big-file testing that exercises the shared rendering
+# process without requiring TLS.
+#
+# The shared rendering feature (HTTP2_SHARED_RENDERING_PROCESS_ENABLE) reuses a
+# single pspdfkitd worker checkout across multiple tile requests that arrive as
+# HTTP/2 streams on the same Cowboy connection. The worker is cached in the
+# Cowboy connection process's dictionary, so multiplexing matters: requests on
+# the same HTTP/2 connection share a worker, requests on separate connections
+# do not.
+#
+# TLS is not needed for this to work. Caddy's h2c reverse_proxy uses Go's
+# http.Transport, which maintains a pool of HTTP/2 connections to the upstream
+# and multiplexes requests as streams onto them — regardless of whether the
+# browser-to-Caddy leg is HTTP/1.1 or HTTP/2. Multiple browser HTTP/1.1
+# requests are funnelled onto a single h2c upstream connection, giving Cowboy
+# the HTTP/2 stream multiplexing it needs.
+#
+# Non-tile traffic (uploads, dashboard, API) stays on HTTP/1.1 upstream to
+# avoid Cowboy's HTTP/2 reset-rate guard on large request bodies.
+{
+    auto_https off
+}
+
+:5000 {
+    request_body {
+        max_size 50GB
+    }
+
+    # Match the tile-rendering route that needs an upstream HTTP/2 hop to exercise the shared rendering process code path.
+    @render_tiles path_regexp render_tiles ^/i/d/[^/]+/h/[^/]+/page-.*
+
+    handle @render_tiles {
+        reverse_proxy h2c://document-engine:5000
+    }
+
+    handle {
+        reverse_proxy http://document-engine:5000 {
+            transport http {
+                versions 1.1
+                response_header_timeout 15m
+            }
+        }
+    }
+}
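
Review note: the `@render_tiles` matcher decides which upstream hop a request takes. The sketch below reproduces that decision in the shell so sample paths can be checked without starting Caddy. It assumes the pattern behaves the same as a POSIX extended regex as it does in Go (it only uses an anchor, bracket expressions, `+`, and `.*`). `route_for` and the sample paths are hypothetical and not part of the commit.

```bash
# Pattern copied from the Caddyfile's @render_tiles matcher.
render_tiles_re='^/i/d/[^/]+/h/[^/]+/page-.*'

# route_for PATH: print which upstream protocol the Caddyfile would pick for PATH.
route_for() {
  if printf '%s\n' "$1" | grep -Eq "$render_tiles_re"; then
    echo "h2c"        # tile request: cleartext HTTP/2 to document-engine
  else
    echo "http/1.1"   # uploads, dashboard, API: pinned to HTTP/1.1
  fi
}

route_for "/i/d/doc123/h/layer456/page-0"   # made-up tile path
route_for "/api/documents"
route_for "/dashboard"
```

Only the first sample path matches the tile route; the other two fall through to the default `handle` block.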
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+The Nutrient Sample applications are licensed with a modified BSD
+license. In plain language: you're allowed to do whatever you wish
+with the code, modify, redistribute, embed in your products (free or
+commercial), but you must include copyright, terms of usage and
+disclaimer as stated in the license.
+
+You will require a commercial Nutrient License to run these examples
+in non-demo mode. Please refer to sales@nutrient.io for details.
+
+Copyright © 2017-present PSPDFKit GmbH d/b/a Nutrient.
+All rights reserved.
+
+Redistribution and use in source or binary forms,
+with or without modification, are permitted provided
+that the following conditions are met:
+
+- Redistributions of source code must retain the above copyright
+notice, this list of conditions and the following disclaimer.
+
+- Redistributions in binary form must reproduce the above copyright
+notice, this list of conditions and the following disclaimer in the
+documentation and/or other materials provided with the
+distribution.
+
+- Redistributions of Nutrient Samples must include attribution to
+Nutrient, either in documentation or other appropriate media.
+
+- Neither the name of the Nutrient, PSPDFKit GmbH, nor its developers
+may be used to endorse or promote products derived from
+this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
+# Document Engine example — deploying with Docker Compose and Caddy
+
+- [Prerequisites](#prerequisites)
+- [Getting started](#getting-started)
+- [Configuration](#configuration)
+- [API usage examples](#api-usage-examples)
+- [Cleanup](#cleanup)
+- [Support](#support)
+- [License](#license)
+- [Contributing](#contributing)
+
+> This example demonstrates a local [Nutrient Document Engine](https://www.nutrient.io/guides/document-engine/) deployment using Docker Compose, PostgreSQL, MinIO, and Caddy.
+
+This stack is intended for local development and manual testing of large uploads. It includes a `README`, a `setup.sh`, a dedicated configuration file, and a sample document for smoke tests.
+
+## Prerequisites
+
+- A Docker-compatible runtime with `docker compose` support
+- `curl`
+
+## Getting started
+
+1. Run the automated setup script:
+
+```bash
+./setup.sh
+```
+
+2. Wait for the script to report that the stack is healthy, then open:
+
+```text
+http://localhost:5000/dashboard
+```
+
+Default credentials:
+
+- Username: `admin`
+- Password: `admin`
+
+The setup script starts four services:
+
+- PostgreSQL for Document Engine metadata
+- MinIO as the local S3-compatible asset store
+- Document Engine itself
+- Caddy as the reverse proxy in front of Document Engine
+
+## Configuration
+
+The local profile lives in `document-engine.env.sh`. `setup.sh` sources that file before calling `docker compose`, so you can either:
+
+- Edit `document-engine.env.sh`
+- Override individual values in your shell before running `./setup.sh`
+
+Example:
+
+```bash
+export CADDY_HTTP_PORT=5050
+export DASHBOARD_PASSWORD=supersecret
+./setup.sh
+```
+
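
Review note: the README states that values exported in the shell take precedence over `document-engine.env.sh`, but that file is not shown in this view. One common way to get this behavior is the `${VAR:-default}` expansion. The sketch below assumes the env file is written that way; `load_defaults` is a hypothetical name, and the defaults are the port and credentials the README mentions.

```bash
# Hypothetical excerpt of an env file that keeps values already exported by the caller.
load_defaults() {
  export CADDY_HTTP_PORT="${CADDY_HTTP_PORT:-5000}"
  export DASHBOARD_USERNAME="${DASHBOARD_USERNAME:-admin}"
  export DASHBOARD_PASSWORD="${DASHBOARD_PASSWORD:-admin}"
}

load_defaults
echo "dashboard: http://localhost:${CADDY_HTTP_PORT}/dashboard (user: ${DASHBOARD_USERNAME})"
```

With this pattern, `export CADDY_HTTP_PORT=5050` before sourcing the file survives, while unset variables fall back to the defaults.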
+You can also choose the Document Engine image tag at invocation time:
+
+```bash
+./setup.sh
+./setup.sh nightly
+```
+
+The default tag is `latest`.
+
+The default profile is tuned for multi-GB uploads:
+
+- `MAX_UPLOAD_SIZE_BYTES=50000000000`
+- `SERVER_REQUEST_TIMEOUT=900000`
+- `PSPDFKIT_WORKER_TIMEOUT=900000`
+- `FILE_UPLOAD_TIMEOUT_MS=900000`
+- `ASSET_STORAGE_CACHE_SIZE=20000000000`
+
+To actually test large documents, you must provide a valid `ACTIVATION_KEY`. Without one, Document Engine runs with license-imposed trial limits and uploads are capped at `50 MB`, even though the local proxy and runtime configuration allow much larger request bodies.
+
+Caddy is configured to:
+
+- Allow request bodies up to `50GB`
+- Proxy `/i/d/.../h/.../page-*` tile-rendering requests upstream over HTTP/2
+- Keep all other upstream requests on HTTP/1.1 with `response_header_timeout 15m`
+
+The dashboard is exposed at plain `http://localhost:5000/dashboard`. TLS is not needed: Caddy's h2c reverse proxy uses Go's `http.Transport`, which multiplexes incoming HTTP/1.1 requests onto a single upstream HTTP/2 connection. This gives Document Engine the HTTP/2 stream multiplexing it needs to reuse a single rendering worker across concurrent tile requests.
+
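
Review note: the limits in this section use three notations (raw bytes, raw milliseconds, and Caddy's `50GB` and `15m` shorthands). The arithmetic below checks that they agree. It assumes the timeout values are in milliseconds, as the `_MS` suffix suggests, that the cache size is in bytes, and that Caddy reads `GB` as a decimal gigabyte (10^9 bytes).

```bash
# Values copied from the default profile listed in the README.
max_upload_bytes=50000000000
cache_bytes=20000000000
timeout_ms=900000

echo "MAX_UPLOAD_SIZE_BYTES    = $((max_upload_bytes / 1000000000)) GB"
echo "ASSET_STORAGE_CACHE_SIZE = $((cache_bytes / 1000000000)) GB"
echo "timeouts                 = $((timeout_ms / 60000)) min"
```

Under those assumptions the upload limit equals Caddy's `max_size 50GB` and the three timeouts equal its `response_header_timeout 15m`, so neither layer cuts a request off before the other.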
+## API usage examples
+
+Once the stack is running, you can use the [Document Engine API](https://www.nutrient.io/api/reference/document-engine/upstream/) directly through Caddy.
+
+### Upload a document
+
+```bash
+curl --request POST \
+  --url http://localhost:5000/api/documents \
+  --header 'Authorization: Token token=secret' \
+  --form 'pdf-file-from-multipart=@sample.pdf' \
+  --form 'instructions={"parts":[{"file":"pdf-file-from-multipart"}],"actions":[],"output":{"type":"pdf","metadata":{"title":"Test Document","author":"API User"}}}'
+```
+
+### List documents
+
+```bash
+curl --request GET \
+  --url http://localhost:5000/api/documents \
+  --header 'Authorization: Token token=secret'
+```
+
+### Get document information
+
+Replace `DOCUMENT_ID` with the actual document ID from the upload response:
+
+```bash
+curl --request GET \
+  --url http://localhost:5000/api/documents/DOCUMENT_ID/document_info \
+  --header 'Authorization: Token token=secret'
+```
+
+### Extract text
+
+```bash
+curl --request GET \
+  --url http://localhost:5000/api/documents/DOCUMENT_ID/pages/text \
+  --header 'Authorization: Token token=secret'
+```
+
+### Download PDF
+
+```bash
+curl --request GET \
+  --url http://localhost:5000/api/documents/DOCUMENT_ID/pdf \
+  --header 'Authorization: Token token=secret' \
+  --output downloaded-document.pdf
+```
+
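
Review note: the five examples repeat the same base URL and `Authorization` header. A small wrapper can keep them in one place. This is a sketch only: `de_url`, `de_get`, `DE_BASE_URL`, and `DE_API_TOKEN` are hypothetical names, and the defaults are the URL and token used in the README examples.

```bash
DE_BASE_URL="${DE_BASE_URL:-http://localhost:5000}"
DE_API_TOKEN="${DE_API_TOKEN:-secret}"

# de_url [SUBPATH]: build an /api/documents URL, appending SUBPATH when given.
de_url() {
  printf '%s/api/documents%s\n' "$DE_BASE_URL" "${1:+/$1}"
}

# de_get [SUBPATH]: GET the endpoint with the auth header.
# --fail makes curl exit non-zero on HTTP error statuses.
de_get() {
  curl --fail --silent --show-error \
    --header "Authorization: Token token=${DE_API_TOKEN}" \
    "$(de_url "${1:-}")"
}
```

With these defined, `de_get` lists documents and `de_get "DOCUMENT_ID/document_info"` fetches the information for one document.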
+## Cleanup
+
+Stop the stack:
+
+```bash
+source document-engine.env.sh
+docker compose down
+```
+
+Remove the stack and local volumes:
+
+```bash
+source document-engine.env.sh
+docker compose down -v
+```
+
+## Support
+
+Nutrient offers support for customers with an active SDK license via [Nutrient Support](https://www.nutrient.io/support/request/).
+
+Are you [evaluating our SDK](https://www.nutrient.io/sdk/try)? That's great, we're happy to help out. To make sure this is fast, please use a work email and have someone from your company fill out our [sales form](https://www.nutrient.io/contact-sales/).
+
+## License
+
+This project is licensed under the BSD license. See the LICENSE file for more details.
+
+## Contributing
+
+Please ensure you have signed our CLA so that we can accept your contributions.
Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
+services:
+  db:
+    image: ${POSTGRES_IMAGE}
+    environment:
+      POSTGRES_USER: ${POSTGRES_USER}
+      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
+      POSTGRES_DB: ${POSTGRES_DB}
+      POSTGRES_INITDB_ARGS: --data-checksums
+      PGDATA: /var/lib/postgresql/data/pgdata
+    healthcheck:
+      test: ["CMD-SHELL", "pg_isready -U $$POSTGRES_USER -d $$POSTGRES_DB"]
+      interval: 5s
+      timeout: 5s
+      retries: 30
+      start_period: 5s
+    restart: unless-stopped
+    volumes:
+      - pgdata:/var/lib/postgresql/data
+
+  minio:
+    image: ${MINIO_IMAGE}
+    command: server /data --console-address ":9001"
+    environment:
+      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
+      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
+    ports:
+      - "${MINIO_CONSOLE_PORT}:9001"
+    restart: unless-stopped
+    volumes:
+      - minio-data:/data
+
+  minio-setup:
+    image: ${MINIO_MC_IMAGE}
+    depends_on:
+      minio:
+        condition: service_started
+    entrypoint:
+      - /bin/sh
+    command:
+      - -ec
+      - |
+        until mc alias set local http://minio:9000 "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"; do
+          sleep 2
+        done
+        mc mb --ignore-existing local/"$ASSET_STORAGE_S3_BUCKET"
+    environment:
+      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
+      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
+      ASSET_STORAGE_S3_BUCKET: ${ASSET_STORAGE_S3_BUCKET}
+    restart: "no"
+
+  document-engine:
+    image: ${DOCUMENT_ENGINE_IMAGE}
+    depends_on:
+      db:
+        condition: service_healthy
+      minio-setup:
+        condition: service_completed_successfully
+    environment:
+      PGUSER: ${POSTGRES_USER}
+      PGPASSWORD: ${POSTGRES_PASSWORD}
+      PGDATABASE: ${POSTGRES_DB}
+      PGHOST: db
+      PGPORT: 5432
+      API_AUTH_TOKEN: ${API_AUTH_TOKEN}
+      SECRET_KEY_BASE: ${SECRET_KEY_BASE}
+      DASHBOARD_USERNAME: ${DASHBOARD_USERNAME}
+      DASHBOARD_PASSWORD: ${DASHBOARD_PASSWORD}
+      ACTIVATION_KEY: ${ACTIVATION_KEY}
+      ASSET_STORAGE_BACKEND: ${ASSET_STORAGE_BACKEND}
+      ASSET_STORAGE_S3_BUCKET: ${ASSET_STORAGE_S3_BUCKET}
+      ASSET_STORAGE_S3_ACCESS_KEY_ID: ${ASSET_STORAGE_S3_ACCESS_KEY_ID}
+      ASSET_STORAGE_S3_SECRET_ACCESS_KEY: ${ASSET_STORAGE_S3_SECRET_ACCESS_KEY}
+      ASSET_STORAGE_S3_HOST: ${ASSET_STORAGE_S3_HOST}
+      ASSET_STORAGE_S3_PORT: ${ASSET_STORAGE_S3_PORT}
+      ASSET_STORAGE_S3_SCHEME: ${ASSET_STORAGE_S3_SCHEME}
+      ASSET_STORAGE_S3_REGION: ${ASSET_STORAGE_S3_REGION}
+      HTTP2_SHARED_RENDERING_PROCESS_ENABLE: ${HTTP2_SHARED_RENDERING_PROCESS_ENABLE}
+      HTTP2_SHARED_RENDERING_PROCESS_CHECKIN_TIMEOUT: ${HTTP2_SHARED_RENDERING_PROCESS_CHECKIN_TIMEOUT}
+      HTTP2_SHARED_RENDERING_PROCESS_CHECKOUT_TIMEOUT: ${HTTP2_SHARED_RENDERING_PROCESS_CHECKOUT_TIMEOUT}
+      MAX_UPLOAD_SIZE_BYTES: ${MAX_UPLOAD_SIZE_BYTES}
+      SERVER_REQUEST_TIMEOUT: ${SERVER_REQUEST_TIMEOUT}
+      PSPDFKIT_WORKER_TIMEOUT: ${PSPDFKIT_WORKER_TIMEOUT}
+      FILE_UPLOAD_TIMEOUT_MS: ${FILE_UPLOAD_TIMEOUT_MS}
+      READ_ANNOTATION_BATCH_TIMEOUT: ${READ_ANNOTATION_BATCH_TIMEOUT}
+      ASSET_STORAGE_CACHE_SIZE: ${ASSET_STORAGE_CACHE_SIZE}
+      ASSET_STORAGE_CACHE_TIMEOUT: ${ASSET_STORAGE_CACHE_TIMEOUT}
+    healthcheck:
+      test: ["CMD-SHELL", "curl -f http://localhost:5000/healthcheck || exit 1"]
+      interval: 2s
+      timeout: 5s
+      retries: 30
+      start_period: 30s
+    restart: unless-stopped
+
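
Review note: the `document-engine` health check polls `/healthcheck` every 2 seconds with up to 30 retries. A script on the host that needs the same guarantee can run an equivalent loop. The sketch below is a generic retry helper; `wait_for` is a hypothetical name and is not taken from `setup.sh`.

```bash
# wait_for RETRIES INTERVAL COMMAND...
# Run COMMAND until it succeeds; give up after RETRIES failed attempts.
# Uses plain (global) shell variables so it also runs under POSIX sh.
wait_for() {
  retries=$1
  interval=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}
```

For example, `wait_for 30 2 curl -fsS -o /dev/null http://localhost:5000/healthcheck` mirrors the retry count and interval above, assuming the endpoint is also reachable through Caddy on the published port.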
+  caddy:
+    image: ${CADDY_IMAGE}
+    depends_on:
+      document-engine:
+        condition: service_healthy
+    ports:
+      - "${CADDY_HTTP_PORT}:5000"
+    restart: unless-stopped
+    volumes:
+      - ./Caddyfile:/etc/caddy/Caddyfile:ro
+
+volumes:
+  pgdata:
+  minio-data:
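
Review note: every `${...}` reference in this file is resolved from the environment and none has a default, so a variable that was never exported is substituted as an empty string (Compose warns but continues). A pre-flight check can catch that before `docker compose up`. The sketch below assumes the variables are exported by `document-engine.env.sh`; `require_vars` is a hypothetical helper.

```bash
# require_vars NAME...
# Report each NAME that is unset or empty in the exported environment.
# Returns non-zero if at least one is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as `require_vars POSTGRES_IMAGE DOCUMENT_ENGINE_IMAGE API_AUTH_TOKEN SECRET_KEY_BASE && docker compose up -d` would stop early on an incomplete profile. `ACTIVATION_KEY` is left out of that list on purpose, since the README describes running without one in trial mode.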
