feat: support of more file formats + fallbacks (#155)
This pull request introduces significant improvements to the document
extraction pipeline, enhances deployment configuration for caching and
permissions, and refines documentation to reflect these changes. The
main focus is on a more robust, layered fallback mechanism for file
extraction, expanded format support, and improved container
orchestration for model caches. Additionally, environment variables and
configuration maps have been streamlined for clarity and
maintainability.
**Document extraction pipeline improvements:**
* The extraction pipeline now orchestrates Docling, MarkItDown, and
custom extractors in a deterministic fallback chain, ensuring that if
one extractor fails, the next is tried automatically. The default order
is configurable, and the pipeline covers a broader range of formats
including Office docs, spreadsheets, Markdown/AsciiDoc, CSV, TXT, EPUB,
HTML/XML, and raster images.
[[1]](diffhunk://#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5L54-R54)
[[2]](diffhunk://#diff-9879d55539dbabcfd9190ec32b1828dfe5874d5e40d32816db8208de3aeeed1aL13-R95)
[[3]](diffhunk://#diff-9879d55539dbabcfd9190ec32b1828dfe5874d5e40d32816db8208de3aeeed1aL48-R130)
* The `README.md` and `libs/extractor-api-lib/README.md` have been
updated to document the new fallback logic, supported formats, and
configuration options. The documentation now includes detailed tables of
extractor priorities and extension mappings, as well as instructions for
customizing the pipeline.
[[1]](diffhunk://#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5L112-R114)
[[2]](diffhunk://#diff-9879d55539dbabcfd9190ec32b1828dfe5874d5e40d32816db8208de3aeeed1aL13-R95)
[[3]](diffhunk://#diff-9879d55539dbabcfd9190ec32b1828dfe5874d5e40d32816db8208de3aeeed1aL48-R130)
[[4]](diffhunk://#diff-9879d55539dbabcfd9190ec32b1828dfe5874d5e40d32816db8208de3aeeed1aL83-R172)
**Deployment and configuration enhancements:**
* Added support for HuggingFace and ModelScope model cache directories
in the extractor deployment, with corresponding environment variables
(`HF_HOME`, `HUGGINGFACE_HUB_CACHE`, `MODELSCOPE_HOME`,
`XDG_CACHE_HOME`) and volume mounts. These cache paths are now
configurable via `values.yaml`.
[[1]](diffhunk://#diff-673dd2d3d4e66a8fd4e45f9c1c9900711313f946bf8b6a89e96c954988fc14f3R404-R406)
[[2]](diffhunk://#diff-289e7e7aa5f8a10603dafc1c094fa3487201006a7d5429a0dd9c6c80b3426fcfR28-R63)
[[3]](diffhunk://#diff-289e7e7aa5f8a10603dafc1c094fa3487201006a7d5429a0dd9c6c80b3426fcfR80-R81)
[[4]](diffhunk://#diff-289e7e7aa5f8a10603dafc1c094fa3487201006a7d5429a0dd9c6c80b3426fcfL99-R128)
[[5]](diffhunk://#diff-3ab40efdb049da16ac327c9fbaf8ec1d25f26efbeded4e0c2cfd7f50b976d3ceR80-R87)
* Improved init container scripts for both admin-backend and extractor
deployments: added strict error handling (`set -euo pipefail`), ensured
cleanup of temporary files, and set correct permissions and ownership
for NLTK data and cache directories.
[[1]](diffhunk://#diff-2b6f7f2ec4938055207faa53acf7a300e0ec235db31d1cfb6896703b97292348R39-R49)
[[2]](diffhunk://#diff-289e7e7aa5f8a10603dafc1c094fa3487201006a7d5429a0dd9c6c80b3426fcfR28-R63)
**Configuration and environment variable cleanup:**
* Removed the now-obsolete `pdfextractor` configmap and related
environment variables, consolidating extractor configuration and
simplifying Helm templates.
[[1]](diffhunk://#diff-3ab40efdb049da16ac327c9fbaf8ec1d25f26efbeded4e0c2cfd7f50b976d3ceL55-L58)
[[2]](diffhunk://#diff-d72bec7914fc3e7d3fe01a8c0cbdb24832a26956bae5563d109bf8bb19955e0eL12-L20)
[[3]](diffhunk://#diff-673dd2d3d4e66a8fd4e45f9c1c9900711313f946bf8b6a89e96c954988fc14f3L467-L469)
[[4]](diffhunk://#diff-2b6f7f2ec4938055207faa53acf7a300e0ec235db31d1cfb6896703b97292348L111-L112)
* Updated Python version specification in `pyproject.toml` to use a
version range instead of a caret, and added a per-file ignore for
docstring warnings in `__init__.py`.
[[1]](diffhunk://#diff-dede389bcfb615c4b45cd1da7ac14cbe9535305f41f19cce09e321c91a8bb323R46)
[[2]](diffhunk://#diff-dede389bcfb615c4b45cd1da7ac14cbe9535305f41f19cce09e321c91a8bb323L79-R80)
---------
Co-authored-by: Andreas Klos <andreas.klos@stackit.cloud>
README.md (3 additions & 3 deletions)
@@ -51,7 +51,7 @@ Welcome to the STACKIT RAG Template! This is a basic example of how to use the R

 ## Features 🚀

-**Document Management**: Supports PDFs, DOCX, PPTX, XML, EPUB documents and websource via confluence as well as sitemaps.
+**Document Management**: Supports PDFs, Office docs (DOCX, PPTX), spreadsheets (XLSX), Markdown/AsciiDoc (MD, MDX, ADOC), EPUB/HTML/XML, CSV/TXT, and raster images, with automatic fallbacks between Docling, MarkItDown, and custom extractors; also handles Confluence spaces and sitemaps.

 **AI Integration**: Multiple LLM and embedder providers for flexibility.
@@ -109,9 +109,9 @@ All components are provided by the *admin-api-lib*. For further information on e

 #### 1.1.3 Document extractor

-The Document extractor is a component that is used to extract the content from the documents and confluence spaces.
+The Document extractor ingests uploaded files and remote sources (Confluence, sitemap) and now orchestrates multiple extractors with a deterministic fallback chain. Docling runs first for rich formats (PDF, Office, Markdown, HTML, images), MarkItDown provides lightweight markdown conversion, and specialised custom extractors (PDF, MS Office, XML, EPUB, Tesseract OCR) handle edge cases. The order and availability can be customised through the dependency-injector container.

-All components are provided by the *extractor-api-lib*. For further information on endpointsand requirements, please consult [the libs README](./libs/README.md#3-extractor-api-lib).
+All components are provided by the *extractor-api-lib*. For further information on endpoints, extractor ordering, supported formats, and configuration tips, please consult [the libs README](./libs/README.md#3-extractor-api-lib).
libs/extractor-api-lib/README.md (97 additions & 11 deletions)
@@ -10,11 +10,89 @@ Content ingestion layer for the STACKIT RAG template. This library exposes a Fas

 ## Feature highlights

-- **Broad format coverage** – PDFs, DOCX, PPTX, XML/EPUB, Confluence spaces, and sitemap-driven websites.
+- **Layered extraction pipeline** – Docling, MarkItDown, and the custom extractors now cooperate with a deterministic fallback chain, so a failed run automatically cascades to the next extractor.
+- **Expanded format coverage** – PDFs, Office documents, EPUB, XML, Markdown/AsciiDoc, CSV/TXT, raster images, Confluence spaces, and sitemap-driven websites.
 - **Consistent output schema** – Information pieces are returned in a unified structure with content type (`TEXT`, `TABLE`, `IMAGE`) and metadata.
 - **Swappable extractors** – Dependency-injector container makes it easy to add or replace file/source extractors, table converters, etc.
 - **Production-grade plumbing** – Built-in S3-compatible file service, LangChain loaders with retry/backoff, optional PDF OCR, and throttling controls for web crawls.

+## File extractor pipeline
+
+[`GeneralFileExtractor`](src/extractor_api_lib/impl/api_endpoints/general_file_extractor.py) orchestrates file parsing. It resolves the file type from the extension, filters the extractors that declare matching `compatible_file_types`, reverses that filtered list, and then executes the extractors in sequence until one returns content or all have failed. Exceptions are logged and the next extractor takes over; only if every extractor either returns no content or raises an exception do we bubble up an error.
+
+### Default execution order
+
+The dependency container wires extractors in the following list:
+
+1. `DoclingFileExtractor`
+2. `MarkitdownFileExtractor`
+3. `PDFExtractor`
+4. `EpubExtractor`
+5. `XMLExtractor`
+6. `MSDocsExtractor`
+7. `TesseractImageExtractor`
+
+Because the orchestrator reverses the candidate list before the fallback loop, the priority for overlapping formats is the reverse of this wiring. For example, PDFs run through Docling first, then fall back to MarkItDown, and finally to the custom PDF extractor; DOCX/PPTX files follow Docling → MarkItDown → MSDocs; raster images go through Docling’s OCR pipeline before falling back to the Tesseract-only extractor.
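
The filter, reverse, and fall-back behaviour described in this hunk can be sketched roughly as follows. This is an illustration only, not the code in `general_file_extractor.py`: `StubExtractor`, `extract_with_fallback`, and the three lambda-based extractors are hypothetical names, and the sketch assumes the wired list is stored lowest priority first, so that reversing it yields the Docling → MarkItDown → custom PDF order given in the examples.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StubExtractor:
    """Hypothetical stand-in for a file extractor; not the library's real class."""

    name: str
    compatible_file_types: set[str]
    extract: Callable[[str], list[str]]


def extract_with_fallback(
    path: str, file_type: str, wired: list[StubExtractor]
) -> tuple[str, list[str]]:
    """Try matching extractors last-wired-first until one returns content."""
    candidates = [e for e in wired if file_type in e.compatible_file_types]
    if not candidates:
        raise ValueError(f"No extractor found for file type {file_type!r}")
    errors: list[str] = []
    for extractor in reversed(candidates):
        try:
            pieces = extractor.extract(path)
        except Exception as exc:  # a failing extractor must not stop the chain
            errors.append(f"{extractor.name}: {exc}")
            continue
        if pieces:
            return extractor.name, pieces
        errors.append(f"{extractor.name}: no content")
    # Only reached when every candidate raised or returned nothing.
    raise RuntimeError("All extractors failed: " + "; ".join(errors))


def _docling_fails(path: str) -> list[str]:
    raise RuntimeError("model not available")


# Assumption: wired lowest priority first, so the reversal tries "docling" first.
wired = [
    StubExtractor("custom_pdf", {"pdf"}, lambda p: ["text from custom PDF extractor"]),
    StubExtractor("markitdown", {"pdf", "docx"}, lambda p: []),
    StubExtractor("docling", {"pdf", "docx"}, _docling_fails),
]

used, pieces = extract_with_fallback("report.pdf", "pdf", wired)
print(used, pieces)
```

In this run the simulated Docling extractor raises, the MarkItDown stand-in returns nothing, and the custom PDF stand-in finally supplies the content, so the chain ends on its third candidate.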
+
+### Supported formats
+
+| Format family | Extensions | Primary extractor | Fallbacks | Notes |
+| PDF | `.pdf` | Docling | MarkItDown → Custom PDF extractor | Docling performs OCR + table extraction; the PDF extractor keeps Camelot/pdfplumber heuristics as a last resort. |
+| Microsoft Word | `.docx` | Docling | MarkItDown → MSDocs | MSDocs keeps unstructured-based table conversion for custom cases. |
+| Microsoft PowerPoint | `.pptx` | Docling | MarkItDown → MSDocs | MarkItDown splits slides by `<!-- Slide number: N -->`. |
+| Microsoft Excel | `.xlsx` | Docling | — | Tables returned as markdown; Docling infers sheet structure. |
+Image coverage currently excludes animated GIF, WebP, HEIC, and SVG files. These extensions are ignored by the routing logic and will surface as “No extractor found” errors until an extractor declares support.
+
+### Source extractor pipeline
+
+`GeneralSourceExtractor` wires Confluence and sitemap loaders behind a similar abstraction. Unlike files, source extractors are keyed by `ExtractionParameters.source_type` and the matching extractor is called directly (no fallback chain).
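
For contrast with the file pipeline, a keyed lookup without fallback might look like the sketch below. The registry keys (`"confluence"`, `"sitemap"`), the loader functions, and the plain `dict` parameters are assumptions made for illustration; the real extractors take `ExtractionParameters` and need credentials and network access.

```python
from typing import Callable


# Hypothetical loaders standing in for the real Confluence and sitemap extractors.
def load_confluence(params: dict) -> list[str]:
    return [f"confluence page from space {params['space_key']}"]


def load_sitemap(params: dict) -> list[str]:
    return [f"web page listed in {params['url']}"]


# Assumed registry shape: one extractor per source type.
SOURCE_EXTRACTORS: dict[str, Callable[[dict], list[str]]] = {
    "confluence": load_confluence,
    "sitemap": load_sitemap,
}


def extract_from_source(source_type: str, params: dict) -> list[str]:
    """Call the single extractor registered for source_type; there is no fallback."""
    try:
        extractor = SOURCE_EXTRACTORS[source_type]
    except KeyError:
        raise ValueError(f"No extractor found for source type {source_type!r}") from None
    return extractor(params)


print(extract_from_source("sitemap", {"url": "https://example.com/sitemap.xml"}))
```

Because exactly one extractor is looked up per source type, any exception it raises reaches the caller directly instead of triggering a second attempt.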
+
+## Configuring extractor order
+
+The order lives in `DependencyContainer.file_extractors`. You can override it either by subclassing the container or by overriding the provider at runtime before wiring the FastAPI app. Example:
+
+`container.py`
+
+```python
+from dependency_injector.providers import List
+
+from extractor_api_lib.dependency_container import DependencyContainer
+The last provider in the list becomes the first extractor tried for a matching file type. Keep shared singleton providers (file service, converters) in the parent class to avoid double instantiation.
+
 ## Installation

 ```bash
@@ -45,11 +123,11 @@ Both endpoints stream their results back to `admin-api-lib`, which takes care of

 ## How the file extraction endpoint works

-1. Download the file from S3
-2. Chose suitable file extractor based on the filename ending
-3. Extract the content from the file
-4. Map the internal representation to the external schema
-5. Return the final output
+1. Download the file from S3.
+2. Derive the file type from the extension (normalizing common image/Markdown/AsciiDoc aliases).
+3. Select extractors that declare support for the resolved `FileType`.
+4. Run the extractors in priority order (highest priority first); stop at the first non-empty result or keep falling back if an extractor raises.
+5. Map the internal representation to the external schema and return the final output.
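
Step 2 of the new list, deriving a normalized file type from the extension, could be sketched like this. The alias table and the `resolve_file_type` helper are hypothetical; the library's actual `FileType` members and the aliases it normalizes may differ.

```python
from pathlib import Path

# Hypothetical alias table mapping alternative extensions to one canonical type.
ALIASES = {
    "jpeg": "jpg",
    "tif": "tiff",
    "markdown": "md",
    "asciidoc": "adoc",
}


def resolve_file_type(filename: str) -> str:
    """Return a lower-case, alias-normalized file type taken from the extension."""
    suffix = Path(filename).suffix.lower().lstrip(".")
    if not suffix:
        raise ValueError(f"Cannot derive a file type from {filename!r}")
    return ALIASES.get(suffix, suffix)


print(resolve_file_type("Scan.JPEG"))
print(resolve_file_type("notes.markdown"))
```

Normalizing before the lookup means each extractor only has to declare the canonical type once, rather than every spelling of an extension.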

 ## How the source extraction endpoint works

@@ -64,7 +142,6 @@ Both endpoints stream their results back to `admin-api-lib`, which takes care of
 Two `pydantic-settings` models ship with this package:

 - **S3 storage** (`S3Settings`) – configure the built-in file service with `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, `S3_ENDPOINT`, and `S3_BUCKET`.
-- **PDF extraction** (`PDFExtractorSettings`) – adjust footer trimming or diagram export via `PDF_EXTRACTOR_FOOTER_HEIGHT` and `PDF_EXTRACTOR_DIAGRAMS_FOLDER_NAME`.

 Other extractors accept their parameters at runtime through the request payload (`ExtractionParameters`). For example, the admin backend forwards Confluence credentials, sitemap URLs, or custom headers when it calls `/extract_from_source`. This keeps the library stateless and makes it easy to plug in additional sources without redeploying.
@@ -80,10 +157,19 @@ from extractor_api_lib.main import app as perfect_extractor_app

 ## Extending the library

-1. Implement `InformationFileExtractor` or `InformationExtractor` for your new format/source.
-2. Register it in `dependency_container.py` (append to `file_extractors` list or `source_extractors` dict).
-3. Update mapper or metadata handling if additional fields are required.
-4. Add unit tests under `libs/extractor-api-lib/tests` using fixtures and fake storage providers.