Commit 7bed810

Author: damon
feat: add --images flag to extract embedded images as base64 markdown
Extract images from .docx, .pptx, and .xlsx files and render them as reference-style base64 markdown (e.g. ![][image1] with definitions at document end), matching Google Docs' markdown export format. Supports JPEG, PNG, GIF, WebP, BMP, and SVG; skips unsupported formats like EMF/WMF. Opt-in via -i/--images since base64 bloats output. Bump version to 1.3.0.
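The reference-style output described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `mime_for_extension`, `base64_encode`, and `render_image_refs` are hypothetical names, the hand-written encoder stands in for the `base64` crate the commit adds, and the angle-bracket definition form is an assumption about the Google Docs export style. Note that the README text in this commit describes inline `![](data:...)` URIs instead; the sketch follows the commit message.

```rust
// Hypothetical sketch: none of these names come from the batdoc source.

/// Map a file extension to a MIME type for the formats the commit lists.
/// Returns None for anything else (e.g. EMF/WMF), which the caller skips.
fn mime_for_extension(ext: &str) -> Option<&'static str> {
    match ext.to_ascii_lowercase().as_str() {
        "jpg" | "jpeg" => Some("image/jpeg"),
        "png" => Some("image/png"),
        "gif" => Some("image/gif"),
        "webp" => Some("image/webp"),
        "bmp" => Some("image/bmp"),
        "svg" => Some("image/svg+xml"),
        _ => None,
    }
}

/// Standard padded base64, written by hand so the sketch has no
/// dependencies. The commit itself pulls in the `base64` crate for this.
fn base64_encode(data: &[u8]) -> String {
    const TABLE: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(TABLE[(n >> 18) as usize & 63] as char);
        out.push(TABLE[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { TABLE[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { TABLE[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Build (inline references, trailing definitions) for a list of
/// (file name, raw bytes) pairs given in document order.
fn render_image_refs(images: &[(&str, &[u8])]) -> (Vec<String>, Vec<String>) {
    let mut refs = Vec::new();
    let mut defs = Vec::new();
    for (name, bytes) in images {
        let ext = name.rsplit('.').next().unwrap_or("");
        let Some(mime) = mime_for_extension(ext) else {
            continue; // unsupported format: skip it
        };
        // Number only the images that are actually emitted.
        let id = format!("image{}", defs.len() + 1);
        refs.push(format!("![][{id}]"));
        defs.push(format!("[{id}]: <data:{mime};base64,{}>", base64_encode(bytes)));
    }
    (refs, defs)
}
```

In this sketch the references would be written where each image occurs in the text and the definitions appended at the end of the document, which keeps the long base64 payloads out of the readable body.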
1 parent 44f103a commit 7bed810

9 files changed

Lines changed: 844 additions & 48 deletions

Cargo.lock

Lines changed: 2 additions & 1 deletion

Cargo.toml

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "batdoc"
-version = "1.2.0"
+version = "1.3.0"
 edition = "2021"
 description = "cat(1) for doc, docx, xls, xlsx, pptx, and pdf -- renders to markdown with bat"
 license = "MIT"
@@ -15,6 +15,7 @@ name = "batdoc"
 path = "src/main.rs"
 
 [dependencies]
+base64 = "0.22"
 bat = { version = "0.26.1", default-features = false, features = ["regex-fancy", "paging"] }
 cfb = "0.13"
 encoding_rs = "0.8"

README.md

Lines changed: 17 additions & 2 deletions
@@ -83,12 +83,27 @@ cat FILE | batdoc [OPTIONS]
 
 -p, --plain plain text, no highlighting
 -m, --markdown force markdown (default on tty)
+-i, --images embed images as inline base64 data URIs
 -h, --help help
 ```
 
+`--images` extracts embedded images from `.docx`, `.pptx`, and `.xlsx`
+files and includes them as `![](data:image/...;base64,...)` in the
+markdown output. Most useful when piping to a file:
+
+```
+batdoc --images report.docx > report.md
+```
+
+The resulting markdown is self-contained — no external image files
+needed. JPEG, PNG, GIF, WebP, and BMP images are supported; vector
+formats (EMF/WMF) are silently skipped. Ignored in plain text mode
+and for formats without OOXML image support (`.doc`, `.xls`, `.pdf`).
+
 ## Known limitations
 
-- Text only — no images, charts, or embedded objects.
+- `--images` supports `.docx`/`.pptx`/`.xlsx` only. Legacy `.doc`/`.xls`
+images are in MSODRAW binary format and not extracted. No PDF images.
 - `.doc` heading/table detection is heuristic. It's good, not perfect.
 - Only BIFF8 (Excel 97+). Older BIFF5 `.xls` files won't parse.
 - No legacy `.ppt` support — only modern `.pptx`.
@@ -99,7 +114,7 @@ cat FILE | batdoc [OPTIONS]
 
 ## Dependencies
 
-Seven crates, no C, no system libs: `bat`, `cfb`, `encoding_rs`,
+Eight crates, no C, no system libs: `base64`, `bat`, `cfb`, `encoding_rs`,
 `pdf-extract`, `quick-xml`, `zip`, `is-terminal`.
 
 ## History