Convert Office Documents to Markdown
Every Office Oxide handle has a to_markdown() method that produces GitHub-flavored Markdown — headings, tables, lists, and code-like blocks — from any of the six supported formats. This is the entry point most LLM and RAG pipelines should use.
One-shot
Rust
use office_oxide::to_markdown;
let md = to_markdown("report.docx")?;
std::fs::write("report.md", md)?;
Python
import office_oxide
md = office_oxide.to_markdown("report.docx")
open("report.md", "w").write(md)
JavaScript
import { toMarkdown } from 'office-oxide';
import { writeFileSync } from 'node:fs';
writeFileSync('report.md', toMarkdown('report.docx'));
Go
md, err := officeoxide.ToMarkdown("report.docx")
os.WriteFile("report.md", []byte(md), 0o644)
C#
File.WriteAllText("report.md", OfficeOxide.ToMarkdown("report.docx"));
Reusable handle
Python
from office_oxide import Document
with Document.open("deck.pptx") as doc:
md = doc.to_markdown()
Rust
let doc = office_oxide::Document::open("deck.pptx")?;
let md = doc.to_markdown();
JavaScript
using doc = Document.open('deck.pptx');
const md = doc.toMarkdown();
What gets emitted
| Source element | Markdown |
|---|---|
DOCX heading (<w:pStyle w:val="Heading1"/> …) |
# Heading (level matches style) |
| DOCX paragraph | One paragraph, soft hyphens stripped |
| DOCX list item | - item or 1. item (numbering preserved) |
| DOCX table | GFM pipe table |
| XLSX sheet | ## Sheet name followed by a pipe table per range |
| XLSX merged cells | First cell content, span dropped |
| PPTX slide | ## Slide N + body, with notes appended as blockquotes |
| PPTX table | GFM pipe table inline with the slide |
| Hyperlinks | [text](url) |
| Images |  placeholder — see “Images” below |
Images
to_markdown() emits image placeholders by name (e.g. ) but does not extract image bytes — Markdown is a text format. To pull images, use the IR or format-specific access:
from office_oxide import Document
with Document.open("report.docx") as doc:
ir = doc.to_ir()
for section in ir["sections"]:
for el in section["elements"]:
if el["kind"] == "Image":
print(el["filename"], len(el["data"]))
See IR extraction for the full schema.
Use cases
- RAG ingestion — Markdown is the most LLM-friendly input format. One pass per document, deterministic structure, no HTML noise.
- Document indexing — Headings give natural chunk boundaries; tables stay queryable.
- Migrations — DOCX → Markdown for static-site generators (Hugo, Astro, MkDocs).
- Diffing content — Markdown diffs are far more reviewable than
.docxbinary diffs.
Performance
to_markdown() runs at the same order of magnitude as plain_text() — typically 1–2× the cost on the median document. See Performance for full numbers.
See also
- HTML extraction — when you need styled output instead
- IR extraction — structured JSON for richer pipelines
- PDF for RAG — companion library
pdf_oxidecovers PDFs