Skip to content

Convert Office Documents to Markdown

Every Office Oxide handle has a to_markdown() method that produces GitHub-flavored Markdown — headings, tables, lists, and code-like blocks — from any of the six supported formats. This is the entry point most LLM and RAG pipelines should use.

One-shot

Rust

use office_oxide::to_markdown;

let md = to_markdown("report.docx")?;
std::fs::write("report.md", md)?;

Python

import office_oxide

md = office_oxide.to_markdown("report.docx")
open("report.md", "w").write(md)

JavaScript

import { toMarkdown } from 'office-oxide';
import { writeFileSync } from 'node:fs';

writeFileSync('report.md', toMarkdown('report.docx'));

Go

md, err := officeoxide.ToMarkdown("report.docx")
os.WriteFile("report.md", []byte(md), 0o644)

C#

File.WriteAllText("report.md", OfficeOxide.ToMarkdown("report.docx"));

Reusable handle

Python

from office_oxide import Document

with Document.open("deck.pptx") as doc:
    md = doc.to_markdown()

Rust

let doc = office_oxide::Document::open("deck.pptx")?;
let md = doc.to_markdown();

JavaScript

using doc = Document.open('deck.pptx');
const md = doc.toMarkdown();

What gets emitted

Source element Markdown
DOCX heading (<w:pStyle w:val="Heading1"/> …) # Heading (level matches style)
DOCX paragraph One paragraph, soft hyphens stripped
DOCX list item - item or 1. item (numbering preserved)
DOCX table GFM pipe table
XLSX sheet ## Sheet name followed by a pipe table per range
XLSX merged cells First cell content, span dropped
PPTX slide ## Slide N + body, with notes appended as blockquotes
PPTX table GFM pipe table inline with the slide
Hyperlinks [text](url)
Images ![alt](filename) placeholder — see “Images” below

Images

to_markdown() emits image placeholders by name (e.g. ![](image1.png)) but does not extract image bytes — Markdown is a text format. To pull images, use the IR or format-specific access:

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()
    for section in ir["sections"]:
        for el in section["elements"]:
            if el["kind"] == "Image":
                print(el["filename"], len(el["data"]))

See IR extraction for the full schema.

Use cases

  • RAG ingestion — Markdown is the most LLM-friendly input format. One pass per document, deterministic structure, no HTML noise.
  • Document indexing — Headings give natural chunk boundaries; tables stay queryable.
  • Migrations — DOCX → Markdown for static-site generators (Hugo, Astro, MkDocs).
  • Diffing content — Markdown diffs are far more reviewable than .docx binary diffs.

Performance

to_markdown() runs at the same order of magnitude as plain_text() — typically 1–2× the cost on the median document. See Performance for full numbers.

See also