
Migrate from python-docx

Office Oxide replaces python-docx for its most common use cases (text extraction, paragraph iteration, table reading, find-and-replace) at roughly 14× the speed, with a pass rate 3.8 percentage points higher on a 2,538-file DOCX corpus. As a bonus, you no longer need separate libraries for .xlsx (openpyxl), .pptx (python-pptx), and legacy .doc (catdoc / antiword): one pip install covers all six formats.

When to migrate

Switch if you do any of these:

  • Extract text or Markdown from .docx for ingestion / RAG / search
  • Run find-and-replace templating across thousands of docs
  • Read tables out of contracts or reports
  • Want to also process .xlsx, .pptx, or legacy formats without adding more dependencies

Stay on python-docx if you do either of the following and aren’t ready to drop into the format-specific Rust API:

  • Build complex DOCX from scratch with custom styles and themes
  • Need python-docx extension libraries (e.g. docxcompose, python-docx-ng)

Install

pip uninstall python-docx
pip install office-oxide

The PyPI distribution is office-oxide (hyphen); the import is office_oxide (underscore).
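
Both libraries use a distribution name that differs from the import name (python-docx imports as docx), so a quick environment check can confirm the swap took effect. The following is a minimal sketch using only the standard library; the `describe` helper is a hypothetical name for illustration and is not part of either package.

```python
from importlib import metadata, util


def describe(dist_name, module_name):
    """Return (installed_version_or_None, module_is_importable)."""
    try:
        # Looks up the installed *distribution* (the name used with pip).
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    # Checks whether the *import* name resolves, without importing it.
    importable = util.find_spec(module_name) is not None
    return version, importable


if __name__ == "__main__":
    pairs = [("office-oxide", "office_oxide"), ("python-docx", "docx")]
    for dist, module in pairs:
        version, importable = describe(dist, module)
        print(f"{dist}: version={version}, import {module} -> {importable}")
```

After a successful migration you would expect a version for office-oxide and `None` for python-docx.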

Side-by-side cheat sheet

Plain text

python-docx

from docx import Document

doc = Document("report.docx")
text = "\n".join(p.text for p in doc.paragraphs)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    text = doc.plain_text()

One method call; the result includes headers and footers, and extraction is ~14× faster.

Markdown / HTML

python-docx: no built-in Markdown / HTML; you’d reach for pandoc or mammoth, or hand-roll a converter.

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    md   = doc.to_markdown()
    html = doc.to_html()

Iterate paragraphs

python-docx

from docx import Document

doc = Document("report.docx")
for p in doc.paragraphs:
    print(p.style.name, p.text)

office_oxide (via the IR)

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Heading":
            print(f"H{el['level']}", el["text"])
        elif el["kind"] == "Paragraph":
            # Runs are adjacent slices of the paragraph text, so join without a separator.
            print("P", "".join(r["text"] for r in el["runs"]))
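
If you are porting code that expects a flat sequence of paragraphs, a small generator over the IR can stand in for `doc.paragraphs`. This is a sketch that assumes the IR has exactly the shape used in the snippet above (`sections` → `elements` with `kind`, `level`, `text`, `runs`); `iter_blocks` and `sample_ir` are hypothetical names, and the sample is hand-written rather than produced from a real file.

```python
def iter_blocks(ir):
    """Yield (label, text) pairs for headings and paragraphs in document order."""
    for section in ir.get("sections", []):
        for el in section.get("elements", []):
            kind = el.get("kind")
            if kind == "Heading":
                yield f"H{el['level']}", el["text"]
            elif kind == "Paragraph":
                # Assumption: runs are contiguous slices of the paragraph text,
                # so they are concatenated without a separator.
                yield "P", "".join(run["text"] for run in el.get("runs", []))
            # Other element kinds (e.g. tables) are skipped here.


# Hand-written sample in the assumed IR shape.
sample_ir = {
    "sections": [
        {
            "elements": [
                {"kind": "Heading", "level": 1, "text": "Q4 Report"},
                {"kind": "Paragraph",
                 "runs": [{"text": "Revenue grew "}, {"text": "18%."}]},
                {"kind": "Table", "rows": [["a", "b"]]},
            ]
        }
    ]
}

blocks = list(iter_blocks(sample_ir))
for label, text in blocks:
    print(label, text)
```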

Iterate tables

python-docx

from docx import Document

doc = Document("report.docx")
for table in doc.tables:
    for row in table.rows:
        cells = [cell.text for cell in row.cells]
        print(cells)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Table":
            for row in el["rows"]:
                print(row)
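
A common next step is exporting each table for downstream processing. The sketch below collects tables from an IR-shaped dictionary and serializes one to CSV with the standard library. It assumes each row is a list of cell strings, which the snippet above does not confirm; `tables_from_ir`, `table_to_csv`, and the sample data are hypothetical.

```python
import csv
import io


def tables_from_ir(ir):
    """Return every table in document order, each as a list of rows."""
    tables = []
    for section in ir.get("sections", []):
        for el in section.get("elements", []):
            if el.get("kind") == "Table":
                tables.append(el["rows"])
    return tables


def table_to_csv(rows):
    """Serialize one table to CSV text (assumes each row is a list of strings)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hand-written sample in the assumed IR shape.
sample_table_ir = {
    "sections": [
        {
            "elements": [
                {"kind": "Paragraph", "runs": [{"text": "Fees"}]},
                {"kind": "Table",
                 "rows": [["Item", "Amount"], ["Licenses", "1,200"]]},
            ]
        }
    ]
}

tables = tables_from_ir(sample_table_ir)
csv_text = table_to_csv(tables[0])
print(csv_text)
```

If rows turn out to be richer objects (for example, cells with their own runs), only `table_to_csv` needs to change.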

Find and replace (templating)

python-docx: no first-class API. The common pattern is to walk every run and rewrite its text, which misses any match that spans more than one run. Many users pull in docx-mailmerge or write fragile regex code.

office_oxide

from office_oxide import EditableDocument

with EditableDocument.open("template.docx") as ed:
    n = ed.replace_text("{{client_name}}", "Acme Corp")
    print(f"{n} replacements")
    ed.save("filled.docx")

replace_text handles cross-run matches transparently and preserves all unmodified OPC parts (images, charts, styles).
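
To see why cross-run matches matter, model a paragraph as a list of run strings: word processors often split a placeholder such as {{client_name}} across two or more runs. The sketch below contrasts naive per-run replacement with one possible cross-run strategy (match on the joined text, then map the match back onto the runs). It illustrates the problem only; it is not office_oxide's implementation, and both function names are hypothetical.

```python
def replace_per_run(runs, old, new):
    """Naive approach: rewrite each run independently. Returns (runs, count)."""
    count = sum(run.count(old) for run in runs)
    return [run.replace(old, new) for run in runs], count


def replace_across_runs(runs, old, new):
    """Match against the joined text, then map each match back onto the runs.

    The replacement lands in the run where the match starts; matched
    characters are removed from later runs, so the number of runs (and the
    formatting a real document attaches to them) is preserved.
    """
    if not old:
        raise ValueError("old must be non-empty")
    runs = list(runs)
    count = 0
    search_from = 0
    while True:
        joined = "".join(runs)
        start = joined.find(old, search_from)
        if start == -1:
            return runs, count
        end = start + len(old)
        offset = 0
        for i, run in enumerate(runs):
            run_start, run_end = offset, offset + len(run)
            offset = run_end
            if run_end <= start or run_start >= end:
                continue  # this run does not overlap the match
            lo = max(start, run_start) - run_start
            hi = min(end, run_end) - run_start
            # Insert the replacement only in the run where the match begins.
            insert = new if run_start <= start else ""
            runs[i] = run[:lo] + insert + run[hi:]
        count += 1
        search_from = start + len(new)  # skip past the inserted text


# The placeholder is split across the first two runs.
runs = ["Dear {{client", "_name}}", ", welcome."]
naive_runs, naive_count = replace_per_run(runs, "{{client_name}}", "Acme Corp")
fixed_runs, fixed_count = replace_across_runs(runs, "{{client_name}}", "Acme Corp")
print(naive_count, naive_runs)
print(fixed_count, fixed_runs)
```

The naive version reports zero replacements because no single run contains the whole placeholder; the cross-run version finds it once and leaves an empty second run behind.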

Read core properties

python-docx

from docx import Document

doc = Document("report.docx")
print(doc.core_properties.author)
print(doc.core_properties.modified)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    props = doc.as_docx().core_properties()
    print(props.author)
    print(props.modified)

What office_oxide doesn’t currently expose

The unified EditableDocument covers the templating use case. For richer DOCX construction — adding paragraphs, building tables programmatically, applying named styles — drop into the format-specific module:

from office_oxide.docx import DocxBuilder

builder = DocxBuilder()
builder.add_heading("Q4 Report", level=1)
builder.add_paragraph("Revenue grew 18%.")
builder.save("report.docx")

Or generate from the IR with create_from_ir(ir, "docx", "report.docx"). See Build from IR.

Performance

Same 2,538-file corpus, single-threaded:

Library        Mean      p99       Pass rate
office_oxide   0.8 ms    3.9 ms    98.9%
python-docx    11.8 ms   98 ms     95.1%

A million-document ingestion that takes python-docx 3 hours 16 minutes finishes in 14 minutes with office_oxide on the same hardware.
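
The projection follows directly from the mean latencies in the table. A short worked calculation, assuming strictly sequential processing at the published (rounded) means; `MEAN_MS` and `total_time` are illustrative names:

```python
from datetime import timedelta

# Mean per-file latency in milliseconds, copied from the table above.
MEAN_MS = {"office_oxide": 0.8, "python-docx": 11.8}


def total_time(mean_ms, n_docs):
    """Projected single-threaded wall time for n_docs at a fixed mean latency."""
    return timedelta(milliseconds=mean_ms * n_docs)


projected = {name: total_time(ms, 1_000_000) for name, ms in MEAN_MS.items()}
speedup = MEAN_MS["python-docx"] / MEAN_MS["office_oxide"]

for name, duration in projected.items():
    print(f"{name}: {duration}")
print(f"speedup: {speedup:.2f}x")
```

At these rounded means python-docx projects to 11,800 s (3 h 16 min 40 s) and office_oxide to 800 s (13 min 20 s), close to the 14 minutes quoted above, with a ratio of about 14.75×.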

See also