Migrate from python-docx
Office Oxide is a drop-in replacement for the most common python-docx use cases (text extraction, paragraph iteration, table reading, find-and-replace) at 14× the speed, with a 3.8 percentage-point higher pass rate on a 2,538-file DOCX corpus. As a bonus, you no longer need separate libraries for .xlsx (openpyxl), .pptx (python-pptx), and legacy .doc (catdoc / antiword): one pip install covers all six formats.
When to migrate
Switch if you do any of these:
- Extract text or Markdown from .docx for ingestion / RAG / search
- Run find-and-replace templating across thousands of docs
- Read tables out of contracts or reports
- Want to also process .xlsx, .pptx, or legacy formats without adding more dependencies
Stay on python-docx if you do these and aren’t ready to drop into the format-specific Rust API:
- Build complex DOCX from scratch with custom styles and themes
- Need python-docx extension libraries (e.g. docxcompose, python-docx-ng)
Install
pip uninstall python-docx
pip install office-oxide
The PyPI distribution is office-oxide (hyphen); the import is office_oxide (underscore).
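Because the distribution name and the import name differ, a quick check after installing can confirm which import names are actually available in the current environment. This is a minimal sketch using only the standard library; `installed_modules` is a made-up helper name, and `docx` is the import name that python-docx provides.

```python
from importlib.util import find_spec


def installed_modules(names=("office_oxide", "docx")):
    """Map each top-level import name to whether it can be imported here."""
    # find_spec returns None for a missing top-level module and does not import it.
    return {name: find_spec(name) is not None for name in names}


for name, present in installed_modules().items():
    print(f"{name}: {'installed' if present else 'missing'}")
```

After the two pip commands above, you would expect `office_oxide` to be reported as installed and `docx` as missing.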
Side-by-side cheat sheet
Plain text
python-docx
from docx import Document
doc = Document("report.docx")
text = "\n".join(p.text for p in doc.paragraphs)
office_oxide
from office_oxide import Document
with Document.open("report.docx") as doc:
    text = doc.plain_text()
One method call, includes headers and footers, and ~14× faster.
Markdown / HTML
python-docx — no built-in Markdown / HTML; you’d reach for pandoc, mammoth, or hand-roll a converter.
office_oxide
from office_oxide import Document
with Document.open("report.docx") as doc:
    md = doc.to_markdown()
    html = doc.to_html()
Iterate paragraphs
python-docx
from docx import Document
doc = Document("report.docx")
for p in doc.paragraphs:
    print(p.style.name, p.text)
office_oxide (via the IR)
from office_oxide import Document
with Document.open("report.docx") as doc:
    ir = doc.to_ir()
    for section in ir["sections"]:
        for el in section["elements"]:
            if el["kind"] == "Heading":
                print(f"H{el['level']}", el["text"])
            elif el["kind"] == "Paragraph":
                print("P", " ".join(r["text"] for r in el["runs"]))
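To try the traversal without a file on disk, the same loop can be run against a hand-written dictionary. The sample below only mirrors the keys used above (`sections`, `elements`, `kind`, `level`, `text`, `runs`); it is an assumed shape for illustration, not real `to_ir()` output, and `outline` is a hypothetical helper.

```python
def outline(ir):
    """Return (label, text) pairs for headings and paragraphs in an IR-like dict."""
    items = []
    for section in ir["sections"]:
        for el in section["elements"]:
            if el["kind"] == "Heading":
                items.append((f"H{el['level']}", el["text"]))
            elif el["kind"] == "Paragraph":
                # Paragraph text is split across runs; join them with spaces.
                items.append(("P", " ".join(r["text"] for r in el["runs"])))
            # Any other kind (tables, for example) is skipped here.
    return items


# Hand-written sample, NOT produced by office_oxide.
sample_ir = {
    "sections": [
        {
            "elements": [
                {"kind": "Heading", "level": 1, "text": "Q4 Report"},
                {"kind": "Paragraph", "runs": [{"text": "Revenue grew"}, {"text": "18%."}]},
                {"kind": "Table", "rows": []},
            ]
        }
    ]
}

for label, text in outline(sample_ir):
    print(label, text)
```

Collecting the pairs in a list rather than printing inside the loop makes the result easy to feed into a search index or a test.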
Iterate tables
python-docx
from docx import Document
doc = Document("report.docx")
for table in doc.tables:
    for row in table.rows:
        cells = [cell.text for cell in row.cells]
        print(cells)
office_oxide
from office_oxide import Document
with Document.open("report.docx") as doc:
    ir = doc.to_ir()
    for section in ir["sections"]:
        for el in section["elements"]:
            if el["kind"] == "Table":
                for row in el["rows"]:
                    print(row)
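Table rows usually end up as records keyed by the header row. The sketch below shows one way to do that on an IR-like dictionary. It assumes each row is a flat list of cell strings, which the example above does not confirm, so check the real row shape before relying on it. `tables_as_records` and the sample data are hypothetical.

```python
def tables_as_records(ir):
    """Turn each IR-like table into a list of dicts keyed by its first row."""
    tables = []
    for section in ir["sections"]:
        for el in section["elements"]:
            if el["kind"] != "Table" or not el["rows"]:
                continue
            # ASSUMPTION: every row is a flat list of cell strings.
            header, *body = el["rows"]
            tables.append([dict(zip(header, row)) for row in body])
    return tables


# Hand-written sample, NOT produced by office_oxide.
sample_ir = {
    "sections": [
        {
            "elements": [
                {
                    "kind": "Table",
                    "rows": [
                        ["Client", "Total"],
                        ["Acme Corp", "1200"],
                        ["Globex", "950"],
                    ],
                }
            ]
        }
    ]
}

for table in tables_as_records(sample_ir):
    for record in table:
        print(record)
```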
Find and replace (templating)
python-docx — no first-class API; common pattern is to walk every run and rewrite text, which breaks on cross-run matches. Most users vendor docx-mailmerge or write fragile regex.
office_oxide
from office_oxide import EditableDocument
with EditableDocument.open("template.docx") as ed:
    n = ed.replace_text("{{client_name}}", "Acme Corp")
    print(f"{n} replacements")
    ed.save("filled.docx")
replace_text handles cross-run matches transparently and preserves all unmodified OPC parts (images, charts, styles).
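The cross-run problem is easy to reproduce with plain strings. In a DOCX file, one visible placeholder is often stored as several runs, so a per-run replace never sees the full token. The sketch below models runs as a list of strings; it illustrates the failure mode only and is not how office_oxide implements `replace_text`. Both helper names are made up.

```python
def replace_per_run(runs, old, new):
    """Naive approach: replace inside each run independently."""
    return [run.replace(old, new) for run in runs]


def replace_across_runs(runs, old, new):
    """Join the runs, replace once, and keep the result in the first run.

    This throws away per-run formatting, so it is a demonstration only.
    """
    if not runs:
        return []
    joined = "".join(runs).replace(old, new)
    return [joined] + [""] * (len(runs) - 1)


# One placeholder split across two runs, as an editor might store it.
runs = ["Dear {{client", "_name}}", ", welcome."]

per_run = replace_per_run(runs, "{{client_name}}", "Acme Corp")
across = replace_across_runs(runs, "{{client_name}}", "Acme Corp")

print("".join(per_run))  # placeholder survives untouched
print("".join(across))   # placeholder replaced
```

The per-run version leaves the text unchanged because neither run contains the whole placeholder, which is the breakage described above.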
Read core properties
python-docx
from docx import Document
doc = Document("report.docx")
print(doc.core_properties.author)
print(doc.core_properties.modified)
office_oxide
from office_oxide import Document
with Document.open("report.docx") as doc:
    props = doc.as_docx().core_properties()
    print(props.author)
    print(props.modified)
What office_oxide doesn’t currently expose
The unified EditableDocument covers the templating use case. For richer DOCX construction — adding paragraphs, building tables programmatically, applying named styles — drop into the format-specific module:
from office_oxide.docx import DocxBuilder
builder = DocxBuilder()
builder.add_heading("Q4 Report", level=1)
builder.add_paragraph("Revenue grew 18%.")
builder.save("report.docx")
Or generate from the IR with create_from_ir(ir, "docx", "report.docx"). See Build from IR.
Performance
Same 2,538-file corpus, single-threaded:
| Library | Mean | p99 | Pass Rate |
|---|---|---|---|
| office_oxide | 0.8 ms | 3.9 ms | 98.9% |
| python-docx | 11.8 ms | 98 ms | 95.1% |
A million-document ingestion that takes python-docx 3 hours 16 minutes finishes in 14 minutes with office_oxide on the same hardware.
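These totals come from multiplying the mean per-document time by the document count. Here is a small sketch of that arithmetic using the rounded means from the table. With those rounded inputs the office_oxide total comes out a little under the quoted 14 minutes, which presumably reflects an unrounded mean. `total_time` is an illustrative helper.

```python
from datetime import timedelta


def total_time(mean_ms, documents=1_000_000):
    """Estimated single-threaded wall time for `documents` files."""
    return timedelta(milliseconds=mean_ms * documents)


# Mean per-document times taken from the table above (rounded values).
means_ms = {"office_oxide": 0.8, "python-docx": 11.8}

for library, mean_ms in means_ms.items():
    print(f"{library}: {total_time(mean_ms)}")
```

The estimate scales linearly, so for a corpus of a different size pass its file count as `documents`.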
See also
- Migrate from openpyxl — XLSX
- Migrate from python-pptx — PPTX
- Performance benchmarks — full numbers
- Editing overview — what EditableDocument preserves