Python Office Library — Quick Start
office_oxide is the fastest Python library for Office documents. Pure-Rust core, idiomatic Python API, no runtime dependencies. Read DOCX, XLSX, PPTX (and legacy DOC, XLS, PPT) in under a millisecond — 8 to 100× faster than python-docx, openpyxl, and python-pptx.
Install
pip install office-oxide
Wheels are published for CPython 3.8–3.14 on Linux, macOS, and Windows. The PyPI distribution is office-oxide (hyphen); the import name is office_oxide (underscore).
Read a document
from office_oxide import Document
with Document.open("report.docx") as doc:
    print(doc.plain_text())
Or the one-shot helper:
import office_oxide
print(office_oxide.extract_text("report.docx"))
Core API
Document.open detects the format from the file extension (and double-checks via magic bytes). It accepts str, bytes, or any os.PathLike. Use it as a context manager so native memory is released deterministically.
from pathlib import Path
from office_oxide import Document
with Document.open(Path("data/deck.pptx")) as doc:
    print(doc.format)         # "pptx"
    print(doc.plain_text())   # str
    print(doc.to_markdown())  # str (GitHub-flavored Markdown)
    print(doc.to_html())      # str (semantic HTML)
    ir = doc.to_ir()          # nested dict (DocumentIR)
    doc.save_as("deck.docx")  # legacy PPT → PPTX works too
Open from raw bytes when the file isn’t on disk:
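from office_oxide import Document
# Read the file inside a with block so the file handle is closed promptly.
with open("report.xlsx", "rb") as f:
    data = f.read()
with Document.from_bytes(data, "xlsx") as doc:
    print(doc.plain_text())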
Module-level shortcuts:
import office_oxide
office_oxide.extract_text("file.docx") # → str
office_oxide.to_markdown("file.pptx") # → str
office_oxide.to_html("file.xlsx") # → str
office_oxide.version() # → "0.1.0"
Editing
EditableDocument preserves every unmodified OPC part (images, charts, styles, relationships) on save. DOCX, XLSX, and PPTX only.
from office_oxide import EditableDocument
with EditableDocument.open("template.docx") as ed:
    n = ed.replace_text("{{name}}", "Alice")
    print(f"{n} replacements")
    ed.save("out.docx")
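When a template carries several placeholders, the replace_text calls can be driven from a mapping. The following is a minimal sketch that relies only on the behavior shown above (replace_text(old, new) returns the number of replacements). fill_template and the _FakeEditable stand-in are hypothetical names used for illustration; they are not part of office_oxide.

```python
from typing import Dict, Mapping, Protocol


class SupportsReplaceText(Protocol):
    """Anything with a replace_text(old, new) -> int method."""

    def replace_text(self, old: str, new: str) -> int: ...


def fill_template(ed: SupportsReplaceText, values: Mapping[str, str]) -> Dict[str, int]:
    """Replace each {{key}} placeholder and report how often it was found.

    Hypothetical helper for illustration, not an office_oxide function.
    """
    counts = {}
    for key, value in values.items():
        placeholder = "{{" + key + "}}"
        counts[key] = ed.replace_text(placeholder, value)
    return counts


class _FakeEditable:
    """In-memory stand-in so the sketch runs without a real document."""

    def __init__(self, text: str) -> None:
        self.text = text

    def replace_text(self, old: str, new: str) -> int:
        n = self.text.count(old)
        self.text = self.text.replace(old, new)
        return n


fake = _FakeEditable("Dear {{name}}, your order {{id}} shipped. Bye, {{name}}.")
counts = fill_template(fake, {"name": "Alice", "id": "A-17", "missing": "x"})
print(counts)
print(fake.text)
```

With a real document, the EditableDocument handle would be passed in place of the stand-in. A count of 0 flags a placeholder that was never found in the template.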
Replace text in DOCX / PPTX
from office_oxide import EditableDocument
with EditableDocument.open("slides.pptx") as ed:
    ed.replace_text("Q3", "Q4")
    ed.replace_text("2024", "2025")
    ed.save("slides_q4.pptx")
replace_text walks <w:t> elements in DOCX and <a:t> across every slide in PPTX. Returns the replacement count.
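To make "walks <w:t> elements" concrete, here is a rough standard-library illustration of the idea on a hand-written WordprocessingML fragment. It is not office_oxide's implementation; replace_in_text_runs and SAMPLE are hypothetical names, and only the WordprocessingML namespace URI is taken from the OOXML standard.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written fragment: two paragraphs, three text runs.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Q3 revenue</w:t></w:r><w:r><w:t> beat Q3 plan</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>No match here</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def replace_in_text_runs(xml: str, old: str, new: str):
    """Replace `old` inside each <w:t> element; return (new_xml, count)."""
    root = ET.fromstring(xml)
    count = 0
    for t in root.iter(f"{{{W_NS}}}t"):
        if t.text and old in t.text:
            count += t.text.count(old)
            t.text = t.text.replace(old, new)
    return ET.tostring(root, encoding="unicode"), count


new_xml, n = replace_in_text_runs(SAMPLE, "Q3", "Q4")
print(n)
```

This naive version only matches text that sits entirely inside one <w:t> element. Word often splits a phrase across several runs, and this page does not say how office_oxide treats that case, so it is worth checking the returned count against what you expect.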
Write XLSX cells
from office_oxide import EditableDocument
with EditableDocument.open("budget.xlsx") as ed:
    ed.set_cell(0, "A1", "Total")  # string
    ed.set_cell(0, "B1", 42.5)     # number (int also accepted)
    ed.set_cell(0, "C1", True)     # boolean
    ed.set_cell(0, "D1", None)     # empty
    ed.save("budget.xlsx")
The first argument, sheet_index, is zero-based; the second, cell_ref, uses standard spreadsheet notation (column letters followed by a one-based row number, as in "B1").
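If your data is indexed by row and column numbers, it helps to see how spreadsheet notation maps to indices. The sketch below converts an A1-style reference to zero-based (row, column) indices. parse_cell_ref is a hypothetical helper written for this page; office_oxide does its own parsing and does not require it.

```python
import re
from typing import Tuple

_CELL_REF = re.compile(r"^([A-Za-z]+)([1-9][0-9]*)$")


def parse_cell_ref(cell_ref: str) -> Tuple[int, int]:
    """Convert an A1-style reference to zero-based (row, column) indices."""
    match = _CELL_REF.match(cell_ref)
    if match is None:
        raise ValueError(f"not an A1-style reference: {cell_ref!r}")
    letters, digits = match.groups()
    column = 0
    for ch in letters.upper():
        # Column letters form a bijective base-26 number: A=1 ... Z=26, AA=27.
        column = column * 26 + (ord(ch) - ord("A") + 1)
    return int(digits) - 1, column - 1


print(parse_cell_ref("A1"))    # (0, 0)
print(parse_cell_ref("B1"))    # (0, 1)
print(parse_cell_ref("AA10"))  # (9, 26)
```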
Format-agnostic IR
doc.to_ir() returns a nested dict mirroring the Rust DocumentIR: sections of headings, paragraphs, tables, lists, and images. Useful for pipelines or for feeding LLMs structured context.
ir = doc.to_ir()
for section in ir["sections"]:
    print(section.get("title"))
    for el in section["elements"]:
        kind = el["kind"]  # "Heading" | "Paragraph" | "Table" | "List" | ...
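The loop above can be turned into small reusable helpers. The sketch below runs on a hand-written sample dict rather than real to_ir() output: only the keys "sections", "title", "elements", and "kind" come from this page, and count_kinds, section_titles, and sample_ir are hypothetical names.

```python
from collections import Counter

# Hypothetical sample shaped like the IR described above. Real output from
# to_ir() carries more fields; only the keys shown on this page are used.
sample_ir = {
    "sections": [
        {
            "title": "Summary",
            "elements": [{"kind": "Heading"}, {"kind": "Paragraph"}, {"kind": "Paragraph"}],
        },
        {"elements": [{"kind": "Table"}]},  # a section without a title
    ]
}


def count_kinds(ir: dict) -> Counter:
    """Tally element kinds across every section of an IR dict."""
    kinds = Counter()
    for section in ir["sections"]:
        for el in section["elements"]:
            kinds[el["kind"]] += 1
    return kinds


def section_titles(ir: dict) -> list:
    """Collect section titles, using None where a section has no title."""
    return [section.get("title") for section in ir["sections"]]


print(count_kinds(sample_ir))
print(section_titles(sample_ir))
```

A tally like this is a cheap way to decide, for example, whether a document is table-heavy before choosing how to chunk it for an LLM.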
Bytes-based pipelines
from_bytes avoids temp files in serverless / streaming workflows:
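import requests
from office_oxide import Document
resp = requests.get("https://example.com/doc.docx", timeout=30)
resp.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
data = resp.content
with Document.from_bytes(data, "docx") as doc:
    print(doc.to_markdown())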
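from_bytes needs a format string, which is awkward when the bytes arrive without a filename. One possible approach for OOXML input is to look inside the ZIP package for the conventional main part of each format. This is a sketch, not an office_oxide feature: guess_ooxml_format is a hypothetical helper, and it relies on the conventional part names rather than following the package relationships as a strict reader would.

```python
import io
import zipfile
from typing import Optional

# Conventional main part of each OOXML package type, mapped to the format
# string passed as the second argument of Document.from_bytes.
_MAIN_PARTS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}


def guess_ooxml_format(data: bytes) -> Optional[str]:
    """Return "docx", "xlsx", "pptx", or None if the bytes are not recognized."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
    for part, fmt in _MAIN_PARTS.items():
        if part in names:
            return fmt
    return None


# Build a tiny in-memory ZIP that only mimics a DOCX layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")

print(guess_ooxml_format(buf.getvalue()))
print(guess_ooxml_format(b"plain text"))
```

The result can then be passed on, for example Document.from_bytes(data, fmt), once fmt is known not to be None. Legacy DOC, XLS, and PPT files are not ZIP archives, so this helper returns None for them.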
Legacy formats
DOC, XLS, and PPT route through the same API. Extension detection picks the legacy CFB parser automatically; save_as transparently produces a modern OOXML file:
from office_oxide import Document
with Document.open("legacy.doc") as doc:
    print(doc.plain_text())
    doc.save_as("legacy.docx")  # DOC → DOCX in one line
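The two container families are easy to tell apart by their leading bytes: modern OOXML files are ZIP archives, while legacy DOC, XLS, and PPT files use the Compound File Binary (CFB) container. The sketch below shows such a signature check in plain Python. It illustrates the general technique only; sniff_container is a hypothetical helper and not how office_oxide is known to implement its detection.

```python
ZIP_MAGIC = b"PK\x03\x04"  # local file header of a ZIP archive (OOXML)
CFB_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # Compound File Binary header


def sniff_container(data: bytes) -> str:
    """Classify raw bytes as "zip", "cfb", or "unknown" from their signature."""
    if data.startswith(ZIP_MAGIC):
        return "zip"
    if data.startswith(CFB_MAGIC):
        return "cfb"
    return "unknown"


print(sniff_container(b"PK\x03\x04" + b"\x00" * 16))
print(sniff_container(CFB_MAGIC + b"\x00" * 16))
print(sniff_container(b"%PDF-1.7"))
```

A check like this is handy for catching a legacy file that was renamed with a modern extension (or the reverse) before handing it to a parser.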
Errors
Parse and IO failures raise OfficeOxideError. IO failures in save_as are wrapped in IOError (an alias of OSError in Python 3).
from office_oxide import Document, OfficeOxideError
try:
    with Document.open("weird.file") as doc:
        print(doc.plain_text())
except OfficeOxideError as e:
    print(f"office_oxide failed: {e}")
except FileNotFoundError:
    print("no such file")
Troubleshooting
| Symptom | Likely cause and fix |
|---|---|
| OfficeOxideError: unsupported format: "" | No extension on the path. Use Document.from_bytes(data, "docx"). |
| RuntimeError: Document is closed | You exited the with block and still hold a reference. Open a fresh handle. |
| ImportError: _native | The wheel didn't match your platform. Run pip install --force-reinstall office-oxide. |
| Legacy DOC renders as gibberish | The file may be encrypted (Word 97 RC4). office_oxide does not decrypt; use LibreOffice first. |
| Unicode issues on Windows | Use pathlib.Path instead of byte paths; Document.open handles platform encoding. |
See also
- Migrate from python-docx — drop-in replacement, 14× faster
- Migrate from openpyxl — same XLSX coverage, 18× faster
- Migrate from python-pptx — 46× faster on PPTX
- Performance benchmarks — full numbers across 6,062 files
- Package on PyPI