What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

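Migrating from python-docx

Office Oxide is a direct replacement for the most common python-docx use cases (text extraction, paragraph iteration, table reading, and find-and-replace): it is about 14× faster, and its pass rate on a corpus of 2,538 DOCX files is 3.8 percentage points higher. As a bonus, you no longer need separate libraries for .xlsx (openpyxl), .pptx (python-pptx), and legacy .doc (catdoc / antiword); a single pip install covers all six formats.

When to migrate

Switch if you are doing any of the following:

Extracting text or Markdown from .docx for ingestion, RAG, or search
Running find-and-replace templating across thousands of documents
Reading tables from contracts or reports
Handling .xlsx, .pptx, or legacy formats as well, without adding more dependencies

Stay with python-docx if you are doing either of the following and do not plan to drop down to the per-format Rust API:

Building complex DOCX files from scratch with custom styles and themes
Relying on python-docx extension libraries (such as docxcompose or python-docx-ng)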

Installation

pip uninstall python-docx
pip install office-oxide

The PyPI package name is office-oxide (hyphen); the import name is office_oxide (underscore).
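Because the distribution name and the import name differ, it can help to confirm which name Python actually resolves. The following is a minimal sketch using only the standard library; the helper name `is_importable` is made up for illustration and is not part of office_oxide.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be imported in this environment."""
    # find_spec locates the module without executing it.
    return importlib.util.find_spec(module_name) is not None


# The import name uses an underscore. The hyphenated PyPI name is not a
# valid Python identifier, so it can never appear in an import statement.
print("office_oxide importable:", is_importable("office_oxide"))
print("'office-oxide' is a valid identifier:", "office-oxide".isidentifier())
```

If the first line prints False after installation, the package was probably installed into a different interpreter or virtual environment.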

Side-by-side comparison

Plain text

python-docx

from docx import Document

doc = Document("report.docx")
text = "\n".join(p.text for p in doc.paragraphs)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    text = doc.plain_text()

A single method call, headers and footers included, and roughly 14× faster.

Markdown / HTML

python-docx: no built-in Markdown or HTML export; you have to use pandoc, mammoth, or write your own converter.

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    md   = doc.to_markdown()
    html = doc.to_html()

Iterating over paragraphs

python-docx

from docx import Document

doc = Document("report.docx")
for p in doc.paragraphs:
    print(p.style.name, p.text)

office_oxide (via the IR)

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Heading":
            print(f"H{el['level']}", el["text"])
        elif el["kind"] == "Paragraph":
            print("P", " ".join(r["text"] for r in el["runs"]))
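The loop above can be wrapped into a small reusable helper. This is only a sketch: it assumes the IR is a plain dict with exactly the keys used above (sections, elements, kind, level, text, runs), and `sample_ir` is a hand-written stand-in, not real library output. The name `outline` is hypothetical.

```python
from __future__ import annotations


def outline(ir: dict) -> list[tuple[str, str]]:
    """Collect (tag, text) pairs for headings and paragraphs in an IR dict."""
    items = []
    for section in ir.get("sections", []):
        for el in section.get("elements", []):
            kind = el.get("kind")
            if kind == "Heading":
                items.append((f"H{el['level']}", el["text"]))
            elif kind == "Paragraph":
                # Join run texts with a space, as in the loop above.
                items.append(("P", " ".join(r["text"] for r in el["runs"])))
            # Any other element kind (for example a table) is skipped.
    return items


# ASSUMPTION: hand-written data mimicking the IR shape shown above.
sample_ir = {
    "sections": [
        {
            "elements": [
                {"kind": "Heading", "level": 1, "text": "Q4 Report"},
                {"kind": "Paragraph",
                 "runs": [{"text": "Revenue"}, {"text": "grew 18%."}]},
                {"kind": "Table", "rows": []},
            ]
        }
    ]
}

for tag, text in outline(sample_ir):
    print(tag, text)
```

Working on the dict after the with block has closed keeps the file handle open only for as long as the conversion itself takes.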

Iterating over tables

python-docx

from docx import Document

doc = Document("report.docx")
for table in doc.tables:
    for row in table.rows:
        cells = [cell.text for cell in row.cells]
        print(cells)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Table":
            for row in el["rows"]:
                print(row)
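A common next step is writing each table out as CSV. The sketch below uses the standard csv module and assumes, without confirmation from the library documentation, that every entry in rows is a list of plain cell strings; if cells turn out to be richer objects, extract their text first. `tables_to_csv` and `sample_ir` are illustrative names.

```python
from __future__ import annotations

import csv
import io


def tables_to_csv(ir: dict) -> list[str]:
    """Return one CSV string per Table element found in an IR dict."""
    out = []
    for section in ir.get("sections", []):
        for el in section.get("elements", []):
            if el.get("kind") != "Table":
                continue
            buf = io.StringIO()
            writer = csv.writer(buf, lineterminator="\n")
            # ASSUMPTION: each row is a list of plain cell strings.
            writer.writerows(el["rows"])
            out.append(buf.getvalue())
    return out


# Hand-written stand-in for doc.to_ir(); not real library output.
sample_ir = {
    "sections": [
        {
            "elements": [
                {"kind": "Table",
                 "rows": [["Item", "Amount"], ["Licences", "1,200"]]},
            ]
        }
    ]
}

for csv_text in tables_to_csv(sample_ir):
    print(csv_text)
```

The csv writer quotes any cell that contains a comma, so values such as 1,200 survive a round trip.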

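Find and replace (templating)

python-docx: there is no first-class API. The usual approach is to walk every run and rewrite its text, but matches that span more than one run break. Most users reach for docx-mailmerge or write fragile regular expressions.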

office_oxide

from office_oxide import EditableDocument

with EditableDocument.open("template.docx") as ed:
    n = ed.replace_text("{{client_name}}", "Acme Corp")
    print(f"{n} replacements")
    ed.save("filled.docx")

replace_text transparently handles matches that span runs and preserves every unmodified OPC part (images, charts, styles).
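To see why matches that span runs are a problem, consider a paragraph modeled as a list of run strings. The sketch below is a pure-Python illustration of the general technique, not office_oxide's implementation: it contrasts a naive per-run replace with one that searches the joined text and maps the match back onto the runs. Both function names are hypothetical.

```python
from __future__ import annotations


def replace_per_run(runs: list[str], old: str, new: str) -> tuple[list[str], int]:
    """Naive approach: replace inside each run independently."""
    count = sum(r.count(old) for r in runs)
    return [r.replace(old, new) for r in runs], count


def replace_across_runs(runs: list[str], old: str, new: str) -> tuple[list[str], int]:
    """Replace `old` even when it is split over several runs."""
    if not old:
        raise ValueError("search text must not be empty")
    runs = list(runs)
    count = 0
    search_from = 0
    while True:
        start = "".join(runs).find(old, search_from)
        if start == -1:
            return runs, count
        end = start + len(old)
        pos = 0
        first = True
        for i, run in enumerate(runs):
            run_start, run_end = pos, pos + len(run)
            pos = run_end
            # Slice of this run that overlaps the match [start, end).
            lo = max(start, run_start) - run_start
            hi = min(end, run_end) - run_start
            if lo >= hi:
                continue
            # The first overlapping run receives the replacement text;
            # later runs only lose their share of the match.
            runs[i] = run[:lo] + (new if first else "") + run[hi:]
            first = False
        count += 1
        search_from = start + len(new)


# The placeholder is split between the first two runs.
runs = ["Dear {{client", "_name}}", ", welcome."]
print(replace_per_run(runs, "{{client_name}}", "Acme Corp"))
print(replace_across_runs(runs, "{{client_name}}", "Acme Corp"))
```

The per-run version reports zero replacements because no single run contains the whole placeholder. The run-aware version reports one and keeps the run count unchanged, which is what lets each run's formatting stay attached to its remaining text.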

Reading core properties

python-docx

from docx import Document

doc = Document("report.docx")
print(doc.core_properties.author)
print(doc.core_properties.modified)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    props = doc.as_docx().core_properties()
    print(props.author)
    print(props.modified)

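What office_oxide does not currently expose

The unified EditableDocument covers the templating use case. For more involved DOCX construction (adding paragraphs, building tables programmatically, applying named styles), drop down to the per-format module: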

from office_oxide.docx import DocxBuilder

builder = DocxBuilder()
builder.add_heading("Q4 Report", level=1)
builder.add_paragraph("Revenue grew 18%.")
builder.save("report.docx")

Or generate from the IR: create_from_ir(ir, "docx", "report.docx"). See Building from IR.

Performance

Same 2,538-file corpus, single-threaded:

Library	Mean	p99	Pass rate
office_oxide	0.8 ms	3.9 ms	98.9%
python-docx	11.8 ms	98 ms	95.1%

Ingesting a million documents: python-docx takes 3 hours 16 minutes; office_oxide finishes in 14 minutes on the same hardware.
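The million-document figures follow from the mean latencies in the table. As a rough check, the sketch below multiplies each mean by the document count, assuming single-threaded processing in which every file costs exactly the mean; the function name `projected_runtime` is illustrative.

```python
from datetime import timedelta

# Mean per-document latency in milliseconds, taken from the table above.
MEAN_MS = {"office_oxide": 0.8, "python-docx": 11.8}


def projected_runtime(mean_ms: float, n_docs: int) -> timedelta:
    """Wall time if every document costs exactly the mean latency."""
    return timedelta(milliseconds=mean_ms * n_docs)


for name, mean_ms in MEAN_MS.items():
    print(f"{name}: {projected_runtime(mean_ms, 1_000_000)}")
```

This gives about 3 h 16 min 40 s for python-docx and about 13 min 20 s for office_oxide, in line with the rounded figures quoted above. Real runs will deviate because tail latency (the p99 column) and I/O are not captured by the mean.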
