What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.
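
The ratios quoted above follow directly from the mean latencies. A minimal sketch of that arithmetic, using only the figures stated in this answer (the `MEAN_MS` table and `speedup` helper are illustrative names, not part of any library):

```python
# Mean XLSX read times in milliseconds, copied from the benchmark figures above.
MEAN_MS = {
    "office_oxide": 5.0,
    "python-calamine": 13.9,
    "openpyxl": 94.5,
}


def speedup(baseline: str, candidate: str = "office_oxide") -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return MEAN_MS[baseline] / MEAN_MS[candidate]


for lib in ("python-calamine", "openpyxl"):
    print(f"office_oxide vs {lib}: {speedup(lib):.2f}x")
```

The openpyxl ratio works out slightly under 19, so the quoted 18× is the truncated figure rather than the rounded one.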

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

Migrating from python-pptx

Office Oxide reads PPTX 46× faster than python-pptx (806 files, mean 0.7 ms vs 32.5 ms), with a pass rate 11.7 percentage points higher. It also reads legacy .ppt files directly, which python-pptx cannot do.

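When to migrate

Switch if any of the following applies:

You extract slide text, notes, or tables from .pptx for data ingestion / RAG
You convert presentations to Markdown or HTML for preview
You run find-and-replace templates ("Q3 → Q4", "{{quarter}}", "{{growth}}")
You need .ppt support but do not want to shell out to LibreOffice
You want one library that also covers .docx, .xlsx, and the legacy formats

Stay with python-pptx if:

You build complex PPTX files from scratch with custom layouts, animations, transitions, and shape geometry
You need fine-grained control over the slide layout XML

Installation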

pip uninstall python-pptx
pip install office-oxide

Side-by-side cheat sheet

Read all slide text

python-pptx

from pptx import Presentation

prs = Presentation("deck.pptx")
for slide in prs.slides:
    for shape in slide.shapes:
        if shape.has_text_frame:
            for para in shape.text_frame.paragraphs:
                for run in para.runs:
                    print(run.text)

office_oxide

from office_oxide import Document

with Document.open("deck.pptx") as doc:
    text = doc.plain_text()
print(text)

Iterate slide by slide

python-pptx

from pptx import Presentation

prs = Presentation("deck.pptx")
for i, slide in enumerate(prs.slides, 1):
    title = slide.shapes.title.text if slide.shapes.title else "(no title)"
    print(f"slide {i}: {title}")

office_oxide

from office_oxide import Document

with Document.open("deck.pptx") as doc:
    ir = doc.to_ir()

for i, section in enumerate(ir["sections"], 1):
    print(f"slide {i}: {section.get('title') or '(no title)'}")

Each IR section corresponds to one slide. section["title"] comes from the title placeholder.
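
To make the shape of that traversal concrete without needing a real deck, here is a small sketch that runs on a hand-built dictionary. It assumes only the `sections` and `title` keys used in the snippet above; `slide_titles` and `sample_ir` are hypothetical names, and a real `to_ir()` result may carry more keys.

```python
from typing import Any


def slide_titles(ir: dict[str, Any]) -> list[str]:
    """Return one label per slide, falling back when a title is missing."""
    labels = []
    for number, section in enumerate(ir.get("sections", []), start=1):
        title = section.get("title") or "(no title)"
        labels.append(f"slide {number}: {title}")
    return labels


# Hand-built stand-in for doc.to_ir(); not produced by the library.
sample_ir = {
    "sections": [
        {"title": "Q4 Review", "elements": []},
        {"title": None, "elements": []},
    ]
}
print(slide_titles(sample_ir))
```

Returning a list instead of printing keeps the helper easy to reuse, for example to build a table of contents for a preview page.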

Read tables on slides

python-pptx

from pptx import Presentation

prs = Presentation("deck.pptx")
for slide in prs.slides:
    for shape in slide.shapes:
        if shape.has_table:
            for row in shape.table.rows:
                cells = [c.text for c in row.cells]
                print(cells)

office_oxide

from office_oxide import Document

with Document.open("deck.pptx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Table":
            for row in el["rows"]:
                print(row)
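
For ingestion pipelines it is often handier to serialize each table than to print it. The sketch below walks the same `sections` / `elements` / `kind` / `rows` keys as the snippet above and writes each table to a CSV string. It assumes every row is a list of plain cell strings, which the source does not confirm; the `Paragraph` element in the sample data is also made up for illustration.

```python
import csv
import io
from typing import Any


def tables_to_csv(ir: dict[str, Any]) -> list[str]:
    """Serialize every Table element in the IR to its own CSV string."""
    outputs = []
    for section in ir.get("sections", []):
        for element in section.get("elements", []):
            if element.get("kind") != "Table":
                continue
            buffer = io.StringIO()
            writer = csv.writer(buffer, lineterminator="\n")
            # Assumption: each row is a list of plain cell strings.
            writer.writerows(element.get("rows", []))
            outputs.append(buffer.getvalue())
    return outputs


# Hand-built stand-in for doc.to_ir(); the element shapes are assumptions.
sample_ir = {
    "sections": [
        {
            "title": "Results",
            "elements": [
                {"kind": "Paragraph", "text": "Summary"},
                {"kind": "Table", "rows": [["Quarter", "Growth"], ["Q4", "+18.4%"]]},
            ],
        }
    ]
}
print(tables_to_csv(sample_ir)[0])
```

If real rows turn out to be richer objects (for example dictionaries with cell metadata), only the `writerows` line would need to change.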

Read speaker notes

python-pptx

from pptx import Presentation

prs = Presentation("deck.pptx")
for slide in prs.slides:
    if slide.has_notes_slide:
        print(slide.notes_slide.notes_text_frame.text)

office_oxide

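plain_text() and to_markdown() include notes by default; they are appended at the end of each slide's section. To get the notes on their own, use the format-specific accessor: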

from office_oxide import Document

with Document.open("deck.pptx") as doc:
    pptx = doc.as_pptx()
    for i, slide in enumerate(pptx.slides(), 1):
        notes = slide.notes()
        if notes:
            print(f"slide {i} notes: {notes}")

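Templating (find and replace)

python-pptx: there is no first-class API; the usual approach is to walk each shape's text frame and rewrite it. This is error-prone when a match spans multiple runs.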
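
The cross-run problem is easy to reproduce in plain Python. In the sketch below each string stands in for the text of one run in a paragraph. It is a simulation only: it uses neither python-pptx nor Office Oxide and does not show how either library works internally, and both helper names are hypothetical.

```python
def replace_per_run(runs: list[str], old: str, new: str) -> list[str]:
    """Naive approach: call str.replace on each run independently."""
    return [run.replace(old, new) for run in runs]


def replace_across_runs(runs: list[str], old: str, new: str) -> list[str]:
    """Join the runs, replace once, and put the result in the first run.

    This finds placeholders that straddle run boundaries, at the cost of
    collapsing per-run formatting onto the first run.
    """
    if not runs:
        return []
    merged = "".join(runs).replace(old, new)
    return [merged] + [""] * (len(runs) - 1)


# A placeholder that an editor has split into three runs.
runs = ["Revenue for {{qua", "rter", "}} is up"]
print(replace_per_run(runs, "{{quarter}}", "Q4 2026"))
print(replace_across_runs(runs, "{{quarter}}", "Q4 2026"))
```

The per-run version leaves the text unchanged because no single run contains the whole placeholder. The joined version finds it, but loses any formatting differences between the runs.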

office_oxide

from office_oxide import EditableDocument

with EditableDocument.open("deck_template.pptx") as ed:
    ed.replace_text("{{quarter}}", "Q4 2026")
    ed.replace_text("{{growth}}",  "+18.4%")
    ed.save("q4_deck.pptx")

replace_text walks every <a:t> element in every slide and notes slide, and preserves all unmodified OPC parts (images, charts, layouts, themes).

Convert to Markdown / HTML

python-pptx: not built in.

office_oxide

from office_oxide import Document

with Document.open("deck.pptx") as doc:
    md   = doc.to_markdown()
    html = doc.to_html()

The Markdown output has one ## Slide N section per slide, with the body text directly after the heading and the notes appended as a blockquote.
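
Because each slide gets its own heading, the Markdown can be split into per-slide chunks for a RAG index. A minimal sketch, assuming the headings look exactly like "## Slide 3" on a line of their own; the sample string is hand-written rather than real library output, and `split_slides` is a hypothetical helper.

```python
import re

# Assumption: slide headings look exactly like "## Slide 3" on their own line.
SLIDE_HEADING = re.compile(r"^## Slide (\d+)[ \t]*$", re.MULTILINE)


def split_slides(markdown: str) -> dict[int, str]:
    """Map each slide number to the Markdown body that follows its heading."""
    matches = list(SLIDE_HEADING.finditer(markdown))
    slides = {}
    for index, match in enumerate(matches):
        # A slide's body runs until the next heading or the end of the text.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(markdown)
        slides[int(match.group(1))] = markdown[match.end():end].strip()
    return slides


sample = "## Slide 1\n\nWelcome\n\n> Greet the audience\n\n## Slide 2\n\nAgenda\n"
for number, body in split_slides(sample).items():
    print(number, repr(body))
```

Keeping the notes blockquote inside the slide's chunk means that retrieval for a slide also surfaces what the presenter intended to say about it.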

Read legacy .ppt

python-pptx cannot open .ppt files. Office Oxide reads them directly:

from office_oxide import Document

with Document.open("legacy.ppt") as doc:
    print(doc.plain_text())
    doc.save_as("modern.pptx")    # migrate in one line
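
When a whole folder of legacy decks needs converting, it can help to plan the source and target paths first and run the conversion separately. The sketch below does only the planning step with the standard library; `plan_ppt_conversions` is a hypothetical helper, and the demo uses empty placeholder files rather than real presentations.

```python
from pathlib import Path
import tempfile


def plan_ppt_conversions(root: Path, out_dir: Path) -> list[tuple[Path, Path]]:
    """Pair every legacy .ppt under `root` with a .pptx target in `out_dir`."""
    pairs = []
    for source in sorted(root.rglob("*")):
        # Match the extension case-insensitively; .pptx files are skipped.
        if source.is_file() and source.suffix.lower() == ".ppt":
            pairs.append((source, out_dir / (source.stem + ".pptx")))
    return pairs


# Demonstrate on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.ppt", "b.PPT", "c.pptx"):
        (root / name).touch()
    planned = [(s.name, t.name) for s, t in plan_ppt_conversions(root, root / "out")]
    print(planned)
```

Each planned pair could then go through the Document.open(...) and save_as(...) calls shown above. Targets are flattened into a single directory, so two sources with the same stem in different subfolders would collide and need disambiguating.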

Performance

Library	Mean	p99	Pass rate
office_oxide	0.7 ms	3.9 ms	98.4%
python-pptx	32.5 ms	174 ms	86.7%

At these mean latencies, an ingestion job of 100,000 presentations takes about 54 minutes with python-pptx and about 70 seconds with office_oxide.
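
Those totals are the mean latency multiplied by the file count. A short sketch of the arithmetic, assuming single-threaded processing and ignoring I/O and startup overhead (`total_seconds` is an illustrative helper):

```python
def total_seconds(mean_ms: float, file_count: int) -> float:
    """Total single-threaded time in seconds for `file_count` files."""
    return mean_ms * file_count / 1000.0


FILES = 100_000
for name, mean_ms in (("python-pptx", 32.5), ("office_oxide", 0.7)):
    seconds = total_seconds(mean_ms, FILES)
    print(f"{name}: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

Since the table also lists p99 values, a batch dominated by unusually large decks would take longer than this mean-based estimate for either library.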

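What you lose

EditableDocument covers the templating use case. For richer PPTX construction (new slides, custom layouts, charts, animations), drop down to office_oxide.pptx::create::PptxBuilder, or keep python-pptx for creation and switch to office_oxide for ingestion.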
