What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has a built-in to_markdown() method that preserves headings, tables, lists, and structure, which suits LLM and RAG pipelines. No separate package is needed.
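To illustrate why heading-preserving Markdown is convenient for RAG pipelines, here is a minimal sketch that splits a Markdown string into heading-delimited chunks. The `split_by_heading` helper is hypothetical and not part of Office Oxide; it only assumes the Markdown uses `#`-style headings.

```python
import re
from typing import List, Tuple

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks.

    Text before the first heading is returned with an empty heading.
    Note: '#' lines inside fenced code blocks are not special-cased here.
    """
    chunks: List[Tuple[str, str]] = []
    title, body = "", []

    def flush() -> None:
        # Emit the current chunk unless it is completely empty.
        if title or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks

sample = "# Intro\nHello.\n\n## Details\n| a | b |\n|---|---|\n| 1 | 2 |\n"
for heading, text in split_by_heading(sample):
    print(heading, "->", len(text), "chars")
```

Each chunk keeps its heading, so it can be stored as metadata next to the embedded text.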

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.
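The speedup figures follow directly from the mean times quoted above. A quick check, using only those numbers:

```python
# Mean per-file times (ms) quoted above for the 1,802-file XLSX corpus.
mean_ms = {"office_oxide": 5.0, "python-calamine": 13.9, "openpyxl": 94.5}

def speedup(baseline: str, candidate: str = "office_oxide") -> float:
    """How many times longer `baseline` takes than `candidate`, on average."""
    return mean_ms[baseline] / mean_ms[candidate]

for name in ("python-calamine", "openpyxl"):
    print(f"{name}: {speedup(name):.1f}x the mean time of office_oxide")
```

The openpyxl ratio works out to about 18.9, so the quoted 18× is rounded down.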

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

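Migrating from Apache Tika

Apache Tika is the de facto standard JVM library for extracting text from a huge range of file formats, including DOCX, XLSX, PPTX, and legacy DOC, XLS, and PPT. If your pipeline handles only Office documents, Office Oxide is a suitable replacement: the same six formats, no JVM, native speed, and simpler deployment.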

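When to migrate

Switch if any of the following applies:

- Your ingestion pipeline handles only Office documents (you do not need the PDF, EPUB, RTF, ODT, and other formats that Tika also covers)
- You do not want to bundle and tune a JVM inside a container, Lambda, or desktop app
- You need native bindings for Python, Node.js, Go, C#, or Rust, not just a Java JAR
- Per-file latency matters; Tika's startup overhead and JVM warm-up slow down short-lived workers
- You need structured Markdown / IR output for LLM and RAG pipelines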

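Stay on Tika if:

- You need to ingest long-tail formats that Office Oxide does not yet support (Tika supports roughly 1,400 file types)
- You already run a JVM ingestion service and re-architecting it would cost too much
- You rely on Tika's MIME detection for those long-tail formats

A common compromise: keep Tika for the long-tail formats and hand .docx / .xlsx / .pptx / .doc / .xls / .ppt (the vast majority of most enterprise corpora) to Office Oxide.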
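The hybrid setup amounts to a routing decision per file. A minimal sketch, assuming files can be routed by extension alone; `choose_backend` and the backend labels are illustrative names, not part of either library:

```python
from pathlib import Path

# The six extensions handed to Office Oxide in the hybrid setup.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".doc", ".xls", ".ppt"}

def choose_backend(path: str) -> str:
    """Return 'office_oxide' for Office files and 'tika' for everything else."""
    suffix = Path(path).suffix.lower()
    return "office_oxide" if suffix in OFFICE_EXTENSIONS else "tika"

for name in ["q3-report.DOCX", "budget.xls", "manual.pdf", "notes.epub"]:
    print(name, "->", choose_backend(name))
```

Extension-based routing is the simplest option; pipelines that receive unnamed blobs would need content sniffing instead.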

Installation

Python

pip install office-oxide

(This replaces the tika or apache-tika Python wrappers, along with the JVM they depend on.)

Rust

[dependencies]
office_oxide = "0.1.0"

Java

If you are staying on the JVM, you can call Office Oxide through its C FFI using JNA or JNR-FFI. Alternatively, run office_oxide_cli as a sidecar process over stdio: the same engine, with no JVM bridge code.

Python side-by-side cheat sheet

Plain-text extraction

Tika (REST mode)

import tika
from tika import parser

tika.initVM()    # JVM startup; ~1-2s on first call
parsed = parser.from_file("report.docx")
text = parsed["content"]
metadata = parsed["metadata"]

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    text = doc.plain_text()
    props = doc.as_docx().core_properties()    # author, modified, etc.

No JVM startup, no REST round-trip, sub-millisecond extraction.
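During a migration it can help to keep existing call sites that read `parsed["content"]` unchanged. The sketch below wraps the `Document.open` / `plain_text()` calls shown above in a Tika-shaped dict. `parse_file` is a hypothetical shim, and the metadata is left empty because the return shape of `core_properties()` is not described here.

```python
from typing import Callable, Dict

def office_oxide_text(path: str) -> str:
    """Plain-text extraction using the calls shown in the snippet above."""
    # Imported lazily so this module loads even where office_oxide is absent.
    from office_oxide import Document
    with Document.open(path) as doc:
        return doc.plain_text()

def parse_file(path: str,
               extract_text: Callable[[str], str] = office_oxide_text) -> Dict[str, object]:
    """Hypothetical shim: return a dict with Tika-style 'content' and 'metadata' keys."""
    return {"content": extract_text(path), "metadata": {}}

# Existing code keeps reading parsed["content"]; a stub extractor stands in here.
parsed = parse_file("report.docx", extract_text=lambda path: "Quarterly report")
print(parsed["content"])
```

Passing the extractor as a parameter also makes the shim easy to unit-test without real documents.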

Bytes input (no temporary file needed)

Tika

import io, requests
from tika import parser

data = requests.get(url).content
parsed = parser.from_buffer(io.BytesIO(data))

office_oxide

import requests
from office_oxide import Document

data = requests.get(url).content
with Document.from_bytes(data, "docx") as doc:
    print(doc.plain_text())
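`Document.from_bytes` takes a format string, which is not always known for downloaded bytes. One way to guess it, sketched below with the standard library only, is to look at the container: OOXML files are ZIP archives, while legacy files use the OLE2 compound-file signature. `sniff_office_format` is a hypothetical helper. The `word/`, `xl/`, and `ppt/` prefixes are the conventional part locations rather than a guarantee, and whether `from_bytes` accepts `"xlsx"` or `"pptx"` in the same way as `"docx"` is an assumption.

```python
import io
import zipfile
from typing import Optional

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy DOC/XLS/PPT container
ZIP_MAGIC = b"PK\x03\x04"                        # OOXML files are ZIP archives

# Conventional top-level folders inside OOXML packages.
OOXML_MARKERS = {"word/": "docx", "xl/": "xlsx", "ppt/": "pptx"}

def sniff_office_format(data: bytes) -> Optional[str]:
    """Return 'docx', 'xlsx', 'pptx', 'ole2' (some legacy format), or None."""
    if data.startswith(OLE2_MAGIC):
        # Telling DOC, XLS, and PPT apart needs the compound-file directory;
        # fall back to the file name or Content-Type header for that.
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = archive.namelist()
        except zipfile.BadZipFile:
            return None
        for prefix, fmt in OOXML_MARKERS.items():
            if any(name.startswith(prefix) for name in names):
                return fmt
    return None
```

The result can then be passed as the second argument to `Document.from_bytes` when it is one of the OOXML formats.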

Server / batch processing

Tika: typically run as tika-server behind HTTP.

java -jar tika-server.jar -h 0.0.0.0 -p 9998

import requests

# Stream the file to tika-server and close the handle when the request finishes.
with open("report.docx", "rb") as f:
    text = requests.put("http://localhost:9998/tika", data=f,
                        headers={"Accept": "text/plain"}).text

office_oxide: drop the JVM and the server and call the library directly. If you need a sidecar architecture (language-agnostic clients), use the MCP server or invoke the CLI over stdio.
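Calling the library directly for a batch can be as simple as a worker pool. The sketch below is one possible shape, not a prescribed pattern: `extract_many` is a hypothetical helper, and using threads assumes the native extension releases the GIL while parsing (swap in `ProcessPoolExecutor` if it does not).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple

def extract_with_office_oxide(path: str) -> str:
    """Default extractor, using the Document API shown earlier."""
    from office_oxide import Document  # lazy import; only needed for real runs
    with Document.open(path) as doc:
        return doc.plain_text()

def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], str] = extract_with_office_oxide,
    workers: int = 8,
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Return ({path: text}, {path: error}) so one bad file does not abort the batch."""

    def safe(path: str):
        try:
            return path, extract(path), None
        except Exception as exc:  # record the failure and keep going
            return path, None, exc

    texts: Dict[str, str] = {}
    errors: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for path, text, exc in pool.map(safe, paths):
            if exc is None:
                texts[path] = text
            else:
                errors[path] = exc
    return texts, errors
```

A call such as `extract_many(paths)` returns the extracted texts alongside a separate map of per-file errors for later inspection.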

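For JVM users

If your pipeline is Java/Kotlin/Scala and you do not want to give up the JVM:

Keep sending everything that is not an Office format to Tika.
Route Office formats to office-oxide instead. There are two ways to do this:
- Call liboffice_oxide through JNA / JNR-FFI using the C header include/office_oxide_c/office_oxide.h. This is the same C API that the Go and C# bindings use.
- Launch an office_oxide_cli sidecar with ProcessBuilder, passing input on stdin and reading output from stdout. Restarts are very cheap and crashes are fully isolated.

Both approaches are faster than Tika on Office formats, and they also avoid the oddities of running a JVM inside a JVM.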
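The sidecar pattern has the same shape in any language: write the document to the child's stdin, read text from its stdout, and treat a non-zero exit as a failed file. It is sketched here in Python for consistency with the earlier examples; a ProcessBuilder version would follow the same steps. `run_sidecar` is a hypothetical helper, and because the arguments of office_oxide_cli are not listed here, a stand-in child process is used so that the sketch runs as written.

```python
import subprocess
import sys
from typing import Sequence

def run_sidecar(command: Sequence[str], payload: bytes, timeout: float = 30.0) -> str:
    """Send `payload` on stdin, return decoded stdout, raise on a non-zero exit."""
    result = subprocess.run(
        command, input=payload, capture_output=True, timeout=timeout, check=False
    )
    if result.returncode != 0:
        detail = result.stderr.decode("utf-8", errors="replace")
        raise RuntimeError(f"sidecar exited with {result.returncode}: {detail}")
    return result.stdout.decode("utf-8", errors="replace")

# Stand-in child that upper-cases stdin; replace with the real CLI command line.
stand_in = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]
print(run_sidecar(stand_in, b"hello from stdin"))
```

Because each call is a separate process, a crash or hang on one file surfaces as an exception or timeout in the caller instead of taking down the host.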

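Feature comparison with Tika

Feature	Tika	Office Oxide
DOCX, XLSX, PPTX	✓	✓
Legacy DOC, XLS, PPT	✓	✓
PDF, EPUB, RTF, ODT, etc.	✓	✗ (use pdf_oxide for PDF)
Plain-text extraction	✓	✓
Markdown output	Partial	✓ (built-in `to_markdown`)
Structured IR / JSON	XHTML SAX events	✓ (typed `DocumentIR`)
Find-and-replace templating	✗	✓ (`EditableDocument`)
Cell writes (XLSX)	✗	✓ (`set_cell`)
Legacy → modern conversion	✗	✓ (`save_as`)
Requires a JVM	✓	✗
Native speed	JVM overhead	<1 ms per file for DOCX and PPTX; about 5 ms for XLSX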

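Performance (Office formats only)

Ingesting one million Office documents (a mix of DOCX, XLSX, PPTX, DOC, XLS, and PPT), compared against tika-server and tika-app:

Pipeline	Wall-clock time	Notes
tika-server (REST), 8 workers	~3 h 40 m	Includes HTTP overhead
tika-app (in-process JVM), 8 workers	~1 h 50 m	Tika's best case
office_oxide, 8 workers	~3 m	Native parsing

Exact numbers vary with the format mix; for ingestion-heavy Office workloads the gap is typically 30–60×.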
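For a sense of scale, the wall-clock times in the table can be converted into throughput and per-worker cost per file. This is plain arithmetic on the approximate figures above, so the results are ballpark values only.

```python
FILES = 1_000_000
WORKERS = 8

# Approximate wall-clock times from the table, in minutes.
wall_minutes = {
    "tika-server (REST)": 3 * 60 + 40,
    "tika-app (in-process JVM)": 1 * 60 + 50,
    "office_oxide": 3,
}

def files_per_second(minutes: float) -> float:
    """Aggregate throughput across all workers."""
    return FILES / (minutes * 60)

def worker_ms_per_file(minutes: float) -> float:
    """Average time one worker spends per file, in milliseconds."""
    return minutes * 60 * 1000 * WORKERS / FILES

baseline = wall_minutes["office_oxide"]
for name, minutes in wall_minutes.items():
    print(f"{name}: {files_per_second(minutes):,.0f} files/s, "
          f"{worker_ms_per_file(minutes):.2f} ms per file per worker, "
          f"{minutes / baseline:.1f}x the office_oxide wall time")
```

With these rounded inputs the in-process Tika row comes out near 37× and the REST row near 73×, so the REST figure sits somewhat above the typical band quoted above.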