What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

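Migrating from python-docx

Office Oxide is a drop-in replacement for the most common python-docx use cases: text extraction, paragraph iteration, table reading, and search-and-replace. It runs about 14× faster and, on a 2,538-file DOCX corpus, has a pass rate 3.8 percentage points higher. As a bonus, you no longer need to vendor separate libraries for .xlsx (openpyxl), .pptx (python-pptx), and legacy .doc (catdoc / antiword): a single pip install covers all six formats.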

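When to migrate

Switch if you do any of the following:

Extract text or Markdown from .docx for ingestion, RAG, or search
Run search-and-replace templates across thousands of documents
Read tables from contracts and reports
Handle .xlsx, .pptx, and legacy formats as well without adding dependencies

Stay on python-docx if you do either of the following and are not ready to drop down to the format-specific Rust API:

Build complex DOCX files from scratch with custom styles and themes
Depend on python-docx extension libraries (e.g. docxcompose, python-docx-ng)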

Installation

pip uninstall python-docx
pip install office-oxide

The PyPI distribution name is office-oxide (hyphen); the import name is office_oxide (underscore).
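
Because the two names differ, a quick check after installing can show which of them the current environment actually resolves. The sketch below uses only the standard library; the helper name check_install is hypothetical, and it reports what is installed rather than anything specific to Office Oxide.

```python
from importlib import metadata, util


def check_install(dist_name="office-oxide", module_name="office_oxide"):
    """Return (installed_version_or_None, module_is_importable)."""
    try:
        # Looks up the PyPI distribution name (hyphenated).
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    # Looks up the import name (underscored) without importing it.
    importable = util.find_spec(module_name) is not None
    return version, importable


if __name__ == "__main__":
    version, importable = check_install()
    print(f"office-oxide distribution: {version or 'not installed'}")
    print(f"office_oxide importable:   {importable}")
```

A result such as (None, False) would suggest the install step did not run in the interpreter being used.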

Side-by-side cheat sheet

Plain text

python-docx

from docx import Document

doc = Document("report.docx")
text = "\n".join(p.text for p in doc.paragraphs)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    text = doc.plain_text()

One method call, headers and footers included, and about 14× faster.

Markdown / HTML

python-docx: no built-in Markdown or HTML output; you have to rely on pandoc or mammoth, or write your own converter.

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    md   = doc.to_markdown()
    html = doc.to_html()
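
For ingestion or RAG, a common next step is to split the Markdown into heading-scoped chunks. This is a minimal sketch on a plain string: the sample is hand-written and stands in for the result of doc.to_markdown(), the function name split_by_heading is hypothetical, and it assumes the output uses ATX-style # headings, which the text above does not specify.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown):
    """Split Markdown into (heading, body) chunks, one per ATX heading.

    Text before the first heading is returned with heading None.
    Note: a '#' line inside a fenced code block would also be treated
    as a heading by this simple sketch.
    """
    chunks = []
    title, body = None, []

    def flush():
        # Emit the current chunk unless it is completely empty.
        if title is not None or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


# Hand-written sample; in practice this would be the string from doc.to_markdown().
sample = "# Q4 Report\n\nRevenue grew 18%.\n\n## Outlook\n\nStable.\n"
for heading, text in split_by_heading(sample):
    print(heading, "->", text)
```

Each chunk keeps its heading next to its body, which tends to make retrieved passages easier to attribute to a section.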

Iterating paragraphs

python-docx

from docx import Document

doc = Document("report.docx")
for p in doc.paragraphs:
    print(p.style.name, p.text)

office_oxide (via the IR)

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Heading":
            print(f"H{el['level']}", el["text"])
        elif el["kind"] == "Paragraph":
            print("P", " ".join(r["text"] for r in el["runs"]))
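
The same traversal can be wrapped in a generator so callers receive (label, text) pairs instead of printed output. The sketch relies only on the IR keys used above (sections, elements, kind, level, text, runs); the sample dictionary is hand-written for illustration and is not real library output.

```python
def iter_blocks(ir):
    """Yield (label, text) for each heading and paragraph in an IR dict."""
    for section in ir["sections"]:
        for el in section["elements"]:
            if el["kind"] == "Heading":
                yield f"H{el['level']}", el["text"]
            elif el["kind"] == "Paragraph":
                # Join run texts the same way as the snippet above.
                yield "P", " ".join(run["text"] for run in el["runs"])


# Hypothetical IR fragment shaped like the dict returned by doc.to_ir() above.
sample_ir = {
    "sections": [
        {
            "elements": [
                {"kind": "Heading", "level": 1, "text": "Q4 Report"},
                {"kind": "Paragraph", "runs": [{"text": "Revenue grew"}, {"text": "18%."}]},
                {"kind": "Table", "rows": []},  # skipped: neither heading nor paragraph
            ]
        }
    ]
}

for label, text in iter_blocks(sample_ir):
    print(label, text)
```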

Iterating tables

python-docx

from docx import Document

doc = Document("report.docx")
for table in doc.tables:
    for row in table.rows:
        cells = [cell.text for cell in row.cells]
        print(cells)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Table":
            for row in el["rows"]:
                print(row)
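
To keep the tables rather than print them, the rows can be collected and serialized as CSV. This sketch assumes each row is a list of cell strings, which the snippet above does not confirm; if rows turn out to be richer objects, a conversion step would be needed before writing. The helper names and the sample IR are hypothetical.

```python
import csv
import io


def iter_tables(ir):
    """Yield the rows of every Table element in an IR dict."""
    for section in ir["sections"]:
        for el in section["elements"]:
            if el["kind"] == "Table":
                yield el["rows"]


def table_to_csv(rows):
    """Serialize one table to CSV text (assumes rows are lists of strings)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hand-written IR fragment for illustration; the row shape is an assumption.
sample_ir = {
    "sections": [
        {
            "elements": [
                {"kind": "Heading", "level": 1, "text": "Fees"},
                {"kind": "Table", "rows": [["Item", "Amount"], ["Licenses", "1,200"]]},
            ]
        }
    ]
}

for rows in iter_tables(sample_ir):
    print(table_to_csv(rows))
```

The csv module quotes cells that contain commas, so values such as "1,200" survive a round trip.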

Search and replace (templating)

python-docx: no first-class API. The usual pattern walks every run and rewrites its text, which breaks when a match spans runs. Many users vendor docx-mailmerge or write fragile regexes.

office_oxide

from office_oxide import EditableDocument

with EditableDocument.open("template.docx") as ed:
    n = ed.replace_text("{{client_name}}", "Acme Corp")
    print(f"{n} replacements made")
    ed.save("filled.docx")

replace_text handles cross-run matches transparently and preserves every unmodified OPC part (images, charts, styles).
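
The cross-run problem is easy to reproduce without any document library. In the sketch below a paragraph is modeled as a plain list of run strings; it illustrates the failure mode and one possible workaround, and it is not a description of how replace_text is implemented. Both function names are hypothetical.

```python
def replace_per_run(runs, old, new):
    """Naive approach: replace inside each run independently."""
    return [run.replace(old, new) for run in runs]


def replace_across_runs(runs, old, new):
    """Replace the first occurrence of `old`, even if it spans several runs.

    The replacement text goes into the run where the match starts; the
    matched characters are removed from every later run they touch.
    """
    joined = "".join(runs)
    start = joined.find(old)
    if start == -1:
        return list(runs)
    end = start + len(old)

    result, offset = [], 0
    for run in runs:
        run_start, run_end = offset, offset + len(run)
        offset = run_end
        if run_end <= start or run_start >= end:
            result.append(run)  # run lies entirely outside the match
            continue
        keep_before = run[: max(start - run_start, 0)]
        keep_after = run[max(end - run_start, 0):] if run_end > end else ""
        inserted = new if run_start <= start < run_end else ""
        result.append(keep_before + inserted + keep_after)
    return result


# A placeholder split across two runs, as a word processor might store it.
runs = ["Dear {{client", "_name}}, hello"]
print(replace_per_run(runs, "{{client_name}}", "Acme Corp"))
print(replace_across_runs(runs, "{{client_name}}", "Acme Corp"))
```

The per-run version leaves the runs unchanged because neither run contains the whole placeholder, while the second version finds the match in the joined text and maps it back onto the runs.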

Reading core properties

python-docx

from docx import Document

doc = Document("report.docx")
print(doc.core_properties.author)
print(doc.core_properties.modified)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    props = doc.as_docx().core_properties()
    print(props.author)
    print(props.modified)

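What office_oxide does not currently expose

The unified EditableDocument covers templating use cases. For richer DOCX construction (adding paragraphs, building tables programmatically, applying named styles), drop down to the format-specific module: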

from office_oxide.docx import DocxBuilder

builder = DocxBuilder()
builder.add_heading("Q4 Report", level=1)
builder.add_paragraph("Revenue grew 18%.")
builder.save("report.docx")

Alternatively, generate the file from the IR with create_from_ir(ir, "docx", "report.docx"). See "Building from IR".

Performance

Same 2,538-file corpus, single-threaded:

Library	Mean	p99	Pass rate
office_oxide	0.8 ms	3.9 ms	98.9%
python-docx	11.8 ms	98 ms	95.1%

Ingesting one million documents takes 3 hours 16 minutes with python-docx; on the same hardware, office_oxide finishes in 14 minutes.
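
The projection follows from the mean latencies in the table. A quick check, assuming strictly single-threaded processing with no I/O or startup overhead (the function names are hypothetical):

```python
def projected_seconds(mean_ms, n_docs=1_000_000):
    """Total wall time in seconds for n_docs at a given mean latency."""
    return mean_ms * n_docs / 1000


def format_duration(seconds):
    """Render a duration as hours, minutes, and seconds."""
    total_minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes} min {secs} s"


# Mean latencies (ms) taken from the table above.
for name, mean_ms in [("python-docx", 11.8), ("office_oxide", 0.8)]:
    print(name, format_duration(projected_seconds(mean_ms)))
```

With the rounded means this gives 3 h 16 min 40 s for python-docx and 13 min 20 s for office_oxide. The small gap from the quoted 14 minutes is presumably because the published figure was computed from unrounded averages.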
