What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest of the libraries benchmarked here. DOCX text extraction averages 0.8 ms (vs 11.8 ms for python-docx, 14× faster). XLSX averages 5.0 ms (vs 94.5 ms for openpyxl, 18× faster). PPTX averages 0.7 ms (vs 32.5 ms for python-pptx, 46× faster). Benchmarked on 6,062 real-world files.
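
As a quick sanity check, the multipliers quoted above can be recomputed from the mean timings. This is only arithmetic on the published averages, not a re-run of the benchmark; the `speedup` helper name is made up for this sketch.

```python
import math

# (format, baseline library, baseline mean in ms, Office Oxide mean in ms),
# copied from the figures quoted in the answer above.
BENCHMARKS = [
    ("DOCX", "python-docx", 11.8, 0.8),
    ("XLSX", "openpyxl", 94.5, 5.0),
    ("PPTX", "python-pptx", 32.5, 0.7),
]


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


for fmt, library, baseline_ms, oxide_ms in BENCHMARKS:
    ratio = speedup(baseline_ms, oxide_ms)
    # The quoted multipliers are the ratios truncated to whole numbers.
    print(f"{fmt}: {math.floor(ratio)}x faster than {library}")
```

The ratios compare averages only; tail latency (p99) and pass rate are reported separately.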

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.
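
For a RAG pipeline, Markdown output is typically split into chunks before embedding. The sketch below shows one simple approach, splitting at headings. It is plain Python over a string, assumes the Markdown uses ATX-style `#` headings (not confirmed here for `to_markdown()`), and `split_by_heading` is a hypothetical helper, not part of Office Oxide.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^#{1,6}\s")


def split_by_heading(markdown: str) -> list[str]:
    """Split Markdown text into chunks, starting a new chunk at each heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


# Stand-in for the string a Markdown conversion would return.
sample = "# Report\n\nIntro text.\n\n## Revenue\n\nUp 18%.\n"
chunks = split_by_heading(sample)
print(len(chunks))
```

This sketch does not special-case fenced code blocks, so a `#` comment line inside a fence would start a new chunk.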

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

Migrating from python-docx

Office Oxide is a drop-in replacement for the most common python-docx scenarios (text extraction, paragraph traversal, table reading, find-and-replace), 14× faster and with a pass rate 3.8 percentage points higher on a corpus of 2,538 DOCX files. As a bonus, you no longer need to carry separate libraries for .xlsx (openpyxl), .pptx (python-pptx), and legacy .doc (catdoc / antiword): one pip install covers all six formats.

When to migrate

Switch if you do any of the following:

Extract text or Markdown from .docx for ingestion / RAG / search
Run find-and-replace templates across thousands of documents
Read tables from contracts or reports
Want to process .xlsx, .pptx, or legacy formats as well without new dependencies

Stay on python-docx if you do any of the following and are not ready to drop down into the format-specific Rust API:

Build complex DOCX files from scratch with custom styles and themes
Need python-docx extensions (for example, docxcompose, python-docx-ng)

Installation

pip uninstall python-docx
pip install office-oxide

The distribution name on PyPI is office-oxide (with a hyphen); the import name is office_oxide (with an underscore).
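
The two names are looked up through different standard-library mechanisms, which is an easy place to trip up after a migration. A minimal sketch using only the standard library; `install_status` is a hypothetical helper name, and the code runs whether or not the package is installed.

```python
from importlib import metadata, util

DIST_NAME = "office-oxide"    # name used with pip / PyPI
IMPORT_NAME = "office_oxide"  # name used in import statements


def install_status() -> dict:
    """Report the installed distribution version and whether the module is importable."""
    try:
        version = metadata.version(DIST_NAME)
    except metadata.PackageNotFoundError:
        version = None  # distribution is not installed in this environment
    importable = util.find_spec(IMPORT_NAME) is not None
    return {"dist_version": version, "importable": importable}


print(install_status())
```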

Side-by-side cheat sheet

Plain text

python-docx

from docx import Document

doc = Document("report.docx")
text = "\n".join(p.text for p in doc.paragraphs)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    text = doc.plain_text()

One method call, headers and footers included, about 14× faster.

Markdown / HTML

python-docx has no built-in Markdown or HTML output; you would have to reach for pandoc, mammoth, or write your own converter.

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    md   = doc.to_markdown()
    html = doc.to_html()

Iterating over paragraphs

python-docx

from docx import Document

doc = Document("report.docx")
for p in doc.paragraphs:
    print(p.style.name, p.text)

office_oxide (via the IR)

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Heading":
            print(f"H{el['level']}", el["text"])
        elif el["kind"] == "Paragraph":
            print("P", " ".join(r["text"] for r in el["runs"]))
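
Because the IR is plain nested dicts and lists, the same traversal can be wrapped in small helpers and exercised without a document on disk. The sketch below builds an indented outline from headings. `SAMPLE_IR` is a hand-written stand-in that uses only the keys seen in the snippet above (`sections`, `elements`, `kind`, `level`, `text`, `runs`); the real IR may carry more fields, and `outline` is a hypothetical helper.

```python
def outline(ir: dict) -> list[str]:
    """Collect headings from the IR, indented two spaces per heading level."""
    lines = []
    for section in ir["sections"]:
        for el in section["elements"]:
            if el["kind"] == "Heading":
                lines.append("  " * (el["level"] - 1) + el["text"])
    return lines


# Hand-built sample in the shape used above; not produced by office_oxide.
SAMPLE_IR = {
    "sections": [
        {
            "elements": [
                {"kind": "Heading", "level": 1, "text": "Q4 report"},
                {"kind": "Paragraph", "runs": [{"text": "Revenue grew"}, {"text": "18%."}]},
                {"kind": "Heading", "level": 2, "text": "Outlook"},
            ]
        }
    ]
}

toc = outline(SAMPLE_IR)
print("\n".join(toc))
```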

Iterating over tables

python-docx

from docx import Document

doc = Document("report.docx")
for table in doc.tables:
    for row in table.rows:
        cells = [cell.text for cell in row.cells]
        print(cells)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Table":
            for row in el["rows"]:
                print(row)
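
A common next step is exporting each table for downstream tools. The sketch below writes rows to CSV with the standard library. It assumes each row is a list of cell strings, which the snippet above does not confirm; if IR rows are richer objects, map them to strings first. `table_to_csv` is a hypothetical helper.

```python
import csv
import io


def table_to_csv(rows) -> str:
    """Serialize a table (assumed: list of rows, each a list of cell strings) as CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hand-written sample rows; not produced by office_oxide.
rows = [["Item", "Amount"], ["Licenses", "1,200"]]
csv_text = table_to_csv(rows)
print(csv_text)
```

The `csv` module quotes cells that contain commas, so values such as "1,200" survive the round trip.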

Find and replace (templating)

python-docx has no first-class API for this; the typical pattern is to walk every run and rewrite its text, which breaks on matches that span runs. Many people pull in docx-mailmerge or write brittle regexes.

office_oxide

from office_oxide import EditableDocument

with EditableDocument.open("template.docx") as ed:
    n = ed.replace_text("{{client_name}}", "Acme Corp")
    print(f"{n} replacements")
    ed.save("filled.docx")

replace_text handles cross-run matches transparently and preserves all unmodified OPC parts (images, charts, styles).
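
To see why cross-run matches matter, consider a placeholder that the word processor has split across two runs. The sketch below models runs as plain strings. It illustrates the problem only and says nothing about how replace_text is implemented; both function names are made up for this example.

```python
def replace_per_run(runs, old, new):
    """Naive approach: replace inside each run separately."""
    count = sum(run.count(old) for run in runs)
    return [run.replace(old, new) for run in runs], count


def replace_across_runs(runs, old, new):
    """Match against the concatenated text, so run boundaries do not hide a match."""
    joined = "".join(runs)
    return joined.replace(old, new), joined.count(old)


# The placeholder {{client_name}} is split between the first two runs.
runs = ["Dear {{client", "_name}}", ", welcome."]

per_run_result, per_run_count = replace_per_run(runs, "{{client_name}}", "Acme Corp")
joined_result, joined_count = replace_across_runs(runs, "{{client_name}}", "Acme Corp")

print(per_run_count)  # the naive pass finds nothing
print(joined_result)
```

Joining the runs finds the match but throws away run boundaries, and with them per-run formatting. A real implementation has to map each match back onto the original runs.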

Reading core properties

python-docx

from docx import Document

doc = Document("report.docx")
print(doc.core_properties.author)
print(doc.core_properties.modified)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    props = doc.as_docx().core_properties()
    print(props.author)
    print(props.modified)

What office_oxide does not currently expose

The unified EditableDocument covers templating. For richer DOCX construction (adding paragraphs, building tables programmatically, applying named styles), drop down into the format-specific module:

from office_oxide.docx import DocxBuilder

builder = DocxBuilder()
builder.add_heading("Q4 report", level=1)
builder.add_paragraph("Revenue grew 18%.")
builder.save("report.docx")

Or generate from the IR with create_from_ir(ir, "docx", "report.docx"). See Building from IR.

Performance

Same 2,538-file corpus, single-threaded:

Library	Mean	p99	Pass rate
office_oxide	0.8 ms	3.9 ms	98.9%
python-docx	11.8 ms	98 ms	95.1%

Ingesting a million documents, which takes about 3 h 16 min with python-docx, finishes in about 13 minutes with office_oxide on the same hardware.
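
The million-document estimate follows directly from the mean times in the table. A small sketch of the arithmetic, assuming single-threaded processing at the mean per-file time and ignoring I/O and scheduling overhead; `total_time` is a hypothetical helper.

```python
def total_time(mean_ms: float, n_docs: int) -> tuple[int, int, int]:
    """Convert a mean per-document time into (hours, minutes, seconds) for n_docs."""
    total_seconds = round(mean_ms * n_docs / 1000)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


print(total_time(11.8, 1_000_000))  # python-docx: (3, 16, 40)
print(total_time(0.8, 1_000_000))   # office_oxide: (0, 13, 20)
```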

See also

Migrating from openpyxl: XLSX
Migrating from python-pptx: PPTX
Performance benchmarks: full numbers
Editing overview: what EditableDocument preserves