What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.
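
The speedup figures follow from the mean times quoted above. A minimal sketch of the arithmetic, where the dictionary and the `speedup` helper are illustrative names and not part of any library:

```python
# Mean XLSX extraction times in milliseconds, copied from the answer above.
MEAN_MS = {
    "office_oxide": 5.0,
    "python-calamine": 13.9,
    "openpyxl": 94.5,
}

def speedup(baseline: str, candidate: str = "office_oxide") -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return MEAN_MS[baseline] / MEAN_MS[candidate]

for name in ("python-calamine", "openpyxl"):
    print(f"office_oxide vs {name}: {speedup(name):.1f}x")
```

The openpyxl ratio comes out just under 19, so the quoted 18× appears to be rounded down.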

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

Migrating from python-docx

Office Oxide is a drop-in replacement for the most common python-docx use cases (text extraction, paragraph iteration, table reading, find-and-replace). It is 14× faster and has a pass rate 3.8 percentage points higher on a corpus of 2,538 DOCX files. As a bonus, you no longer need separate libraries for .xlsx (openpyxl), .pptx (python-pptx), and legacy .doc (catdoc / antiword): a single pip install covers all six formats.

When to migrate

Switch if you do any of the following:

Extract text or Markdown from .docx for ingestion, RAG, or search
Run find-and-replace templates across thousands of documents
Read tables from contracts or reports
Want to handle .xlsx, .pptx, or legacy formats as well without adding dependencies

Stay on python-docx if you do any of the following and are not ready to drop down to the format-specific Rust API:

Build complex DOCX files from scratch with custom styles and themes
Need python-docx extensions (for example, docxcompose or python-docx-ng)

Installation

pip uninstall python-docx
pip install office-oxide

The distribution name on PyPI is office-oxide (with a hyphen); the import name is office_oxide (with an underscore).
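
One way to confirm the install went through is to look up both names from Python. This is a minimal sketch using only the standard library. The helper name `check_install` is made up for illustration, and the code assumes nothing about Office Oxide beyond the two names given above.

```python
from importlib import metadata, util
from typing import Optional

def check_install(dist_name: str, module_name: str) -> Optional[str]:
    """Return the installed version of `dist_name`, or None if the
    distribution is missing or `module_name` cannot be found."""
    if util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# The distribution name uses a hyphen, the import name an underscore.
print(check_install("office-oxide", "office_oxide"))
```

The call prints a version string when the package is installed and None otherwise.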

Side-by-side cheat sheet

Plain text

python-docx

from docx import Document

doc = Document("report.docx")
text = "\n".join(p.text for p in doc.paragraphs)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    text = doc.plain_text()

One method call, headers and footers included, and roughly 14× faster.

Markdown / HTML

python-docx has no built-in Markdown or HTML output; you would have to bring in pandoc or mammoth, or write your own converter.

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    md   = doc.to_markdown()
    html = doc.to_html()

Iterating over paragraphs

python-docx

from docx import Document

doc = Document("report.docx")
for p in doc.paragraphs:
    print(p.style.name, p.text)

office_oxide (via the IR)

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Heading":
            print(f"H{el['level']}", el["text"])
        elif el["kind"] == "Paragraph":
            print("P", " ".join(r["text"] for r in el["runs"]))
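
The same traversal can collect a heading outline instead of printing each element. The sketch below runs on a hand-written dictionary that mimics only the keys used in the snippet above (`sections`, `elements`, `kind`, `level`, `text`, `runs`). The real IR may carry more fields, and `outline` is a hypothetical helper, not part of office_oxide.

```python
from typing import Any

def outline(ir: dict[str, Any]) -> list[tuple[int, str]]:
    """Collect (level, text) pairs for every heading, in document order."""
    headings = []
    for section in ir["sections"]:
        for el in section["elements"]:
            if el["kind"] == "Heading":
                headings.append((el["level"], el["text"]))
    return headings

# Hand-written stand-in for doc.to_ir(); the shape is assumed from the snippet above.
sample_ir = {
    "sections": [
        {
            "elements": [
                {"kind": "Heading", "level": 1, "text": "Summary"},
                {"kind": "Paragraph", "runs": [{"text": "Revenue"}, {"text": "grew."}]},
                {"kind": "Heading", "level": 2, "text": "Details"},
            ]
        }
    ]
}

# Indent each heading by its level to show the document structure.
for level, text in outline(sample_ir):
    print("  " * (level - 1) + text)
```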

Iterating over tables

python-docx

from docx import Document

doc = Document("report.docx")
for table in doc.tables:
    for row in table.rows:
        cells = [cell.text for cell in row.cells]
        print(cells)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Table":
            for row in el["rows"]:
                print(row)
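
A common next step is to export each table as CSV. The sketch below assumes that every row in the IR is a plain list of cell strings. The source snippet only prints rows, so that shape is a guess and should be checked against the real output of to_ir(). `tables_to_csv` is a hypothetical helper, not part of office_oxide.

```python
import csv
import io
from typing import Any

def tables_to_csv(ir: dict[str, Any]) -> list[str]:
    """Render every Table element as one CSV string."""
    out = []
    for section in ir["sections"]:
        for el in section["elements"]:
            if el["kind"] != "Table":
                continue
            buf = io.StringIO()
            writer = csv.writer(buf, lineterminator="\n")
            for row in el["rows"]:
                # ASSUMPTION: a row is a list of plain cell strings.
                writer.writerow(row)
            out.append(buf.getvalue())
    return out

# Hand-written stand-in for doc.to_ir() with a single two-row table.
sample_ir = {
    "sections": [
        {"elements": [
            {"kind": "Table", "rows": [["Item", "Price"], ["Widget, large", "9.50"]]},
        ]}
    ]
}
print(tables_to_csv(sample_ir)[0])
```

Going through the csv module rather than joining on commas keeps cells that contain commas or quotes intact.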

Find and replace (templating)

python-docx has no first-class API for this. The typical pattern is to walk every run and rewrite its text, which breaks when a match spans multiple runs. Many projects pull in docx-mailmerge or write brittle regexes.

office_oxide

from office_oxide import EditableDocument

with EditableDocument.open("template.docx") as ed:
    n = ed.replace_text("{{client_name}}", "Acme Corp")
    print(f"{n} replacements")
    ed.save("filled.docx")

replace_text handles cross-run matches transparently and preserves all untouched OPC parts (images, charts, styles).
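
To see why cross-run matches are the hard part, consider a placeholder that the word processor has split across several runs. The sketch below models runs as plain strings. It illustrates the problem and one simple way around it, and it is not office_oxide's actual implementation. `naive_replace` and `cross_run_replace` are hypothetical names.

```python
def naive_replace(runs: list[str], old: str, new: str) -> tuple[list[str], int]:
    """Replace within each run separately; misses matches split across runs."""
    count = sum(r.count(old) for r in runs)
    return [r.replace(old, new) for r in runs], count

def cross_run_replace(runs: list[str], old: str, new: str) -> tuple[list[str], int]:
    """Match on the joined text, then put the result in the first run.

    Simplification: the formatting of later runs is dropped, which a
    real editor would try to preserve.
    """
    joined = "".join(runs)
    count = joined.count(old)
    if count == 0:
        return list(runs), 0
    return [joined.replace(old, new)] + [""] * (len(runs) - 1), count

# A placeholder split into three runs, as often happens after manual editing.
runs = ["Dear {{client", "_na", "me}},"]
print(naive_replace(runs, "{{client_name}}", "Acme Corp"))
print(cross_run_replace(runs, "{{client_name}}", "Acme Corp"))
```

The naive version reports zero replacements because no single run contains the whole placeholder, while the joined-text version finds it once.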

Reading core properties

python-docx

from docx import Document

doc = Document("report.docx")
print(doc.core_properties.author)
print(doc.core_properties.modified)

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    props = doc.as_docx().core_properties()
    print(props.author)
    print(props.modified)

What office_oxide does not expose yet

The unified EditableDocument covers templating. For richer DOCX construction (adding paragraphs, building tables programmatically, applying named styles), drop down to the format-specific module:

from office_oxide.docx import DocxBuilder

builder = DocxBuilder()
builder.add_heading("Q4 report", level=1)
builder.add_paragraph("Revenue grew by 18%.")
builder.save("report.docx")

Alternatively, generate the file from the IR with create_from_ir(ir, "docx", "report.docx"). See Building from IR.

Performance

Same 2,538-file corpus, single-threaded:

Library	Mean	p99	Pass rate
office_oxide	0.8 ms	3.9 ms	98.9%
python-docx	11.8 ms	98 ms	95.1%

Extrapolating from the mean times, ingesting a million documents takes python-docx 3 h 16 min, while office_oxide finishes in roughly 14 minutes on the same hardware.
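
The million-document estimate is a straight multiplication of the mean times in the table. A minimal sketch of that arithmetic, assuming the per-document cost stays at the mean and processing is single-threaded; `total_time` is an illustrative helper name:

```python
def total_time(mean_ms: float, n_docs: int) -> tuple[int, int]:
    """Return (hours, minutes) for n_docs at mean_ms each; seconds are truncated."""
    total_seconds = int(mean_ms * n_docs / 1000)
    hours, rem = divmod(total_seconds, 3600)
    return hours, rem // 60

print("python-docx:", total_time(11.8, 1_000_000))
print("office_oxide:", total_time(0.8, 1_000_000))
```

With the rounded 0.8 ms mean, the office_oxide estimate comes to a little over 13 minutes, so the quoted figure of about 14 minutes presumably reflects an unrounded mean slightly above 0.8 ms.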

See also

Migrating from openpyxl (XLSX)
Migrating from python-pptx (PPTX)
Performance benchmarks (full numbers)
Editing overview (what EditableDocument preserves)