Question 1

What is the fastest Python library for DOCX, XLSX, and PPTX?

Accepted Answer

Office Oxide is the fastest of the libraries benchmarked. DOCX text extraction averages 0.8 ms (vs 11.8 ms for python-docx, 14× faster). XLSX averages 5.0 ms (vs 94.5 ms for openpyxl, 18× faster). PPTX averages 0.7 ms (vs 32.5 ms for python-pptx, 46× faster). Benchmarked on 6,062 real-world files.
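The multipliers quoted above can be checked from the stated averages. The sketch below recomputes them; the figures are copied from the answer rather than re-measured, and the table layout and function name are choices made for this example only. It suggests the quoted 14×, 18×, and 46× are the raw ratios rounded down.

```python
import math

# Average extraction times in milliseconds, copied from the answer above.
# Each entry is (Office Oxide average, comparison library, its average).
AVERAGES_MS = {
    "DOCX": (0.8, "python-docx", 11.8),
    "XLSX": (5.0, "openpyxl", 94.5),
    "PPTX": (0.7, "python-pptx", 32.5),
}

def speedup(fmt: str) -> float:
    """Return how many times longer the comparison library takes on average."""
    oxide_ms, _name, other_ms = AVERAGES_MS[fmt]
    return other_ms / oxide_ms

for fmt, (_, name, _) in AVERAGES_MS.items():
    ratio = speedup(fmt)
    print(f"{fmt}: {ratio:.2f}x vs {name} (rounded down: {math.floor(ratio)}x)")
```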

Question 2

Is Office Oxide free for commercial use?

Accepted Answer

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Question 3

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Accepted Answer

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).
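The two families differ at the container level: DOCX, XLSX, and PPTX are ZIP archives, while legacy DOC, XLS, and PPT are Compound File Binary (CFB) files. As a rough illustration, the sketch below tells them apart by their leading magic bytes. It is not taken from Office Oxide, and the function name and return labels are hypothetical.

```python
# Leading signatures of the two container formats.
ZIP_MAGIC = b"PK\x03\x04"                        # OOXML: DOCX, XLSX, PPTX
CFB_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy: DOC, XLS, PPT

def container_kind(data: bytes) -> str:
    """Classify a file by its first bytes; the labels are illustrative only."""
    if data.startswith(ZIP_MAGIC):
        return "ooxml-zip"
    if data.startswith(CFB_MAGIC):
        return "legacy-cfb"
    return "unknown"

print(container_kind(b"PK\x03\x04" + b"\x00" * 16))  # ooxml-zip
print(container_kind(CFB_MAGIC + b"\x00" * 16))      # legacy-cfb
```

The signature identifies only the container, so telling DOC from XLS or PPT still requires looking at the streams inside the file.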

Question 4

Can Office Oxide convert documents to Markdown?

Accepted Answer

Yes. Every supported format has a built-in to_markdown() that preserves headings, tables, lists, and structure, which makes it well suited to LLM and RAG pipelines. No separate package is needed.

Question 5

How does Office Oxide compare to calamine and openpyxl for XLSX?

Accepted Answer

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.

Question 6

Does Office Oxide work in the browser?

Accepted Answer

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

Format	Output
DOCX	Body text in document order, plus headers and footers; soft hyphens are stripped
XLSX	Cell values from every sheet, tab-separated within each row, with a blank line between sheets
PPTX	Slide title, body placeholders, table cells, and notes; one paragraph block per slide
DOC	Same shape as DOCX; parsed directly from the CFB piece table
XLS	Same shape as XLSX; parsed directly from BIFF8 records
PPT	Same shape as PPTX; parsed from the PowerPoint Document stream
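The XLSX and XLS output shape (tab-separated cells, blank line between sheets) is simple enough to split back into rows and sheets. The sketch below is a minimal example under two assumptions not stated in the table: rows are separated by single newlines, and no sheet contains a completely empty row. The function name and sample text are made up for illustration.

```python
def split_sheets(text: str) -> list[list[list[str]]]:
    """Split extracted workbook text into sheets -> rows -> cell strings."""
    sheets = []
    # A blank line (two consecutive newlines) separates one sheet from the next.
    for block in text.strip("\n").split("\n\n"):
        rows = [line.split("\t") for line in block.split("\n")]
        sheets.append(rows)
    return sheets

# Hypothetical extracted text for a two-sheet workbook.
sample = "name\tqty\nbolt\t4\n\nregion\ttotal\nEU\t12"
sheets = split_sheets(sample)
print(len(sheets))    # 2
print(sheets[0][1])   # ['bolt', '4']
```

Cells that themselves contain tabs or newlines would break this naive split, so it is only suitable for quick inspection.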

Format	Mean	p99	Pass rate
DOCX (2,538 files)	0.8 ms	3.9 ms	98.9%
XLSX (1,802 files)	5.0 ms	40 ms	97.8%
PPTX (806 files)	0.7 ms	3.9 ms	98.4%
DOC (246 files)	0.3 ms	3.4 ms	94.7%
XLS (494 files)	2.8 ms	75 ms	99.2%
PPT (176 files)	0.7 ms	6.6 ms	100%
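For readers who want comparable numbers on their own corpus, the sketch below shows one way to compute a mean, a p99, and a pass rate for any extraction function. The benchmark methodology is not described here, so several choices are assumptions: a file passes if extraction raises no exception, failed files are excluded from the timings, and p99 uses the nearest-rank method. The `summarize` helper and the toy extractor are hypothetical.

```python
import math
import time
from typing import Callable, Iterable

def summarize(extract: Callable[[bytes], str], blobs: Iterable[bytes]) -> dict:
    """Time `extract` on each input and report mean, p99, and pass rate."""
    timings_ms = []
    total = 0
    for data in blobs:
        total += 1
        start = time.perf_counter()
        try:
            extract(data)
        except Exception:
            continue  # assumption: a raised exception counts as a failed file
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    if not timings_ms:
        return {"mean_ms": None, "p99_ms": None,
                "pass_rate": 0.0 if total else None}

    timings_ms.sort()
    rank = math.ceil(0.99 * len(timings_ms))  # nearest-rank percentile, 1-based
    return {
        "mean_ms": sum(timings_ms) / len(timings_ms),
        "p99_ms": timings_ms[rank - 1],
        "pass_rate": len(timings_ms) / total,
    }

# Toy extractor standing in for a real library call: UTF-8 decoding.
stats = summarize(lambda b: b.decode("utf-8"), [b"ok", b"\xff", b"fine", b"x"])
print(f"pass rate: {stats['pass_rate']:.0%}")  # pass rate: 75%
```

To measure a real library, pass a function that takes the file's bytes and returns its text in place of the toy extractor.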

Extracting text from Office documents

One-shot helper

Reusing a handle

What you get for each format

From bytes (no temp file)
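As a library-independent illustration of what parsing from bytes without a temp file involves, the sketch below reads DOCX paragraph text from an in-memory buffer using only Python's standard library. It is not Office Oxide's API, and `docx_text_from_bytes` is a hypothetical name. It handles body paragraphs only and ignores headers, footers, tabs, and line breaks.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_text_from_bytes(data: bytes) -> str:
    """Extract paragraph text from DOCX bytes held entirely in memory."""
    # BytesIO gives zipfile a seekable file-like object, so nothing touches disk.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Concatenate the text runs (w:t elements) inside each paragraph.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The same pattern applies to bytes received from an upload or an object store: wrap them in `io.BytesIO` and hand them to the parser directly.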

Performance

See also