What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.
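The speedup figures follow directly from the averages quoted in this answer. As a quick check, the sketch below divides each baseline average by the Office Oxide average. The dictionary layout and function name are illustrative only; the timings are copied from the answer above.

```python
# Average extraction times in milliseconds, as quoted in the answer above.
# Each entry: format -> (Office Oxide average, baseline library, baseline average).
AVERAGES_MS = {
    "DOCX": (0.8, "python-docx", 11.8),
    "XLSX": (5.0, "openpyxl", 94.5),
    "PPTX": (0.7, "python-pptx", 32.5),
}


def speedup(oxide_ms, baseline_ms):
    """Return how many times faster the Office Oxide average is."""
    return baseline_ms / oxide_ms


for fmt, (oxide_ms, baseline, baseline_ms) in AVERAGES_MS.items():
    ratio = speedup(oxide_ms, baseline_ms)
    # int() truncates, which matches how the quoted multipliers are rounded.
    print(f"{fmt}: about {int(ratio)}x faster than {baseline}")
```

The quoted 14×, 18×, and 46× are these ratios rounded down.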

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.
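For a RAG pipeline, the Markdown output is typically split into chunks before embedding. The sketch below shows one simple way to do that, one chunk per heading. It is an illustration rather than part of Office Oxide: `split_by_heading` is a hypothetical helper, and it assumes the Markdown uses `#`-style headings, which this page does not confirm.

```python
import re

# Matches "#"-style headings of level 1 to 6 (assumed output style).
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown):
    """Split Markdown into chunks, one per heading, keeping the heading as title."""
    chunks = []
    title, lines = "", []

    def flush():
        # Skip empty chunks (for example, nothing before the first heading).
        if title or any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


sample = "# Intro\nHello\n\n## Data\n| a | b |\n"
for chunk in split_by_heading(sample):
    print(chunk["title"], "->", repr(chunk["text"]))
```

This version does not special-case fenced code blocks, so a line starting with `#` inside a code block would be treated as a heading.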

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

Migrate from Apache Tika

Apache Tika is the de-facto JVM library for extracting text from a huge variety of file formats — including DOCX, XLSX, PPTX, and the legacy DOC, XLS, PPT. If your pipeline is Office-document-only, Office Oxide is the right replacement: same six formats, no JVM, native speed, simpler deployment.

When to migrate

Switch if any of these apply:

Your ingestion pipeline only deals with Office documents (you don’t need PDF, EPUB, RTF, ODT, etc. that Tika also handles)
You don’t want to ship and tune a JVM in your container / Lambda / desktop app
You want native bindings in Python, Node.js, Go, C#, or Rust — not just a Java JAR
Per-file latency matters; Tika’s startup cost and JVM warmup hurt short-lived workers
You want structured Markdown / IR output for LLM and RAG pipelines

Stay on Tika if:

You ingest a long tail of formats Office Oxide doesn’t cover (Tika handles ~1,400 file types)
You already have a JVM ingestion service and adding native bindings isn’t worth the architecture change
You depend on Tika’s MIME detection across that long tail

A common middle ground: keep Tika for the long tail, use Office Oxide for .docx / .xlsx / .pptx / .doc / .xls / .ppt (which dominate volume in most enterprise corpora).
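The middle-ground setup comes down to a routing decision per file. A minimal sketch, assuming routing by file extension is acceptable for your corpus; `choose_backend` and the backend labels are hypothetical names, not part of either library.

```python
from pathlib import Path

# The six extensions this guide says Office Oxide reads.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".doc", ".xls", ".ppt"}


def choose_backend(path):
    """Return "office_oxide" for Office files and "tika" for everything else."""
    suffix = Path(path).suffix.lower()
    return "office_oxide" if suffix in OFFICE_EXTENSIONS else "tika"


for name in ["report.DOCX", "budget.xls", "scan.pdf", "notes"]:
    print(name, "->", choose_backend(name))
```

Extension-based routing trusts the file name. If your corpus contains mislabeled files, keep Tika's MIME detection in front of this decision.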

Install

Python

pip install office-oxide

(Replaces tika or apache-tika Python wrappers, plus the JVM you were running them against.)

Rust

[dependencies]
office_oxide = "0.1.0"

Java

If you’re committed to the JVM, use Office Oxide via its C FFI plus JNA / JNR-FFI. Alternatively, run office_oxide_cli as a sidecar process called over stdio: same engine, no JVM bridge code.

Side-by-side cheat sheet — Python

Plain text

Tika (REST mode)

import tika
from tika import parser

tika.initVM()    # JVM startup; ~1-2s on first call
parsed = parser.from_file("report.docx")
text = parsed["content"]
metadata = parsed["metadata"]

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    text = doc.plain_text()
    props = doc.as_docx().core_properties()    # author, modified, etc.

No JVM startup, no REST round-trip, sub-millisecond extraction.
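If existing call sites read `parsed["content"]`, a thin shim can keep them unchanged during the migration. The sketch below is one possible shape, not an official adapter: `from_file` and its `open_document` parameter are hypothetical, and only `Document.open` and `plain_text()` are taken from the example above. Metadata is left out because this guide does not describe what `core_properties()` returns.

```python
def from_file(path, open_document=None):
    """Return a Tika-style dict ({"content": ...}) backed by office_oxide."""
    if open_document is None:
        # Imported lazily so the shim can be exercised with a stand-in opener.
        from office_oxide import Document
        open_document = Document.open
    with open_document(path) as doc:
        return {"content": doc.plain_text()}


class FakeDocument:
    """Stand-in for a document object, used only to demonstrate the shim."""

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False

    def plain_text(self):
        return "hello"


parsed = from_file("report.docx", open_document=lambda path: FakeDocument())
print(parsed["content"])
```

With office_oxide installed, calling `from_file("report.docx")` without the second argument uses the real library.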

Bytes input (no temp file)

Tika

import io, requests
from tika import parser

data = requests.get(url).content
parsed = parser.from_buffer(io.BytesIO(data))

office_oxide

import requests
from office_oxide import Document

data = requests.get(url).content
with Document.from_bytes(data, "docx") as doc:
    print(doc.plain_text())
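The example above hard-codes `"docx"` as the format. When the URL varies, the format string can be derived from the URL path. A minimal sketch: `format_hint` is a hypothetical helper, and it assumes `from_bytes` accepts the lowercase extension for the other five formats, which this guide only shows for `"docx"`.

```python
import posixpath
from urllib.parse import urlparse

# Lowercase extensions for the six formats listed in this guide.
OFFICE_FORMATS = {"docx", "xlsx", "pptx", "doc", "xls", "ppt"}


def format_hint(url):
    """Derive the format string from the file extension in the URL path."""
    path = urlparse(url).path  # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    if ext not in OFFICE_FORMATS:
        raise ValueError(f"not an Office document URL: {url!r}")
    return ext


print(format_hint("https://example.com/files/Report.DOCX?download=1"))
```

The result would then be passed as the second argument, as in `Document.from_bytes(data, format_hint(url))`.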

Server / batch processing

Tika — usually run in tika-server mode behind HTTP.

java -jar tika-server.jar -h 0.0.0.0 -p 9998

import requests

with open("report.docx", "rb") as f:    # close the file once the upload finishes
    text = requests.put("http://localhost:9998/tika",
                        data=f,
                        headers={"Accept": "text/plain"}).text

office_oxide — drop the JVM and the server, just call the library directly. If you need a sidecar architecture (language-agnostic clients), use the MCP server or the CLI over stdio.
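Calling the library directly for a batch can look like the sketch below: a worker pool that maps an extraction function over a list of paths and records failures per file. `extract_all` and `extract_with_office_oxide` are hypothetical names; only `Document.open` and `plain_text()` come from this guide. A thread pool is used for simplicity. This guide does not say whether office_oxide releases the GIL during parsing, so a process pool may be the better fit for CPU-bound batches.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_with_office_oxide(path):
    """Extract plain text using the calls shown in the cheat sheet."""
    from office_oxide import Document  # lazy import; requires the package
    with Document.open(path) as doc:
        return doc.plain_text()


def extract_all(paths, extract=extract_with_office_oxide, workers=8):
    """Map `extract` over `paths`; return {path: text, or the exception raised}."""
    paths = list(paths)

    def safe(path):
        # One bad file should not abort the whole batch.
        try:
            return extract(path)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result.
        return dict(zip(paths, pool.map(safe, paths)))


# Demonstration with a stand-in extractor so the sketch runs without any files.
print(extract_all(["a.docx", "b.xlsx"], extract=str.upper, workers=2))
```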

Side-by-side — JVM users

If your pipeline is Java/Kotlin/Scala and you don’t want to drop the JVM:

Keep Tika for everything that isn’t Office.
Call office-oxide for Office formats. Two options:
- JNA / JNR-FFI against liboffice_oxide and the C header at include/office_oxide_c/office_oxide.h. Same C API used by the Go and C# bindings.
- office_oxide_cli sidecar via ProcessBuilder. Stream input over stdin, read output over stdout. Trivially restartable, isolates crashes.

Either is faster than running Tika for Office formats — and avoids JVM-on-JVM weirdness.
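The sidecar option has the same shape in any language: start a child process, write the document to its stdin, read the result from its stdout. The sketch below shows that shape in Python to match the other examples on this page; a Java `ProcessBuilder` version would follow the same steps. `run_sidecar` is a hypothetical helper. The arguments of office_oxide_cli are not given in this guide, so a stand-in child process is used instead.

```python
import subprocess
import sys


def run_sidecar(command, payload, timeout=30.0):
    """Send `payload` (bytes) to the child's stdin and return its stdout as text."""
    completed = subprocess.run(
        command,
        input=payload,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.decode("utf-8")


# Stand-in child so the sketch runs without office_oxide_cli installed.
# In a real pipeline, `command` would be the office_oxide_cli invocation.
ECHO_UPPER = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

print(run_sidecar(ECHO_UPPER, b"hello"))  # prints HELLO
```

Because each call runs in its own process, a crash on one document surfaces as an exception in the caller and does not take the host process down.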

What you get vs. Tika

Feature	Tika	Office Oxide
DOCX, XLSX, PPTX	✓	✓
Legacy DOC, XLS, PPT	✓	✓
PDF, EPUB, RTF, ODT, etc.	✓	✗ (use pdf_oxide for PDF)
Plain text extraction	✓	✓
Markdown output	partial	✓ (built-in `to_markdown`)
Structured IR / JSON	XHTML SAX events	✓ (typed `DocumentIR`)
Find-and-replace templating	✗	✓ (`EditableDocument`)
Cell writes (XLSX)	✗	✓ (`set_cell`)
Legacy → modern conversion	✗	✓ (`save_as`)
JVM required	✓	✗
Native speed	JVM overhead	0.7–5.0 ms average per file (PPTX, DOCX, XLSX)

Performance (Office formats only)

Wall time for a million-document Office ingestion run (a mix of DOCX, XLSX, PPTX, DOC, XLS, and PPT), compared against tika-server and tika-app:

Pipeline	Wall time	Notes
tika-server (REST), 8 workers	~3 h 40 m	Includes HTTP overhead
tika-app (in-process JVM), 8 workers	~1 h 50 m	Best-case Tika
office_oxide, 8 workers	~3 m	Native parsing

Numbers vary with the format mix; for ingestion-heavy Office workloads the gap is usually 30–60×.
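To relate the table to the range quoted above, the wall times can be converted into throughput and speedup. The sketch below uses the approximate figures from the table. The names are illustrative, and because the inputs are rounded, the results are rough as well.

```python
DOCUMENTS = 1_000_000

# Approximate wall times from the table above, in minutes.
WALL_MINUTES = {
    "tika-server": 3 * 60 + 40,  # ~3 h 40 m
    "tika-app": 1 * 60 + 50,     # ~1 h 50 m
    "office_oxide": 3,           # ~3 m
}


def docs_per_second(pipeline):
    """Average throughput over the whole run."""
    return DOCUMENTS / (WALL_MINUTES[pipeline] * 60)


def speedup_over(baseline):
    """How many times shorter the office_oxide run is than `baseline`."""
    return WALL_MINUTES[baseline] / WALL_MINUTES["office_oxide"]


for name in WALL_MINUTES:
    print(f"{name}: {docs_per_second(name):,.0f} docs/s")
for name in ("tika-server", "tika-app"):
    print(f"office_oxide vs {name}: {speedup_over(name):.1f}x")
```

With these rounded inputs, the run works out to roughly 37× against in-process Tika and roughly 73× against tika-server, so this run sits at or above the usual range.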