What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

Converting Office documents to Markdown

Every Office Oxide handle has a to_markdown() method that emits GitHub-flavored Markdown (headings, tables, lists, and code-like blocks) from any of the six supported formats. It is the right entry point for most LLM and RAG pipelines.

One-shot call

Rust

use office_oxide::to_markdown;

let md = to_markdown("report.docx")?;
std::fs::write("report.md", md)?;

Python

import office_oxide

md = office_oxide.to_markdown("report.docx")
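# Use a context manager so the file is flushed and closed deterministically
with open("report.md", "w", encoding="utf-8") as f:
    f.write(md)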

JavaScript

import { toMarkdown } from 'office-oxide';
import { writeFileSync } from 'node:fs';

writeFileSync('report.md', toMarkdown('report.docx'));

Go

md, err := officeoxide.ToMarkdown("report.docx")
if err != nil { log.Fatal(err) } // an unused err does not compile in Go
os.WriteFile("report.md", []byte(md), 0o644)

C#

File.WriteAllText("report.md", OfficeOxide.ToMarkdown("report.docx"));

C

int err = 0;
char *md = office_to_markdown("report.docx", &err);   /* open + render in one call */
if (md) {
    FILE *f = fopen("report.md", "w");
    if (f) {                                          /* fopen can fail; never write to NULL */
        fputs(md, f);
        fclose(f);
    }
    office_oxide_free_string(md);
}

Reusable handle

Rust

let doc = office_oxide::Document::open("deck.pptx")?;
let md = doc.to_markdown();

Python

from office_oxide import Document

with Document.open("deck.pptx") as doc:
    md = doc.to_markdown()

JavaScript

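// Assumption: Document is exported by the same 'office-oxide' package as toMarkdown
import { Document } from 'office-oxide';

using doc = Document.open('deck.pptx');
const md = doc.toMarkdown();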

C

int err = 0;
OfficeDocumentHandle *doc = office_document_open("deck.pptx", &err);
if (doc) {
    char *md = office_document_to_markdown(doc, &err);
    if (md) { /* use md */ office_oxide_free_string(md); }
    office_document_free(doc);
}

WASM

import { WasmDocument } from 'office-oxide-wasm';

// WASM has no file I/O — read the bytes yourself, then open from bytes
const data = new Uint8Array(await (await fetch('/deck.pptx')).arrayBuffer());
using doc = new WasmDocument(data, 'pptx');
const md = doc.toMarkdown();

What ends up in the output

Source element	Markdown
DOCX heading (`<w:pStyle w:val="Heading1"/>` …)	`# Heading` (level follows the style)
DOCX paragraph	One paragraph; soft hyphens are stripped
DOCX list item	`- item` or `1. item` (numbering is preserved)
DOCX table	GFM pipe table
XLSX sheet	`## Sheet name` + one pipe table per range
XLSX merged cells	Content of the first cell; the span is dropped
PPTX slide	`## Slide N` + body; notes are appended as a blockquote
PPTX table	GFM pipe table inline in the slide
Hyperlinks	`[text](url)`
Images	`![alt](filename)` placeholder (see "Images" below)
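
To make the XLSX row concrete, here is a minimal sketch of the output shape it describes: a `## Sheet name` heading followed by a GFM pipe table. The function `sheet_to_markdown` and its arguments are hypothetical and exist only for illustration. This is not Office Oxide's renderer, and the library's exact escaping and spacing may differ.

```python
def sheet_to_markdown(name, rows):
    """Render a sheet name and rows of cell values as a heading plus a GFM pipe table."""

    def cell(value):
        # Empty cells become empty strings; literal pipes must be escaped in GFM tables.
        text = "" if value is None else str(value)
        return text.replace("|", "\\|").replace("\n", " ")

    lines = [f"## {name}", ""]
    if not rows:
        return "\n".join(lines).rstrip() + "\n"

    # Pad ragged rows so every table row has the same number of columns.
    width = max(len(r) for r in rows)
    padded = [list(r) + [None] * (width - len(r)) for r in rows]
    header, body = padded[0], padded[1:]

    lines.append("| " + " | ".join(cell(v) for v in header) + " |")
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    for row in body:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines) + "\n"
```

The first row is treated as the header because GFM tables require one; whether the library makes the same choice is not stated in the source.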

Images

to_markdown() emits placeholders with the file name (for example, ![](image1.png)) but does not extract the image bytes, since Markdown is a text format. To pull out the images, use the IR or format-specific access:

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()
    for section in ir["sections"]:
        for el in section["elements"]:
            if el["kind"] == "Image":
                print(el["filename"], len(el["data"]))

The full schema is described under IR extraction.
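
As a sketch of where that loop can lead, the helper below writes each image found in the IR to a directory. It relies only on the keys used in the snippet above (`sections`, `elements`, `kind`, `filename`, `data`). The function name `save_images` is hypothetical, and it assumes `data` is a bytes-like object, which the source does not state; if your build returns an encoded string instead, decode it first.

```python
from pathlib import Path


def save_images(ir, out_dir):
    """Write every Image element of an IR dict to out_dir and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for section in ir.get("sections", []):
        for el in section.get("elements", []):
            if el.get("kind") != "Image":
                continue
            # Keep only the base name so a stored path cannot escape out_dir.
            target = out / Path(el["filename"]).name
            # Assumption: el["data"] is bytes-like (bytes, bytearray, memoryview).
            target.write_bytes(bytes(el["data"]))
            written.append(target)
    return written
```

With a real document this could be called as `save_images(doc.to_ir(), "images")` inside the `with Document.open(...)` block.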

Use cases

RAG ingestion: Markdown is the most LLM-friendly format. One pass per document, deterministic structure, no HTML noise.
Document indexing: headings give natural chunk boundaries, and tables remain queryable.
Migrations: DOCX → Markdown for static site generators (Hugo, Astro, MkDocs).
Content diffs: Markdown diffs are far more readable than binary .docx diffs.
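
The point about headings as chunk boundaries can be sketched in a few lines of plain Python. The function `chunk_by_headings` is a hypothetical helper, not part of Office Oxide; it operates on any Markdown string, such as the one returned by to_markdown(), and only recognises ATX headings (`#` through `######`).

```python
import re

# An ATX heading: 1-6 '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_headings(markdown):
    """Split Markdown into chunks, starting a new chunk at every ATX heading."""
    chunks = []
    current = {"heading": None, "level": 0, "lines": []}
    in_fence = False

    def has_content(chunk):
        return chunk["heading"] is not None or any(l.strip() for l in chunk["lines"])

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' inside fenced code is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            if has_content(current):
                chunks.append(current)
            current = {"heading": match.group(2), "level": len(match.group(1)), "lines": []}
        else:
            current["lines"].append(line)
    if has_content(current):
        chunks.append(current)

    return [
        {"heading": c["heading"], "level": c["level"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]
```

Each chunk keeps its heading and level, so a retrieval index can store them as metadata next to the chunk text. Tables stay inside the chunk of the heading that precedes them.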

Performance

to_markdown() costs roughly the same order as plain_text(), typically 1–2× on the median document. Full figures are in the Performance section.
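
To check that ratio on your own files, a small timing harness is enough. The sketch below takes any two zero-argument callables and reports their median wall-clock times and the ratio between them. `median_ms` and `compare` are hypothetical helper names, not part of Office Oxide.

```python
import statistics
import time


def median_ms(fn, repeats=20):
    """Return the median wall-clock time of fn() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def compare(candidate, baseline, repeats=20):
    """Return (candidate_ms, baseline_ms, ratio) based on median timings."""
    cand = median_ms(candidate, repeats)
    base = median_ms(baseline, repeats)
    # Guard against a zero baseline on very coarse clocks.
    ratio = cand / base if base > 0 else float("inf")
    return cand, base, ratio
```

With an open handle this might be called as `compare(doc.to_markdown, doc.plain_text)`, assuming plain_text() is a method on the same handle, which the text implies but does not show.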

See also

HTML extraction: when you need styled output
IR extraction: structured JSON for more complex pipelines
PDF for RAG: for PDFs, use the companion library pdf_oxide