What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

Rust library for Office: quick start

office_oxide is a pure-Rust crate for parsing, converting, and editing Office documents: DOCX, XLSX, and PPTX, plus their legacy predecessors DOC, XLS, and PPT. One crate, a single Document handle, no native dependencies.

Installation

[dependencies]
office_oxide = "0.1.0"

Optional features:

office_oxide = { version = "0.1.0", features = ["mmap"] }       # memory-mapped opening of large OOXML files
office_oxide = { version = "0.1.0", features = ["parallel"] }   # rayon-based parallel parsing helpers

Reading a document

use office_oxide::Document;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = Document::open("report.docx")?;
    println!("{}", doc.plain_text());
    Ok(())
}

Or use the one-shot helper:

let text = office_oxide::extract_text("report.docx")?;

Core API

The Document handle behaves the same for every format. The format is detected from the file extension and then checked against the file's magic bytes.

use office_oxide::{Document, DocumentFormat};

let doc = Document::open("file.xlsx")?;
assert_eq!(doc.format(), DocumentFormat::Xlsx);

let plain = doc.plain_text();
let md    = doc.to_markdown();
let html  = doc.to_html();
let ir    = doc.to_ir();             // format-agnostic IR

doc.save_as("file.docx")?;            // legacy → OOXML works too

Document::open accepts AsRef<Path>; Document::from_reader takes a Read + Seek + Send + 'static reader and an explicit DocumentFormat.
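The magic-byte check mentioned above can be illustrated with a std-only sketch. This is not the crate's detection code: `Container` and `sniff_container` are hypothetical names, and the sketch assumes the reader is positioned at byte 0. The two signatures are the standard ZIP local-file header and the full 8-byte CFB header.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Container families the six formats fall into (hypothetical names).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Container {
    /// ZIP-based OOXML: DOCX, XLSX, PPTX.
    Zip,
    /// Compound File Binary: legacy DOC, XLS, PPT.
    Cfb,
    /// Neither signature matched.
    Unknown,
}

/// Reads up to the first 8 bytes, classifies them, then seeks back to
/// offset 0 so a parser can start from the beginning.
pub fn sniff_container<R: Read + Seek>(reader: &mut R) -> io::Result<Container> {
    const ZIP_MAGIC: [u8; 4] = [0x50, 0x4B, 0x03, 0x04]; // "PK\x03\x04"
    const CFB_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

    let mut head = [0u8; 8];
    let mut filled = 0;
    // `read` may return fewer bytes than requested, so loop until full or EOF.
    while filled < head.len() {
        let n = reader.read(&mut head[filled..])?;
        if n == 0 {
            break;
        }
        filled += n;
    }
    reader.seek(SeekFrom::Start(0))?;

    // Only compare the bytes that were actually read.
    let head = &head[..filled];
    Ok(if head.starts_with(&CFB_MAGIC) {
        Container::Cfb
    } else if head.starts_with(&ZIP_MAGIC) {
        Container::Zip
    } else {
        Container::Unknown
    })
}
```

The signature only identifies the container family, not the format: DOCX, XLSX, and PPTX all start with the same ZIP bytes. That is one reason an extension or an explicit DocumentFormat is still needed alongside the byte check.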

Module-level shortcuts:

let text = office_oxide::extract_text("file.docx")?;
let md   = office_oxide::to_markdown("file.pptx")?;
let html = office_oxide::to_html("file.xlsx")?;

Format-specific access

When you need format-specific data (sheets, slides, table cells), unwrap the inner document:

if let Some(xlsx) = doc.as_xlsx() {
    for sheet in xlsx.sheets() {
        println!("sheet: {}", sheet.name());
    }
}

The same works for as_docx, as_pptx, as_doc, as_xls, and as_ppt.

Editing

EditableDocument performs a read-modify-write cycle and preserves every untouched OPC part (images, charts, styles, relationships) verbatim. Only DOCX, XLSX, and PPTX are supported.

use office_oxide::edit::EditableDocument;

let mut doc = EditableDocument::open("template.docx")?;
let n = doc.replace_text("{{name}}", "Alice");
println!("{n} replacements");
doc.save("out.docx")?;

replace_text walks the <w:t> elements in DOCX and the <a:t> elements in PPTX and returns the number of replacements. For XLSX it returns 0; use set_cell instead.
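The counting behaviour described above can be pictured with a std-only sketch. It models text elements as plain strings and is not the library's code: `replace_in_runs` is a hypothetical name, and matching inside each element separately is an assumption, since this page does not say whether adjacent elements are joined before matching.

```rust
/// Replaces every occurrence of `from` inside each text element and
/// returns the total number of replacements made.
pub fn replace_in_runs(runs: &mut [String], from: &str, to: &str) -> usize {
    // An empty pattern would match at every character boundary; treat it as a no-op.
    if from.is_empty() {
        return 0;
    }
    let mut count = 0;
    for run in runs.iter_mut() {
        // `matches` counts non-overlapping hits, the same ones `replace` rewrites.
        let hits = run.matches(from).count();
        if hits > 0 {
            *run = run.replace(from, to);
            count += hits;
        }
    }
    count
}
```

Under this assumption, a placeholder that an editor has split across two elements (for example `{{na` followed by `me}}`) is never found, so a return value of 0 on a template is worth checking before saving.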

Writing XLSX cells

use office_oxide::edit::EditableDocument;
use office_oxide::xlsx::edit::CellValue;

let mut wb = EditableDocument::open("budget.xlsx")?;
wb.set_cell(0, "B2", CellValue::Number(42.0))?;
wb.set_cell(0, "A1", CellValue::String("Total".into()))?;
wb.set_cell(0, "C1", CellValue::Boolean(true))?;
wb.set_cell(0, "D1", CellValue::Empty)?;
wb.save("budget.xlsx")?;

Sheet indexes are zero-based; cell addresses use standard A1 notation (A1, AA12).
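For readers who generate addresses programmatically, the A1 notation above maps to numeric indexes as follows. This is a std-only sketch; `parse_a1` is a hypothetical helper, not part of office_oxide, and returning zero-based indexes is a choice made for this example.

```rust
/// Parses an A1-style address such as "B2" or "AA12" into zero-based
/// (column, row) indexes. Returns `None` for malformed input.
pub fn parse_a1(addr: &str) -> Option<(u32, u32)> {
    // Split at the first ASCII digit: column letters before, row digits after.
    let split = addr.find(|c: char| c.is_ascii_digit())?;
    let (letters, digits) = addr.split_at(split);
    if letters.is_empty() || !letters.bytes().all(|b| b.is_ascii_alphabetic()) {
        return None;
    }
    if !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }

    // Column letters form a bijective base-26 number: A=1 ... Z=26, AA=27.
    let mut col: u32 = 0;
    for b in letters.bytes() {
        let digit = u32::from(b.to_ascii_uppercase() - b'A') + 1;
        col = col.checked_mul(26)?.checked_add(digit)?;
    }

    // Rows are 1-based in A1 notation, so row 0 is invalid.
    let row: u32 = digits.parse().ok()?;
    if row == 0 {
        return None;
    }
    Some((col - 1, row - 1))
}
```

With this helper, `parse_a1("B2")` yields column 1, row 1, and `parse_a1("AA12")` yields column 26, row 11.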

Format-agnostic IR

DocumentIR is the structural bridge between formats. It underpins to_html, save_as, and legacy conversion. It implements Serialize / Deserialize, so it is suitable for emitting JSON.

let legacy = Document::open("old.doc")?;
legacy.save_as("migrated.docx")?;     // CFB → OOXML in one line

Opening from bytes

use std::io::Cursor;
use office_oxide::{Document, DocumentFormat};

let bytes = std::fs::read("file.pptx")?;
let doc = Document::from_reader(Cursor::new(bytes), DocumentFormat::Pptx)?;

Memory-mapped opening

With the mmap feature, Document::open_mmap avoids copying large OOXML files onto the heap:

let doc = Document::open_mmap("huge.xlsx")?;

mmap is available only for DOCX/XLSX/PPTX; the legacy CFB parsers require owned buffers.

Errors

All errors are returned through office_oxide::Result<T>, that is, Result<T, OfficeError>. The enum covers I/O, parsing, unsupported formats, and extraction failures.

use office_oxide::{Document, OfficeError};

match Document::open("weird.file") {
    Ok(doc) => println!("{}", doc.plain_text()),
    Err(OfficeError::UnsupportedFormat(ext)) => eprintln!("cannot open .{ext}"),
    Err(e) => eprintln!("error: {e}"),
}

Troubleshooting

Symptom	Cause
`UnsupportedFormat("(none)")`	The path has no extension. Open it through `from_reader` with an explicit `DocumentFormat`.
Garbled text in DOC	The file is encrypted or uses a rare piece-table encoding. Check for the CFB magic `D0 CF 11 E0`.
Missing hyperlinks in DOCX	Links are resolved through `w:rels`. Make sure the sibling `.rels` part is present in the ZIP.
Stack overflow on threads with small stacks	`office_oxide` spawns a 16 MB parse thread when `RLIMIT_STACK < 12 MB`. In your own thread pools, set `Builder::stack_size(16 * 1024 * 1024)`.
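The last row refers to the standard library's `std::thread::Builder`. A minimal sketch of running a job on a thread with a 16 MB stack follows. `run_with_large_stack`, `PARSE_STACK_BYTES`, and the thread name are hypothetical names for this example; the closure stands in for whatever parsing call you make.

```rust
use std::io;
use std::thread;

/// Stack size suggested in the troubleshooting table (16 MB).
pub const PARSE_STACK_BYTES: usize = 16 * 1024 * 1024;

/// Runs `job` on a dedicated thread with a 16 MB stack and returns its result.
pub fn run_with_large_stack<F, T>(job: F) -> io::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let handle = thread::Builder::new()
        .name("office-parse".into())
        .stack_size(PARSE_STACK_BYTES)
        .spawn(job)?; // fails if the OS cannot create the thread

    // A panic inside `job` surfaces as Err from `join`; map it to an io::Error.
    handle
        .join()
        .map_err(|_| io::Error::new(io::ErrorKind::Other, "parse thread panicked"))
}
```

A call such as `run_with_large_stack(|| parse_one(path))` keeps the stack requirement local to the parsing job, where `parse_one` is a placeholder for your own function. Thread pools that expose a per-thread stack-size setting can be given the same constant instead.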

See also

API reference: a single cross-language reference covering all 7 bindings
Other languages: Python, Node.js, WASM, C#, Golang, C
Python quick start: the same API in Python
Performance benchmarks: full numbers across 6,062 files
Architecture: ARCHITECTURE.md
Crate on crates.io, documentation on docs.rs