What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

Rust library for Office: Quick start

office_oxide is a pure-Rust crate for parsing, converting, and editing Office documents: DOCX, XLSX, PPTX, and their binary predecessors DOC, XLS, and PPT. One crate, one unified Document handle, no native dependencies.

Installation

[dependencies]
office_oxide = "0.1.0"

Optional features:

office_oxide = { version = "0.1.0", features = ["mmap"] }       # memory-mapped opening of large OOXML files
office_oxide = { version = "0.1.0", features = ["parallel"] }   # helpers for parallel parsing via rayon

Reading a document

use office_oxide::Document;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = Document::open("report.docx")?;
    println!("{}", doc.plain_text());
    Ok(())
}

Or use the one-call helper:

let text = office_oxide::extract_text("report.docx")?;

Core API

The Document handle behaves the same for every format. The format is detected from the file extension and then verified against the file's magic bytes.

use office_oxide::{Document, DocumentFormat};

let doc = Document::open("file.xlsx")?;
assert_eq!(doc.format(), DocumentFormat::Xlsx);

let plain = doc.plain_text();
let md    = doc.to_markdown();
let html  = doc.to_html();
let ir    = doc.to_ir();             // format-independent IR

doc.save_as("file.docx")?;            // legacy → OOXML works too

Document::open takes an AsRef<Path>. Document::from_reader requires Read + Seek + Send + 'static and an explicit DocumentFormat.
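The magic-byte verification mentioned at the top of this section can be pictured with a std-only sketch. It is not the crate's implementation: Container and sniff_container are hypothetical names introduced here. Only the two signatures are established facts: ZIP local file headers start with 50 4B 03 04, and Compound File Binary (CFB) files start with D0 CF 11 E0 A1 B1 1A E1.

```rust
/// Container families that the two magic numbers distinguish.
/// (Hypothetical type for illustration; not part of office_oxide.)
#[derive(Debug, PartialEq, Eq)]
pub enum Container {
    /// ZIP archive: the container used by DOCX, XLSX, and PPTX.
    Zip,
    /// Compound File Binary: the container used by DOC, XLS, and PPT.
    Cfb,
    /// Neither signature matched.
    Unknown,
}

/// ZIP local file header signature ("PK\x03\x04").
const ZIP_MAGIC: [u8; 4] = [0x50, 0x4B, 0x03, 0x04];
/// Full 8-byte CFB signature; it begins with D0 CF 11 E0.
const CFB_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

/// Classify a buffer by its leading bytes only.
pub fn sniff_container(bytes: &[u8]) -> Container {
    if bytes.starts_with(&ZIP_MAGIC) {
        Container::Zip
    } else if bytes.starts_with(&CFB_MAGIC) {
        Container::Cfb
    } else {
        Container::Unknown
    }
}
```

The signature identifies only the container family. All three OOXML formats share the ZIP signature and all three legacy formats share the CFB one, which is why the extension (or an explicit DocumentFormat) is still needed to pick the concrete format.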

Module-level shortcuts:

let text = office_oxide::extract_text("file.docx")?;
let md   = office_oxide::to_markdown("file.pptx")?;
let html = office_oxide::to_html("file.xlsx")?;

Format-specific access

When you need format-specific data (worksheets, slides, table cells), unwrap the inner document:

if let Some(xlsx) = doc.as_xlsx() {
    for sheet in xlsx.sheets() {
        println!("sheet: {}", sheet.name());
    }
}

The same pattern applies to as_docx, as_pptx, as_doc, as_xls, and as_ppt.

Editing

EditableDocument performs a read-modify-write cycle and preserves every unmodified OPC part (images, charts, styles, relationships). Only DOCX, XLSX, and PPTX are supported.

use office_oxide::edit::EditableDocument;

let mut doc = EditableDocument::open("template.docx")?;
let n = doc.replace_text("{{name}}", "Alice");
println!("{n} replacements");
doc.save("out.docx")?;

replace_text operates on <w:t> elements in DOCX and on <a:t> elements in PPTX. It returns the number of replacements (0 for XLSX; use set_cell there).
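To make the returned count concrete, here is a std-only sketch of counting replacements across a list of text runs, where each String stands in for the content of one <w:t> or <a:t> element. replace_in_runs is a hypothetical helper, not the crate's code, and whether office_oxide matches a placeholder that is split across two runs is not stated in this guide.

```rust
/// Replace every occurrence of `needle` in each text run and return the
/// total number of replacements made across all runs.
/// (Hypothetical helper for illustration; not part of office_oxide.)
pub fn replace_in_runs(runs: &mut [String], needle: &str, replacement: &str) -> usize {
    if needle.is_empty() {
        // An empty needle would match at every position; treat it as a no-op.
        return 0;
    }
    let mut count = 0;
    for run in runs.iter_mut() {
        // `matches` and `replace` both work on non-overlapping occurrences,
        // so the count agrees with what `replace` actually substitutes.
        let hits = run.matches(needle).count();
        if hits > 0 {
            *run = run.replace(needle, replacement);
            count += hits;
        }
    }
    count
}
```

In this sketch each run is searched independently, so a placeholder such as {{name}} that a word processor has split across two adjacent runs is left untouched. If a template reports fewer replacements than expected, inspecting how the placeholder is split in the XML is a reasonable first check.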

Setting XLSX cells

use office_oxide::edit::EditableDocument;
use office_oxide::xlsx::edit::CellValue;

let mut wb = EditableDocument::open("budget.xlsx")?;
wb.set_cell(0, "B2", CellValue::Number(42.0))?;
wb.set_cell(0, "A1", CellValue::String("Total".into()))?;
wb.set_cell(0, "C1", CellValue::Boolean(true))?;
wb.set_cell(0, "D1", CellValue::Empty)?;
wb.save("budget.xlsx")?;

Sheet indices are zero-based; cell references use standard A1 notation (A1, AA12).
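For readers who want to see how A1 notation maps onto numeric positions, the following std-only sketch converts a reference into zero-based (column, row) indices. parse_a1 is a hypothetical helper and not part of office_oxide; whether the crate accepts lowercase letters or absolute references such as $A$1 is not covered here.

```rust
/// Parse an A1-style reference such as "B2" or "AA12" into zero-based
/// (column, row) indices. Returns None for malformed input.
/// (Hypothetical helper for illustration; not part of office_oxide.)
pub fn parse_a1(reference: &str) -> Option<(u32, u32)> {
    // Split at the first character that is not an ASCII letter.
    let split = reference
        .find(|c: char| !c.is_ascii_alphabetic())
        .unwrap_or(reference.len());
    let (letters, digits) = reference.split_at(split);
    if letters.is_empty() || digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }

    // Column letters form a bijective base-26 number: A = 1, Z = 26, AA = 27.
    let mut col: u32 = 0;
    for b in letters.bytes() {
        let digit = u32::from(b.to_ascii_uppercase() - b'A') + 1;
        col = col.checked_mul(26)?.checked_add(digit)?;
    }

    // Rows are one-based in the notation, so row 0 does not exist.
    let row: u32 = digits.parse().ok()?;
    if row == 0 {
        return None;
    }
    Some((col - 1, row - 1))
}
```

With this mapping, B2 is column 1, row 1, and AA12 is column 26, row 11.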

Format-independent IR

DocumentIR is the structural bridge between formats: it underpins to_html, save_as, and the legacy conversion. It implements Serialize / Deserialize, so you can write it out as JSON for downstream tooling.

let legacy = Document::open("old.doc")?;
legacy.save_as("migrated.docx")?;     // CFB → OOXML in one line

Opening from bytes

use std::io::Cursor;
use office_oxide::{Document, DocumentFormat};

let bytes = std::fs::read("file.pptx")?;
let doc = Document::from_reader(Cursor::new(bytes), DocumentFormat::Pptx)?;
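The Read + Seek + Send + 'static bounds on from_reader explain why the example wraps an owned Vec<u8> in a Cursor. The sketch below uses a hypothetical stand-in function, count_bytes, that carries the same bounds, so the requirement can be tried out with the standard library alone. It is not the crate's code.

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};

/// Hypothetical stand-in carrying the bounds this guide gives for
/// `Document::from_reader`. It rewinds the reader and counts its bytes.
pub fn count_bytes<R>(mut reader: R) -> std::io::Result<usize>
where
    R: Read + Seek + Send + 'static,
{
    // `Seek` lets the consumer jump back to the start of the data.
    reader.seek(SeekFrom::Start(0))?;
    let mut buf = Vec::new();
    // `read_to_end` returns the number of bytes appended to `buf`.
    reader.read_to_end(&mut buf)
}

/// Owned data satisfies `'static`, so a `Cursor<Vec<u8>>` is accepted.
pub fn count_owned(bytes: Vec<u8>) -> std::io::Result<usize> {
    count_bytes(Cursor::new(bytes))
}
```

A Cursor over a slice borrowed from a local buffer would be rejected by the 'static bound, whereas Cursor<Vec<u8>> and std::fs::File both satisfy it.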

Memory-mapped opening

With the mmap feature enabled, Document::open_mmap avoids copying large OOXML files onto the heap:

let doc = Document::open_mmap("huge.xlsx")?;

mmap works only for DOCX/XLSX/PPTX; the legacy CFB parsers need owned buffers.

Errors

All fallible entry points return office_oxide::Result<T>, that is, Result<T, OfficeError>. The enum covers I/O, parsing, unsupported formats, and extraction errors.

use office_oxide::{Document, OfficeError};

match Document::open("weird.file") {
    Ok(doc) => println!("{}", doc.plain_text()),
    Err(OfficeError::UnsupportedFormat(ext)) => eprintln!("cannot open .{ext}"),
    Err(e) => eprintln!("failed: {e}"),
}

Troubleshooting

Symptom	Likely cause
`UnsupportedFormat("(none)")`	The path has no extension. Open it via `from_reader` with an explicit `DocumentFormat`.
Garbled DOC text	The file may be encrypted or use an unusual piece-table encoding. Check for the CFB magic `D0 CF 11 E0`.
Missing hyperlinks in DOCX	Hyperlinks are resolved through `w:rels`. Check that the `.rels` file is present in the ZIP.
Stack overflow on small-stack threads	`office_oxide` spawns a 16 MB parsing thread when `RLIMIT_STACK < 12 MB`. In your own thread pools, use `Builder::stack_size(16 * 1024 * 1024)`.
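The stack-size advice in the last row can be sketched with std::thread::Builder. The helper name run_with_big_stack, the constant PARSE_STACK_BYTES, and the thread name "office-parse" are hypothetical and chosen for illustration. Only Builder::stack_size and the 16 * 1024 * 1024 value come from the table.

```rust
use std::thread;

/// Stack size suggested in the troubleshooting table: 16 MiB.
const PARSE_STACK_BYTES: usize = 16 * 1024 * 1024;

/// Run `job` on a dedicated thread with a 16 MiB stack and return its result.
/// (Hypothetical helper for illustration; not part of office_oxide.)
pub fn run_with_big_stack<T, F>(job: F) -> std::io::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let handle = thread::Builder::new()
        .name("office-parse".into())
        .stack_size(PARSE_STACK_BYTES)
        .spawn(job)?; // fails only if the OS cannot create the thread

    // A panic inside `job` arrives here as Err; re-raise it on the caller's thread.
    Ok(handle
        .join()
        .unwrap_or_else(|payload| std::panic::resume_unwind(payload)))
}
```

The closure is where a parsing call such as office_oxide::extract_text would go, provided its result type is Send. Thread pools usually expose an equivalent setting at construction time, so the same value can be applied there instead of spawning one thread per document.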

See also

API reference: unified cross-language reference for all 7 bindings
Other languages: Python, Node.js, WASM, C#, Golang, and C
Python quick start: the same API in Python
Performance benchmarks: full numbers across 6,062 files
Architecture: ARCHITECTURE.md
Crate on crates.io, docs on docs.rs