What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

Table extraction

Office Oxide treats tables as first-class IR elements: every <w:tbl> in DOCX, every range in XLSX, and every <a:tbl> in PPTX is returned as a typed Table { rows: [[cell, ...]] }. One loop covers all three formats.

Iterate over all tables in a document

Python

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Table":
            for row in el["rows"]:
                print(row)

Rust

use office_oxide::Document;
use office_oxide::ir::Element;

let doc = Document::open("report.docx")?;
let ir = doc.to_ir();

for section in &ir.sections {
    for el in &section.elements {
        if let Element::Table(t) = el {
            for row in &t.rows {
                println!("{row:?}");
            }
        }
    }
}

JavaScript

using doc = Document.open('report.docx');
const ir = doc.toIr();

for (const section of ir.sections) {
  for (const el of section.elements) {
    if (el.kind === 'Table') {
      for (const row of el.rows) {
        console.log(row);
      }
    }
  }
}

Go

doc, err := officeoxide.Open("report.docx")
if err != nil { log.Fatal(err) }
defer doc.Close()

irJSON, err := doc.ToIRJSON()
if err != nil { log.Fatal(err) }
var ir struct {
    Sections []struct {
        Elements []struct {
            Kind string     `json:"kind"`
            Rows [][]string `json:"rows"`
        } `json:"elements"`
    } `json:"sections"`
}
if err := json.Unmarshal([]byte(irJSON), &ir); err != nil { log.Fatal(err) }

for _, section := range ir.Sections {
    for _, el := range section.Elements {
        if el.Kind == "Table" {
            for _, row := range el.Rows {
                fmt.Println(row)
            }
        }
    }
}

C#

using OfficeOxide;
using System.Linq;
using System.Text.Json;

using var doc = Document.Open("report.docx");
using var ir = JsonDocument.Parse(doc.ToIrJson());

foreach (var section in ir.RootElement.GetProperty("sections").EnumerateArray())
{
    foreach (var el in section.GetProperty("elements").EnumerateArray())
    {
        if (el.GetProperty("kind").GetString() != "Table") continue;
        foreach (var row in el.GetProperty("rows").EnumerateArray())
        {
            Console.WriteLine(string.Join(" | ", row.EnumerateArray().Select(c => c.GetString())));
        }
    }
}
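
The Python loop above can be folded into a small generator so that callers only see tables. This is a sketch that relies solely on the dictionary shape used in the examples (`sections`, `elements`, `kind`, `rows`). `iter_tables` is a hypothetical helper, not part of Office Oxide, and `sample_ir` is hand-written stand-in data rather than real library output.

```python
from typing import Iterator


def iter_tables(ir: dict) -> Iterator[tuple[int, list[list[str]]]]:
    """Yield (section_index, rows) for every Table element in an IR dict."""
    for index, section in enumerate(ir.get("sections", [])):
        for el in section.get("elements", []):
            if el.get("kind") == "Table":
                yield index, el.get("rows", [])


# Hand-written stand-in for the result of doc.to_ir()
sample_ir = {
    "sections": [
        {"elements": [
            {"kind": "Paragraph", "runs": [{"text": "Intro"}]},
            {"kind": "Table", "rows": [["a", "b"], ["1", "2"]]},
        ]},
        {"elements": [
            {"kind": "Table", "rows": [["x"]]},
        ]},
    ]
}

tables = list(iter_tables(sample_ir))
for section_index, rows in tables:
    print(section_index, rows)
```

Because the helper only reads the IR dictionary, the same call should work whether the dictionary came from a DOCX, XLSX, or PPTX file.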

XLSX: one table per sheet range

For spreadsheets, each section corresponds to a sheet, and each table to a detected used range. Empty cells come back as empty strings; a merged cell carries its value in the top-left position, and the remaining positions of the merge are empty.

Python

import csv
from office_oxide import Document

with Document.open("budget.xlsx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    sheet_name = section.get("title", "Sheet")
    out_path = f"{sheet_name}.csv"
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        for el in section["elements"]:
            if el["kind"] == "Table":
                for row in el["rows"]:
                    w.writerow(row)
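
The export above writes every table of a sheet into a single CSV and uses the sheet title directly as the file name. If a sheet can hold several used ranges, or its title contains characters that are awkward in file names, a variant along the following lines may be more convenient. It is a sketch: `safe_name` and `write_tables` are hypothetical helpers, the `_1`, `_2` suffix scheme is an arbitrary choice, and `sample_ir` is hand-written stand-in data in the shape used by the examples.

```python
import csv
import re
import tempfile
from pathlib import Path


def safe_name(name: str) -> str:
    """Replace characters that are awkward in file names with underscores."""
    cleaned = re.sub(r"[^\w.-]+", "_", name).strip("_")
    return cleaned or "Sheet"


def write_tables(ir: dict, out_dir: Path) -> list[Path]:
    """Write each Table element to its own CSV file and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for section in ir.get("sections", []):
        base = safe_name(section.get("title", "Sheet"))
        tables = [el for el in section.get("elements", []) if el.get("kind") == "Table"]
        for n, table in enumerate(tables, 1):
            path = out_dir / f"{base}_{n}.csv"
            with open(path, "w", newline="", encoding="utf-8") as f:
                csv.writer(f).writerows(table["rows"])
            written.append(path)
    return written


# Hand-written stand-in for the result of doc.to_ir() on a workbook
sample_ir = {"sections": [{"title": "Q1/Q2 budget", "elements": [
    {"kind": "Table", "rows": [["item", "cost"], ["rent", "1200"]]},
    {"kind": "Table", "rows": [["total", "1200"]]},
]}]}

with tempfile.TemporaryDirectory() as tmp:
    names = [p.name for p in write_tables(sample_ir, Path(tmp))]
print(names)
```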

If you need rich per-cell access (formulas, merges, named ranges), drop down to the format-specific module:

with Document.open("budget.xlsx") as doc:
    xlsx = doc.as_xlsx()
    for sheet in xlsx.sheets():
        print(sheet.name(), sheet.dimensions())

DOCX: tables interleaved with paragraphs

The IR preserves the original order of paragraphs and tables, so you can reconstruct the flow:

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Heading":
            print(f"\n## {el['text']}")
        elif el["kind"] == "Paragraph":
            print(" ".join(r["text"] for r in el["runs"]))
        elif el["kind"] == "Table":
            for row in el["rows"]:
                print("|", " | ".join(row), "|")
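
The table branch above prints bare pipe-delimited rows, which Markdown renderers generally do not treat as a table because there is no header separator. The sketch below produces a GFM-style pipe table by hand, assuming the first row is the header. `to_gfm_table` is a hypothetical helper written for illustration, and the sample rows are made up.

```python
def to_gfm_table(rows: list[list[str]]) -> str:
    """Render rows as a GFM pipe table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: list[str]) -> str:
        # Pad short rows, then escape pipes and flatten newlines so that
        # every cell stays inside its own column.
        cells = list(row) + [""] * (width - len(row))
        cells = [c.replace("|", "\\|").replace("\n", " ") for c in cells]
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


markdown = to_gfm_table([["Item", "Cost"], ["A|B", "10"], ["C"]])
print(markdown)
```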

PPTX: tables inside slide sections

Each slide is its own section. Iterate over the sections to recover the context slide by slide:

from office_oxide import Document

with Document.open("deck.pptx") as doc:
    ir = doc.to_ir()

for i, section in enumerate(ir["sections"], 1):
    for el in section["elements"]:
        if el["kind"] == "Table":
            rows = el["rows"]
            cols = len(rows[0]) if rows else 0  # guard against a table with no rows
            print(f"slide {i}: table {len(rows)}×{cols}")
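
When the tables are needed later rather than printed on the spot, the same loop can collect them by slide number. This is a sketch over the dictionary shape used in the examples. `tables_by_slide` is a hypothetical helper, and `sample_ir` is hand-written stand-in data, not real output from a deck.

```python
def tables_by_slide(ir: dict) -> dict[int, list[list[list[str]]]]:
    """Map 1-based slide numbers to the tables found on that slide."""
    result = {}
    for number, section in enumerate(ir.get("sections", []), 1):
        tables = [
            el["rows"]
            for el in section.get("elements", [])
            if el.get("kind") == "Table"
        ]
        if tables:  # slides without tables are left out of the mapping
            result[number] = tables
    return result


# Hand-written stand-in: slide 1 has no table, slide 2 has one
sample_ir = {
    "sections": [
        {"elements": [{"kind": "Paragraph", "runs": [{"text": "Title"}]}]},
        {"elements": [{"kind": "Table", "rows": [["Q1", "Q2"], ["10", "12"]]}]},
    ]
}

by_slide = tables_by_slide(sample_ir)
print(by_slide)
```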

When you need cell types, not strings

The IR table representation collapses cells into strings. To distinguish a number from text or a boolean in XLSX, use the format-specific accessor:

from office_oxide import Document

with Document.open("budget.xlsx") as doc:
    xlsx = doc.as_xlsx()
    for sheet in xlsx.sheets():
        for cell in sheet.cells():
            print(cell.address(), cell.value(), cell.value_type())

See also

Structured IR: full schema
Markdown extraction: GFM pipe tables out of the box
Writing XLSX cells: writing back to tables