What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

Видобування таблиць

Office Oxide трактує таблиці як first-class IR-елементи: кожен <w:tbl> у DOCX, кожен діапазон у XLSX і кожен <a:tbl> у PPTX повертається як типізований Table { rows: [[клітинка, ...]] }. Один цикл — усі три формати.

Пройти всі таблиці в документі

Python

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Table":
            for row in el["rows"]:
                print(row)

Rust

use office_oxide::Document;
use office_oxide::ir::Element;

let doc = Document::open("report.docx")?;
let ir = doc.to_ir();

for section in &ir.sections {
    for el in &section.elements {
        if let Element::Table(t) = el {
            for row in &t.rows {
                println!("{row:?}");
            }
        }
    }
}

JavaScript

using doc = Document.open('report.docx');
const ir = doc.toIr();

for (const section of ir.sections) {
  for (const el of section.elements) {
    if (el.kind === 'Table') {
      for (const row of el.rows) {
        console.log(row);
      }
    }
  }
}

Go

doc, err := officeoxide.Open("report.docx")
if err != nil { log.Fatal(err) }
defer doc.Close()

irJSON, _ := doc.ToIRJSON()
var ir struct {
    Sections []struct {
        Elements []struct {
            Kind string     `json:"kind"`
            Rows [][]string `json:"rows"`
        } `json:"elements"`
    } `json:"sections"`
}
json.Unmarshal([]byte(irJSON), &ir)

for _, section := range ir.Sections {
    for _, el := range section.Elements {
        if el.Kind == "Table" {
            for _, row := range el.Rows {
                fmt.Println(row)
            }
        }
    }
}

C#

using OfficeOxide;
using System.Text.Json;

using var doc = Document.Open("report.docx");
using var ir = JsonDocument.Parse(doc.ToIrJson());

foreach (var section in ir.RootElement.GetProperty("sections").EnumerateArray())
{
    foreach (var el in section.GetProperty("elements").EnumerateArray())
    {
        if (el.GetProperty("kind").GetString() != "Table") continue;
        foreach (var row in el.GetProperty("rows").EnumerateArray())
        {
            Console.WriteLine(string.Join(" | ", row.EnumerateArray().Select(c => c.GetString())));
        }
    }
}

XLSX: одна таблиця на діапазон листа

Для таблиць кожній секції відповідає лист, а таблиці мапляться на виявлені використовувані діапазони. Порожні клітинки виходять порожніми рядками; об’єднані клітинки розгортаються у значення верхньої-лівої, решта порожні.

Python

import csv
from office_oxide import Document

with Document.open("budget.xlsx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    sheet_name = section.get("title", "Sheet")
    out_path = f"{sheet_name}.csv"
    with open(out_path, "w", newline="") as f:
        w = csv.writer(f)
        for el in section["elements"]:
            if el["kind"] == "Table":
                for row in el["rows"]:
                    w.writerow(row)

Для багатшого доступу по клітинках (формули, об’єднання, іменовані діапазони) — у форматно-специфічний модуль:

with Document.open("budget.xlsx") as doc:
    xlsx = doc.as_xlsx()
    for sheet in xlsx.sheets():
        print(sheet.name(), sheet.dimensions())

DOCX: таблиці перемежовуються з абзацами

IR зберігає вихідний порядок абзаців і таблиць — можна відновити потік:

from office_oxide import Document

with Document.open("report.docx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    for el in section["elements"]:
        if el["kind"] == "Heading":
            print(f"\n## {el['text']}")
        elif el["kind"] == "Paragraph":
            print(" ".join(r["text"] for r in el["runs"]))
        elif el["kind"] == "Table":
            for row in el["rows"]:
                print("|", " | ".join(row), "|")

PPTX: таблиці всередині слайдових секцій

Кожен слайд — окрема секція. Ітеруйте секції, щоб відновити контекст слайд за слайдом:

with Document.open("deck.pptx") as doc:
    ir = doc.to_ir()

for i, section in enumerate(ir["sections"], 1):
    for el in section["elements"]:
        if el["kind"] == "Table":
            print(f"слайд {i}: таблиця {len(el['rows'])}×{len(el['rows'][0])}")

Коли потрібні типи клітинок, а не рядки

Табличне представлення IR схлопує клітинки до рядків. Щоб у XLSX розрізнити число від тексту чи булевого — використовуйте форматно-специфічний аксесор:

with Document.open("budget.xlsx") as doc:
    xlsx = doc.as_xlsx()
    for sheet in xlsx.sheets():
        for cell in sheet.cells():
            print(cell.address(), cell.value(), cell.value_type())

Дивіться також

Структурований IR — повна схема
Видобування Markdown — GFM pipe-таблиці з коробки
Запис клітинок XLSX — зворотний запис у таблиці