What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

Office documents for RAG

Retrieval-augmented generation pipelines live and die by ingestion quality. Office Oxide provides three outputs (plain text, Markdown, and a structured IR) that map directly onto the three needs of a RAG pipeline: body text for embeddings, structure for chunking, and metadata for citations.

Choosing an output

Goal	Use
Cheapest embeddings, minimal token cost	`plain_text()`
Structure-preserving chunks (best retrieval quality)	`to_markdown()`
Chunk plus citation by section, slide, or cell	`to_ir()`

For most projects the sweet spot is to_markdown(): it preserves headings (natural chunk boundaries), keeps tables queryable, and stays small enough to embed without a token blow-up.

Heading-based chunking from Markdown

The Markdown output uses # / ## / ### for headings. Splitting on them gives you semantically coherent chunks for free.

from office_oxide import Document

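def chunk_by_heading(md: str, level: int = 2):
    """Split Markdown into chunks that each start at a heading of exactly `level`."""
    marker = "#" * level + " "  # e.g. "## " for level=2
    chunks, current = [], []
    for line in md.splitlines():
        if line.startswith(marker):
            # A new heading closes the chunk collected so far.
            if current:
                chunks.append("\n".join(current))
            current = [line]
        else:
            # Body text, plus headings at other levels, stays in the current chunk.
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks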

with Document.open("report.docx") as doc:
    md = doc.to_markdown()

chunks = chunk_by_heading(md, level=2)
for c in chunks:
    print(len(c), c[:60].replace("\n", " "))
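
A chunk cut at "## Sales" loses the title of the document it came from, which can hurt retrieval. The variant below is a minimal sketch, not part of Office Oxide: it splits at every heading up to max_level, records the trail of parent headings for each chunk, and ignores heading-like lines inside fenced code. The names chunk_with_heading_path, HEADING_RE, and FENCE are hypothetical, and the sketch assumes ATX-style headings as described above.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
HEADING_RE = re.compile(r"^(#{1,6}) +(.*\S)\s*$")

def chunk_with_heading_path(md: str, max_level: int = 2) -> list[dict]:
    """Split at headings up to max_level and keep each chunk's heading trail."""
    chunks = []
    path = []        # (level, title) pairs for the headings currently open
    current = []     # lines of the chunk being collected
    in_fence = False

    def flush():
        text = "\n".join(current).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        current.clear()

    for line in md.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()  # close the previous chunk under its own heading trail
            level = len(match.group(1))
            # Close headings at the same or a deeper level, then open this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        current.append(line)
    flush()
    return chunks
```

Prepending " > ".join(chunk["path"]) to the text before embedding is one way to use the trail; whether that helps depends on your corpus and embedding model.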

IR-based chunking for precise citations

If the retrieved context needs to cite slide 3 or the sheet “Q4 Forecast”, walk the IR. Each section carries its native locator:

from office_oxide import Document

with Document.open("deck.pptx") as doc:
    ir = doc.to_ir()

chunks = []
for i, section in enumerate(ir["sections"], 1):
    # Fall back to a positional title when the section has none.
    title = section.get("title") or f"Slide {i}"
    body = []
    for el in section["elements"]:
        if el["kind"] == "Heading":
            body.append("# " + el["text"])
        elif el["kind"] == "Paragraph":
            # A paragraph is a list of runs; join their text.
            body.append(" ".join(r["text"] for r in el["runs"]))
        elif el["kind"] == "Table":
            # One line per table row, cells separated by " | ".
            for row in el["rows"]:
                body.append(" | ".join(row))
    chunks.append({
        "source": "deck.pptx",
        "locator": f"slide:{i}",  # 1-based slide number, used for citations
        "title": title,
        "text": "\n".join(body),
    })

Your retrieved chunks now carry an exact locator (slide:3 / sheet:Q4 Forecast / section:2) for citations.
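
The loop above hard-codes the slide: prefix, which only suits presentations. As a hedged sketch, the helper below picks the prefix from the file extension so the same loop can produce the other two locator forms. The mapping, the helper names (LOCATOR_KIND, make_locator, format_citation), and the idea that a spreadsheet section's title is its sheet name are all assumptions for illustration, not documented Office Oxide behavior.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from file extension to locator prefix; adjust to your corpus.
LOCATOR_KIND = {
    ".pptx": "slide", ".ppt": "slide",
    ".xlsx": "sheet", ".xls": "sheet",
    ".docx": "section", ".doc": "section",
}

def make_locator(path: str, index: int, title: Optional[str] = None) -> str:
    """Build a locator string such as 'slide:3' or 'sheet:Q4 Forecast'."""
    kind = LOCATOR_KIND.get(Path(path).suffix.lower(), "section")
    # Sheets are cited by name when one is available; everything else by position.
    if kind == "sheet" and title:
        return f"sheet:{title}"
    return f"{kind}:{index}"

def format_citation(chunk: dict) -> str:
    """Render a chunk dict (with 'source' and 'locator' keys) as a short citation."""
    return f'{chunk["source"]} ({chunk["locator"]})'
```

Inside the loop, "locator": make_locator(path, i, section.get("title")) would replace the hard-coded f-string.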

LangChain integration

from langchain_core.documents import Document as LCDoc
from office_oxide import Document

def load_office(path: str) -> list[LCDoc]:
    with Document.open(path) as doc:
        ir = doc.to_ir()
    out = []
    for i, section in enumerate(ir["sections"], 1):
        body_lines = []
        for el in section["elements"]:
            if el["kind"] == "Paragraph":
                body_lines.append(" ".join(r["text"] for r in el["runs"]))
            elif el["kind"] == "Heading":
                body_lines.append(el["text"])
        if not body_lines:
            continue
        out.append(LCDoc(
            page_content="\n".join(body_lines),
            metadata={
                "source": path,
                "section_index": i,
                "section_title": section.get("title"),
            },
        ))
    return out

docs = load_office("report.docx")

Pass the result to Chroma.from_documents(docs, embedder) (or any other vector store) as usual.

LlamaIndex integration

from llama_index.core import Document as LIDoc
from office_oxide import Document

def load_office(path: str) -> list[LIDoc]:
    with Document.open(path) as doc:
        md = doc.to_markdown()
    return [LIDoc(text=md, metadata={"source": path})]

For per-section nodes, use the IR pattern above and pass each chunk as a separate LIDoc.

Tables are the hard part

LLMs handle small Markdown tables well. Large tables (50+ rows) are better summarized or paginated:

def summarize_table(rows: list[list[str]]) -> str:
    if not rows:
        return "Empty table."
    # The first row is treated as the header; the rest is the body.
    headers, body = rows[0], rows[1:]
    return f"Table with columns {headers} and {len(body)} rows. Sample: {body[:3]}"
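
The function above covers summarizing. For the pagination option, here is a minimal sketch that renders a large table as several Markdown tables and repeats the header row on each page, so every page is readable on its own. The name paginate_table and the default page size of 50 are illustrative choices, and the sketch treats the first row as the header.

```python
def paginate_table(rows: list[list[str]], page_size: int = 50) -> list[str]:
    """Render a table as Markdown pages, repeating the header row on each page."""
    if not rows:
        return []
    header, body = rows[0], rows[1:]

    def render(row) -> str:
        # Escape pipes so cell text cannot break the Markdown column layout.
        cells = (str(cell).replace("|", "\\|") for cell in row)
        return "| " + " | ".join(cells) + " |"

    head = [render(header), "| " + " | ".join("---" for _ in header) + " |"]
    pages = []
    for start in range(0, len(body), page_size):
        page_rows = [render(r) for r in body[start:start + page_size]]
        pages.append("\n".join(head + page_rows))
    # A header-only table still yields one page.
    return pages or ["\n".join(head)]
```

Each page can then be embedded as its own chunk, with the page number added to the chunk's metadata.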

For dashboards (XLSX), consider a per-sheet summary instead of a full cell dump: an LLM gets more out of “Sheet ‘Q4’ totals $4.2M in revenue across 12 regions” than from 5,000 cell values.

Performance and cost

Operation	Time per file (DOCX, median)	Notes
`plain_text()`	0.8 ms	cheapest
`to_markdown()`	~1.5 ms	recommended for RAG
`to_ir()`	~1.2 ms	when you need structure

A corpus of one million documents extracts in ~25 minutes single-threaded, or ~3 minutes on 8 cores. The dominant cost in your RAG pipeline will be embedding API calls, not Office parsing.
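
These figures are consistent with the ~1.5 ms to_markdown() median from the table. A back-of-the-envelope check, assuming every file costs the median and that work scales linearly across cores (real corpora will vary); extraction_minutes is a hypothetical helper:

```python
def extraction_minutes(n_docs: int, ms_per_doc: float, workers: int = 1) -> float:
    """Estimated wall-clock minutes, assuming perfect linear scaling across workers."""
    return n_docs * ms_per_doc / 1000 / 60 / workers

single = extraction_minutes(1_000_000, 1.5)     # 25.0 minutes on one thread
eight = extraction_minutes(1_000_000, 1.5, 8)   # 3.125 minutes on 8 cores
```

Swapping in 0.8 or 1.2 ms gives the corresponding estimates for plain_text() and to_ir().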

See also

Markdown extraction: full output specification
Structured IR: schema for citation-aware chunking
Batch processing: parallelism patterns