What is the fastest Python library for DOCX, XLSX, and PPTX?

Office Oxide is the fastest. DOCX text extraction averages 0.8ms (vs 11.8ms for python-docx — 14× faster). XLSX averages 5.0ms (vs 94.5ms for openpyxl — 18× faster). PPTX averages 0.7ms (vs 32.5ms for python-pptx — 46× faster). Benchmarked on 6,062 real-world files.

Is Office Oxide free for commercial use?

Yes. Office Oxide is dual-licensed MIT OR Apache-2.0 — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL or copyleft restrictions.

Does Office Oxide handle legacy .doc, .xls, and .ppt files?

Yes. Office Oxide reads all six formats: DOCX, XLSX, PPTX, plus legacy DOC, XLS, PPT. It is the only Rust or Python library that supports all three legacy formats without a JVM (Apache Tika) or external binaries (catdoc, antiword).

Can Office Oxide convert documents to Markdown?

Yes. Every supported format has built-in to_markdown() that preserves headings, tables, lists, and structure — ideal for LLM and RAG pipelines. No separate package needed.

How does Office Oxide compare to calamine and openpyxl for XLSX?

On 1,802 XLSX files: Office Oxide averages 5.0ms (97.8% pass rate). python-calamine averages 13.9ms (96.6%). openpyxl averages 94.5ms (96.2%). Office Oxide is 2.8× faster than calamine and 18× faster than openpyxl, with the highest pass rate.

Does Office Oxide work in the browser?

Yes. Office Oxide ships a WASM build (office-oxide-wasm on npm) that runs in any browser or bundler. Process Office documents client-side with no server round-trips — useful for privacy-sensitive workloads.

Office documents for RAG

Retrieval-augmented generation (RAG) pipelines live and die by ingestion quality. Office Oxide provides three outputs (plain text, Markdown, structured IR) that map directly onto the three needs of a RAG pipeline: body text for embeddings, structure for chunking, and metadata for citations.

Choosing an output

Goal	Use
Cheapest embeddings, minimal token cost	`plain_text()`
Structure-preserving chunks (best retrieval quality)	`to_markdown()`
Chunk plus citation by section/slide/cell	`to_ir()`

For most projects the sweet spot is to_markdown(): it preserves headings (natural chunk boundaries), keeps tables queryable, and stays small enough to embed without blowing up the token count.
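
The choice between the three outputs can be captured in a small dispatch helper. This is a minimal sketch: the goal names and the `extract_for` function are hypothetical, and the only thing assumed about the document object is that it exposes the `plain_text()`, `to_markdown()`, and `to_ir()` methods listed in the table.

```python
from typing import Any

# Hypothetical goal names mapped to the three extraction methods from the table.
_METHOD_BY_GOAL = {
    "cheap": "plain_text",        # smallest output, lowest token cost
    "structured": "to_markdown",  # headings and tables preserved
    "citations": "to_ir",         # per-section structure for locators
}


def extract_for(doc: Any, goal: str = "structured") -> Any:
    """Call the extraction method matching `goal` on an already opened document."""
    try:
        method_name = _METHOD_BY_GOAL[goal]
    except KeyError:
        expected = sorted(_METHOD_BY_GOAL)
        raise ValueError(f"unknown goal {goal!r}; expected one of {expected}") from None
    return getattr(doc, method_name)()
```

Defaulting to "structured" mirrors the recommendation to start with to_markdown() and switch only when token cost or citation precision requires it.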

Chunking by Markdown headings

The Markdown output uses # / ## / ### for the source headings. Splitting on them yields semantically coherent chunks for free.

from office_oxide import Document

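def chunk_by_heading(md: str, level: int = 2) -> list[str]:
    """Split Markdown into chunks that each start at a heading of `level`."""
    chunks, current = [], []
    for line in md.splitlines():
        # The trailing space keeps deeper headings (e.g. "### " when level
        # is 2) inside the current chunk instead of starting a new one.
        if line.startswith("#" * level + " "):
            if current:
                chunks.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    # Flush the last chunk; text before the first heading forms its own chunk.
    if current:
        chunks.append("\n".join(current))
    return chunks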

with Document.open("report.docx") as doc:
    md = doc.to_markdown()

chunks = chunk_by_heading(md, level=2)
for c in chunks:
    print(len(c), c[:60].replace("\n", " "))
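
To see how the splitting rule behaves without opening a file, the sketch below restates the function so it runs on its own and applies it to a small hand-written Markdown string. The sample text is invented for illustration and is not real library output.

```python
def chunk_by_heading(md: str, level: int = 2) -> list[str]:
    """Same splitting rule as above: a new chunk starts at each heading of `level`."""
    prefix = "#" * level + " "
    chunks, current = [], []
    for line in md.splitlines():
        # Close the running chunk when a heading of the target level appears.
        if line.startswith(prefix) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


# Invented sample: a title, two level-2 sections, one nested level-3 heading.
sample_md = """# Annual report
Intro paragraph.
## Revenue
Revenue grew.
### By region
EMEA led.
## Costs
Costs were flat."""

chunks = chunk_by_heading(sample_md, level=2)
for chunk in chunks:
    print(repr(chunk.splitlines()[0]), len(chunk.splitlines()), "lines")
```

The title and intro land in a chunk of their own, and the level-3 heading stays inside the "Revenue" chunk, so splitting at level 2 yields three chunks here.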

Chunking via the IR for precise citations

If the retrieved context needs to cite slide 3 or the sheet "Q4 Forecast", work from the IR. Each section carries its native locator:

from office_oxide import Document

with Document.open("deck.pptx") as doc:
    ir = doc.to_ir()

chunks = []
for i, section in enumerate(ir["sections"], 1):
    title = section.get("title") or f"Slide {i}"
    body = []
    for el in section["elements"]:
        if el["kind"] == "Heading":
            body.append("# " + el["text"])
        elif el["kind"] == "Paragraph":
            body.append(" ".join(r["text"] for r in el["runs"]))
        elif el["kind"] == "Table":
            for row in el["rows"]:
                body.append(" | ".join(row))
    chunks.append({
        "source": "deck.pptx",
        "locator": f"slide:{i}",
        "title": title,
        "text": "\n".join(body),
    })

Retrieved chunks now carry an exact locator (slide:3 / sheet:Q4 Forecast / section:2) for citations.
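
The example above hard-codes the `slide:` prefix because it reads a PPTX file. One way to produce all three locator styles is to pick the prefix from the file extension. This is a sketch under assumptions: `make_locator`, `format_citation`, and the extension mapping are hypothetical helpers inferred from the slide:/sheet:/section: examples, not part of the library.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from file extension to locator prefix (not a library API).
_PREFIX_BY_SUFFIX = {
    ".pptx": "slide", ".ppt": "slide",
    ".xlsx": "sheet", ".xls": "sheet",
    ".docx": "section", ".doc": "section",
}


def make_locator(path: str, index: int, title: Optional[str] = None) -> str:
    """Build a locator such as 'slide:3', 'sheet:Q4 Forecast', or 'section:2'."""
    prefix = _PREFIX_BY_SUFFIX.get(Path(path).suffix.lower(), "section")
    # Sheets are cited by name when one is known; everything else by position.
    if prefix == "sheet" and title:
        return f"{prefix}:{title}"
    return f"{prefix}:{index}"


def format_citation(chunk: dict) -> str:
    """Render a chunk dict (source, locator, title) as a short citation string."""
    return f'{chunk["source"]} [{chunk["locator"]}] {chunk["title"]}'
```

With these helpers, the `"locator"` value in the chunk dict could be written as `make_locator(path, i, section.get("title"))` instead of a fixed f-string.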

LangChain integration

from langchain_core.documents import Document as LCDoc
from office_oxide import Document

def load_office(path: str) -> list[LCDoc]:
    with Document.open(path) as doc:
        ir = doc.to_ir()
    out = []
    for i, section in enumerate(ir["sections"], 1):
        body_lines = []
        for el in section["elements"]:
            if el["kind"] == "Paragraph":
                body_lines.append(" ".join(r["text"] for r in el["runs"]))
            elif el["kind"] == "Heading":
                body_lines.append(el["text"])
        if not body_lines:
            continue
        out.append(LCDoc(
            page_content="\n".join(body_lines),
            metadata={
                "source": path,
                "section_index": i,
                "section_title": section.get("title"),
            },
        ))
    return out

docs = load_office("report.docx")

Pass the result to Chroma.from_documents(docs, embedder) (or any other vector store) as usual.

LlamaIndex integration

from llama_index.core import Document as LIDoc
from office_oxide import Document

def load_office(path: str) -> list[LIDoc]:
    with Document.open(path) as doc:
        md = doc.to_markdown()
    return [LIDoc(text=md, metadata={"source": path})]

For per-section nodes, use the IR pattern above and pass each chunk as a separate LIDoc.
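
A minimal sketch of that per-section conversion follows. It assumes chunk dicts shaped like the ones built in the IR example (`source`, `locator`, `title`, `text`) and takes the document constructor as a parameter, so the snippet runs without LlamaIndex installed. `chunks_to_documents` is a hypothetical helper name.

```python
from typing import Any, Callable


def chunks_to_documents(chunks: list[dict], make_doc: Callable[..., Any]) -> list[Any]:
    """Create one document per chunk, carrying the locator along as metadata."""
    docs = []
    for chunk in chunks:
        # Skip sections that produced no extractable text.
        if not chunk["text"].strip():
            continue
        metadata = {key: chunk[key] for key in ("source", "locator", "title")}
        docs.append(make_doc(text=chunk["text"], metadata=metadata))
    return docs
```

In a real pipeline the call would be `chunks_to_documents(chunks, LIDoc)`, relying on the same `text=` and `metadata=` keyword arguments used in the loader above.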

Tables are the hard part

LLMs handle small Markdown tables well. Large tables (50+ rows) are better summarized or paginated:

def summarize_table(rows: list[list[str]]) -> str:
    # An empty table has no header row to report.
    if not rows:
        return "Empty table."
    headers = rows[0]
    body = rows[1:]
    return f"Table with columns {headers} and {len(body)} rows. Sample: {body[:3]}"
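
Pagination, the other option mentioned above, can be sketched the same way. The helper below is hypothetical: it assumes the first row is the header, that every cell is already a string, and that cells contain no `|` characters (which would need escaping in a real pipeline).

```python
def paginate_table(rows: list[list[str]], page_size: int = 50) -> list[str]:
    """Split a table into Markdown tables with at most `page_size` body rows each."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    head = "| " + " | ".join(header) + " |"
    rule = "| " + " | ".join("---" for _ in header) + " |"
    pages = []
    # Repeat the header on every page so each page is readable on its own.
    for start in range(0, max(len(body), 1), page_size):
        lines = [head, rule]
        lines += ["| " + " | ".join(row) + " |" for row in body[start:start + page_size]]
        pages.append("\n".join(lines))
    return pages
```

Each page can then be embedded as a separate chunk, with the page number added to the locator if citations need that granularity.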

For dashboards (XLSX), consider per-sheet summaries instead of a full cell dump: an LLM gets more out of "Sheet 'Q4' totals $4.2M in revenue across 12 regions" than from 5,000 cell values.

Performance and cost

Operation	Time per file (DOCX, median)	Notes
`plain_text()`	0.8 ms	cheapest
`to_markdown()`	~1.5 ms	recommended for RAG
`to_ir()`	~1.2 ms	when you need structure

A corpus of one million documents extracts in ~25 minutes on a single thread and ~3 minutes on 8 cores. The dominant cost in your RAG pipeline will be embedding API calls, not Office parsing.
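
The corpus estimate can be reproduced with simple arithmetic. The sketch below assumes the ~1.5 ms per-file to_markdown() figure from the table and perfectly linear scaling across workers. Both are simplifying assumptions for illustration, and real throughput depends on file sizes and I/O.

```python
def estimate_minutes(n_docs: int, ms_per_doc: float, workers: int = 1) -> float:
    """Estimated wall-clock minutes, assuming linear scaling across workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    total_ms = n_docs * ms_per_doc
    return total_ms / workers / 60_000  # 60,000 ms per minute


single = estimate_minutes(1_000_000, 1.5)
parallel = estimate_minutes(1_000_000, 1.5, workers=8)
print(f"{single:.1f} min on one thread, {parallel:.1f} min on 8 workers")
```

Under these assumptions the single-thread figure comes out at 25 minutes and the 8-worker figure at just over 3 minutes, matching the numbers quoted above.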

See also

Markdown extraction: full output specification
Structured IR: schema for citation-aware chunking
Batch processing: parallelism patterns