# Office Documents for RAG
Retrieval-augmented generation pipelines live and die by ingestion quality. Office Oxide gives you three outputs (text, Markdown, structured IR) that map cleanly onto the three things RAG pipelines need: a body for embedding, a structure for chunking, and metadata for citations.
## Pick your output

| Goal | Use |
|---|---|
| Cheapest embeddings, lowest token cost | `plain_text()` |
| Structure-preserving chunks (best retrieval quality) | `to_markdown()` |
| Programmatic chunk + cite by section/slide/cell | `to_ir()` |
For most projects, `to_markdown()` is the sweet spot: it preserves headings (so you get natural chunk boundaries), keeps tables queryable, and is small enough to embed without exploding token counts.
## Heading-aware chunking from Markdown

The Markdown output uses `#` / `##` / `###` for source headings. Split there and you get semantically coherent chunks "for free."
```python
from office_oxide import Document

def chunk_by_heading(md: str, level: int = 2) -> list[str]:
    """Split Markdown into chunks at headings of the given level."""
    chunks, current = [], []
    for line in md.splitlines():
        if line.startswith("#" * level + " "):
            if current:
                chunks.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

with Document.open("report.docx") as doc:
    md = doc.to_markdown()

chunks = chunk_by_heading(md, level=2)
for c in chunks:
    print(len(c), c[:60].replace("\n", " "))
```
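You can sanity-check the splitter on an inline string with no document on disk (the helper is repeated here so the snippet runs standalone; note that any preamble before the first `##` heading becomes its own chunk):

```python
def chunk_by_heading(md: str, level: int = 2) -> list[str]:
    # Same splitter as above, repeated so this snippet is self-contained.
    chunks, current = [], []
    for line in md.splitlines():
        if line.startswith("#" * level + " "):
            if current:
                chunks.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

md = "# Report\nIntro text.\n## Revenue\nUp 12%.\n## Costs\nFlat."
chunks = chunk_by_heading(md, level=2)
print(len(chunks))  # 3: the preamble, "## Revenue", "## Costs"
```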
## IR-based chunking for citation accuracy
If you need to cite slide 3 or sheet “Q4 Forecast” in your retrieved context, walk the IR. Each section carries the natural locator:
```python
from office_oxide import Document

with Document.open("deck.pptx") as doc:
    ir = doc.to_ir()

chunks = []
for i, section in enumerate(ir["sections"], 1):
    title = section.get("title") or f"Slide {i}"
    body = []
    for el in section["elements"]:
        if el["kind"] == "Heading":
            body.append("# " + el["text"])
        elif el["kind"] == "Paragraph":
            body.append(" ".join(r["text"] for r in el["runs"]))
        elif el["kind"] == "Table":
            for row in el["rows"]:
                body.append(" | ".join(row))
    chunks.append({
        "source": "deck.pptx",
        "locator": f"slide:{i}",
        "title": title,
        "text": "\n".join(body),
    })
```
Now your retrieved chunks have a precise locator (`slide:3` / `sheet:Q4 Forecast` / `section:2`) for citations.
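When you feed retrieved chunks back into a prompt, a small formatter can turn the chunk dict into an inline citation string. A minimal sketch, assuming the exact field names built above (`format_citation` is a hypothetical helper, not part of the library):

```python
def format_citation(chunk: dict) -> str:
    # Renders e.g. "[Q4 Outlook (deck.pptx, slide:3)]" for prompt context.
    return f"[{chunk['title']} ({chunk['source']}, {chunk['locator']})]"

chunk = {
    "source": "deck.pptx",
    "locator": "slide:3",
    "title": "Q4 Outlook",
    "text": "...",
}
print(format_citation(chunk))  # [Q4 Outlook (deck.pptx, slide:3)]
```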
## LangChain integration
```python
from langchain_core.documents import Document as LCDoc
from office_oxide import Document

def load_office(path: str) -> list[LCDoc]:
    with Document.open(path) as doc:
        ir = doc.to_ir()
    out = []
    for i, section in enumerate(ir["sections"], 1):
        body_lines = []
        for el in section["elements"]:
            if el["kind"] == "Paragraph":
                body_lines.append(" ".join(r["text"] for r in el["runs"]))
            elif el["kind"] == "Heading":
                body_lines.append(el["text"])
        if not body_lines:
            continue
        out.append(LCDoc(
            page_content="\n".join(body_lines),
            metadata={
                "source": path,
                "section_index": i,
                "section_title": section.get("title"),
            },
        ))
    return out

docs = load_office("report.docx")
```
Drop into `Chroma.from_documents(docs, embedder)` (or any vectorstore) as usual.
## LlamaIndex integration
```python
from llama_index.core import Document as LIDoc
from office_oxide import Document

def load_office(path: str) -> list[LIDoc]:
    with Document.open(path) as doc:
        md = doc.to_markdown()
    return [LIDoc(text=md, metadata={"source": path})]
```
For per-section nodes, use the IR-based pattern above and pass each chunk as a separate `LIDoc`.
## Tables — the hard part
LLMs handle small tables well in Markdown form. Big tables (50+ rows) are better summarized or paginated:
```python
def summarize_table(rows: list[list[str]]) -> str:
    headers = rows[0]
    body = rows[1:]
    return f"Table with columns {headers} and {len(body)} rows. Sample: {body[:3]}"
```
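For the pagination route, here is a minimal sketch that splits a large table into windows of rows, repeating the header on every page so each chunk stays self-describing when embedded on its own (`paginate_table` and the `page_size` default are illustrative choices, not library API):

```python
def paginate_table(rows: list[list[str]], page_size: int = 25) -> list[str]:
    # Each page repeats the header row so chunks remain self-describing.
    header, body = rows[0], rows[1:]
    pages = []
    for start in range(0, len(body), page_size):
        window = [header] + body[start:start + page_size]
        pages.append("\n".join(" | ".join(r) for r in window))
    return pages

rows = [["region", "revenue"]] + [[f"r{i}", str(i)] for i in range(60)]
pages = paginate_table(rows, page_size=25)
print(len(pages))  # 3 pages: 25 + 25 + 10 rows
```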
For dashboards (XLSX), consider extracting per-sheet summaries rather than full cell dumps — the LLM benefits more from “Sheet ‘Q4’ totals revenue $4.2M across 12 regions” than from 5,000 cell values.
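A sketch of that sheet-summary idea, assuming each sheet arrives as rows of `[region, revenue]` strings after a header row (the column layout here is an assumption for illustration, not part of the IR schema):

```python
def summarize_sheet(name: str, rows: list[list[str]]) -> str:
    # ASSUMPTION: column 1 holds a numeric revenue figure per region.
    body = rows[1:]  # skip header row
    total = sum(float(r[1]) for r in body)
    return f"Sheet '{name}' totals revenue ${total / 1e6:.1f}M across {len(body)} regions"

rows = [["region", "revenue"]] + [[f"R{i}", "350000"] for i in range(12)]
print(summarize_sheet("Q4", rows))
# Sheet 'Q4' totals revenue $4.2M across 12 regions
```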
## Performance & cost
| Op | Time per file (DOCX, median) | Notes |
|---|---|---|
| `plain_text()` | 0.8 ms | cheapest |
| `to_markdown()` | ~1.5 ms | recommended for RAG |
| `to_ir()` | ~1.2 ms | when you need structure |
A million-document corpus extracts in ~25 minutes single-threaded, ~3 minutes on 8 cores. The dominant cost in your RAG pipeline will be embedding API calls, not Office parsing.
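Those corpus-level figures follow directly from the per-file latency; a quick back-of-the-envelope check (assuming near-linear scaling across cores, which embarrassingly parallel batch extraction typically approaches):

```python
def corpus_minutes(n_files: int, ms_per_file: float, cores: int = 1) -> float:
    # Wall-clock estimate: total parse time spread evenly across cores.
    return n_files * ms_per_file / 1000 / 60 / cores

print(corpus_minutes(1_000_000, 1.5, cores=1))  # 25.0 minutes
print(corpus_minutes(1_000_000, 1.5, cores=8))  # 3.125 minutes
```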
## See also
- Markdown extraction — full output spec
- Structured IR — schema for citation-aware chunking
- Batch processing — parallelism patterns