Extract Text from Office Documents
Office Oxide gives you one entry point, extract_text() (or Document.open(...).plain_text()), that works the same way across DOCX, XLSX, PPTX, DOC, XLS, and PPT, with no format-specific code paths.
One-shot helper
The fastest path: a single function call that opens the file, runs the format-appropriate extractor, and returns a string.
Rust
use office_oxide::extract_text;
let text = extract_text("report.docx")?;
println!("{text}");
Python
import office_oxide
text = office_oxide.extract_text("report.docx")
print(text)
JavaScript
import { extractText } from 'office-oxide';
console.log(extractText('report.docx'));
Go
import officeoxide "github.com/yfedoseev/office_oxide/go"
text, err := officeoxide.ExtractText("report.docx")
C#
using OfficeOxide;
string text = OfficeOxide.ExtractText("report.docx");
Reusable handle
If you need text plus other outputs (Markdown, HTML, IR), open the document once and reuse the handle:
Rust
use office_oxide::Document;
let doc = Document::open("report.docx")?;
let text = doc.plain_text();
let md = doc.to_markdown();
Python
from office_oxide import Document
with Document.open("report.docx") as doc:
    text = doc.plain_text()
    md = doc.to_markdown()
JavaScript
import { Document } from 'office-oxide';
using doc = Document.open('report.docx');
const text = doc.plainText();
const md = doc.toMarkdown();
What you get per format
| Format | Output |
|---|---|
| DOCX | Body text in document order, plus headers and footers; soft hyphens stripped |
| XLSX | Cell values across every sheet, tab-separated within a row, blank line between sheets |
| PPTX | Slide title, body placeholders, table cells, and notes; each slide forms one paragraph block |
| DOC | Same shape as DOCX, parsed directly from the piece table in the CFB container |
| XLS | Same shape as XLSX, parsed directly from BIFF8 records |
| PPT | Same shape as PPTX, parsed from the PowerPoint Document stream |
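The XLSX shape described in the table is easy to split back into rows and cells. The following is a minimal sketch, assuming the extracted string really uses a tab between cells, a newline between rows, and a blank line between sheets. `split_sheets` is a hypothetical helper, not part of office_oxide, and the sample string is made up for illustration.

```python
def split_sheets(text: str) -> list[list[list[str]]]:
    """Split XLSX-style extracted text into sheets -> rows -> cells.

    Assumes tab-separated cells, newline-separated rows, and a blank
    line between sheets (hypothetical helper, not a library function).
    """
    sheets = []
    # Normalise line endings, then split on blank lines: one chunk per sheet.
    for chunk in text.replace("\r\n", "\n").strip("\n").split("\n\n"):
        # Skip stray empty lines left over from runs of several blank lines.
        rows = [line.split("\t") for line in chunk.split("\n") if line != ""]
        if rows:
            sheets.append(rows)
    return sheets


# Made-up sample in the assumed shape: two sheets, two rows each.
sample = "name\tqty\nbolt\t4\n\nregion\ttotal\nEU\t10\n"
sheets = split_sheets(sample)
```

One caveat of this approach: a fully empty row inside a sheet would be indistinguishable from a sheet boundary. If you need exact cell positions, the structured outputs listed under "See also" are likely a better fit than plain text.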
From bytes (no temp file)
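Parse a document straight from an in-memory buffer. This is useful in serverless functions and streaming pipelines, where writing a temp file is slow or not possible.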
Python
import requests
from office_oxide import Document
data = requests.get("https://example.com/report.docx").content
with Document.from_bytes(data, "docx") as doc:
    print(doc.plain_text())
JavaScript
import { Document } from 'office-oxide';
const res = await fetch('https://example.com/report.docx');
const data = new Uint8Array(await res.arrayBuffer());
using doc = Document.fromBytes(data, 'docx');
console.log(doc.plainText());
Rust
use std::io::Cursor;
use office_oxide::{Document, DocumentFormat};
let data = std::fs::read("report.docx")?;
let doc = Document::from_reader(Cursor::new(data), DocumentFormat::Docx)?;
let text = doc.plain_text();
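Because the bytes-based constructors take an explicit format, a pipeline that handles mixed inputs needs a way to pick that value. A minimal sketch, assuming the Python format hint is simply the lowercase file extension (only "docx" is shown above; the other five strings are an assumption). `format_hint` and `SUPPORTED` are hypothetical names, not part of office_oxide.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed hint strings: only "docx" appears in the examples above.
SUPPORTED = {"docx", "xlsx", "pptx", "doc", "xls", "ppt"}


def format_hint(name_or_url: str) -> str:
    """Return a lowercase format hint derived from a file name or URL."""
    # urlparse drops any query string or fragment; a bare file name
    # ends up in .path unchanged.
    path = urlparse(name_or_url).path
    ext = PurePosixPath(path).suffix.lstrip(".").lower()
    if ext not in SUPPORTED:
        raise ValueError(f"unsupported or missing extension: {name_or_url!r}")
    return ext
```

Under that assumption, the Python example above could call Document.from_bytes(data, format_hint(url)) instead of hard-coding "docx". Extensions can be wrong or missing, so treat the result as a hint and still handle parse errors.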
Performance
| Format | Files | Mean time | p99 time | Pass rate |
|---|---|---|---|---|
| DOCX | 2,538 | 0.8 ms | 3.9 ms | 98.9% |
| XLSX | 1,802 | 5.0 ms | 40 ms | 97.8% |
| PPTX | 806 | 0.7 ms | 3.9 ms | 98.4% |
| DOC | 246 | 0.3 ms | 3.4 ms | 94.7% |
| XLS | 494 | 2.8 ms | 75 ms | 99.2% |
| PPT | 176 | 0.7 ms | 6.6 ms | 100% |
See Performance for the full benchmark methodology.
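Pass rates below 100% for most formats mean a batch job should expect the occasional file to fail. A minimal sketch of one way to keep going past failures, assuming the extractor signals failure by raising an exception (the exact exception type is not documented here, so the code catches broadly). `extract_many` and `fake_extract` are hypothetical; in real use you would pass office_oxide.extract_text as the `extract` argument.

```python
from typing import Callable, Iterable


def extract_many(paths: Iterable[str], extract: Callable[[str], str]):
    """Run `extract` on each path; collect texts and failures separately."""
    texts: dict[str, str] = {}
    failures: dict[str, str] = {}
    for path in paths:
        try:
            texts[path] = extract(path)
        except Exception as exc:  # exception type is unknown, so catch broadly
            failures[path] = f"{type(exc).__name__}: {exc}"
    return texts, failures


# Stand-in extractor so the sketch runs without the library installed.
def fake_extract(path: str) -> str:
    if path.endswith(".doc"):
        raise ValueError("corrupt file")
    return f"text of {path}"


texts, failures = extract_many(["a.docx", "b.doc"], fake_extract)
```

Keeping failures in their own mapping makes it straightforward to log them or retry them later without losing the text that did extract.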
See also
- Markdown extraction — same API, GitHub-flavored output
- HTML extraction — semantic HTML for previews and embeds
- Format-agnostic IR — structured JSON for pipelines and LLMs
- Tables — pulling structured rows out of XLSX, DOCX, PPTX