Extract Text from Office Documents

Office Oxide gives you one entry point — extract_text() (or Document.open(...).plain_text()) — that works the same way across DOCX, XLSX, PPTX, DOC, XLS, and PPT. No format-specific code paths.

One-shot helper

The fastest path: a single function call that opens the file, runs the format-appropriate extractor, and returns a string.

Rust

use office_oxide::extract_text;

let text = extract_text("report.docx")?;
println!("{text}");

Python

import office_oxide

text = office_oxide.extract_text("report.docx")
print(text)

JavaScript

import { extractText } from 'office-oxide';

console.log(extractText('report.docx'));

Go

import officeoxide "github.com/yfedoseev/office_oxide/go"

text, err := officeoxide.ExtractText("report.docx")

C#

using OfficeOxide;

string text = OfficeOxide.ExtractText("report.docx");
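
Because the same call covers every format, a batch job does not need to branch on file type. The sketch below is one way to run the one-shot helper over a mixed list of files and collect failures instead of stopping at the first bad file. The `extract_many` helper and the `SUPPORTED` set are hypothetical names introduced here, and catching a bare `Exception` is an assumption, since this page does not document which exception types the library raises.

```python
from pathlib import Path

# Extensions named in this guide; anything else is skipped.
SUPPORTED = {".docx", ".xlsx", ".pptx", ".doc", ".xls", ".ppt"}


def extract_many(paths, extractor=None):
    """Return (texts, errors), two dicts keyed by path string."""
    if extractor is None:
        # Imported lazily so the helper can be exercised with a stub.
        import office_oxide
        extractor = office_oxide.extract_text

    texts, errors = {}, {}
    for path in map(Path, paths):
        if path.suffix.lower() not in SUPPORTED:
            continue  # not an Office format covered here
        try:
            texts[str(path)] = extractor(str(path))
        except Exception as exc:  # assumption: exception types are not documented here
            errors[str(path)] = repr(exc)
    return texts, errors
```

Calling `extract_many(["report.docx", "budget.xls"])` would use `office_oxide.extract_text` for both files; passing your own `extractor` makes the loop easy to test without real documents.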

Reusable handle

If you need text plus other outputs (Markdown, HTML, IR), open the document once and reuse the handle:

Rust

use office_oxide::Document;

let doc = Document::open("report.docx")?;
let text = doc.plain_text();
let md   = doc.to_markdown();

Python

from office_oxide import Document

with Document.open("report.docx") as doc:
    text = doc.plain_text()
    md   = doc.to_markdown()

JavaScript

import { Document } from 'office-oxide';

using doc = Document.open('report.docx');
const text = doc.plainText();
const md   = doc.toMarkdown();

What you get per format

| Format | Output |
|---|---|
| DOCX | Body text in document order, plus headers and footers; soft hyphens stripped |
| XLSX | Cell values across every sheet, tab-separated within a row, blank line between sheets |
| PPTX | Slide title, body placeholders, table cells, and notes; one paragraph block per slide |
| DOC | Same shape as DOCX, parsed directly from the CFB piece table |
| XLS | Same shape as XLSX, parsed directly from BIFF8 records |
| PPT | Same shape as PPTX, parsed from the PowerPoint Document stream |
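
The XLSX and XLS layout described above (tabs between cells, a blank line between sheets) is regular enough to split back into rows. The following is a minimal sketch under that assumption; `split_sheets` is a hypothetical helper, not part of the library, and it cannot distinguish a sheet boundary from an entirely empty row, nor handle cells that themselves contain tabs or newlines.

```python
def split_sheets(text):
    """Split extracted spreadsheet text into a list of sheets,
    each a list of rows, each a list of cell strings."""
    sheets, rows = [], []
    for line in text.splitlines():
        if line == "":
            # Blank line: treated as a sheet boundary.
            if rows:
                sheets.append(rows)
                rows = []
        else:
            rows.append(line.split("\t"))
    if rows:
        sheets.append(rows)  # last sheet has no trailing blank line
    return sheets
```

If you need typed cell values or sheet names, the plain-text form is the wrong tool; this only recovers the grid of strings.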

From bytes (no temp file)

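Load a document straight from an in-memory buffer. This is useful in serverless and streaming pipelines, where writing a temp file is awkward or not possible.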

Python

import requests
from office_oxide import Document

data = requests.get("https://example.com/report.docx").content
with Document.from_bytes(data, "docx") as doc:
    print(doc.plain_text())

JavaScript

import { Document } from 'office-oxide';

const res = await fetch('https://example.com/report.docx');
const data = new Uint8Array(await res.arrayBuffer());
using doc = Document.fromBytes(data, 'docx');
console.log(doc.plainText());

Rust

use std::io::Cursor;
use office_oxide::{Document, DocumentFormat};

let data = std::fs::read("report.docx")?;
let doc = Document::from_reader(Cursor::new(data), DocumentFormat::Docx)?;
let text = doc.plain_text();
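
When the bytes come from a URL or an upload, the format hint usually has to be derived from the file name. Here is a small sketch of that step. The `format_hint` helper is hypothetical, and it assumes the hint strings accepted by `Document.from_bytes` are the lowercase extensions; this page only confirms `"docx"`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: each format's hint is its lowercase extension.
KNOWN_FORMATS = {"docx", "xlsx", "pptx", "doc", "xls", "ppt"}


def format_hint(url_or_name):
    """Derive a format hint from a URL or file name, ignoring query strings."""
    path = urlparse(url_or_name).path
    ext = PurePosixPath(path).suffix.lower().lstrip(".")
    if ext not in KNOWN_FORMATS:
        raise ValueError(f"unsupported or missing extension: {url_or_name!r}")
    return ext


# Intended use, following the Python example above:
#   with Document.from_bytes(data, format_hint(url)) as doc:
#       print(doc.plain_text())
```

Extensions can be wrong or absent, so a production pipeline may prefer the response's content type or a magic-byte check over the file name.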

Performance

| Format | Files | Mean | p99 | Pass rate |
|---|---|---|---|---|
| DOCX | 2,538 | 0.8 ms | 3.9 ms | 98.9% |
| XLSX | 1,802 | 5.0 ms | 40 ms | 97.8% |
| PPTX | 806 | 0.7 ms | 3.9 ms | 98.4% |
| DOC | 246 | 0.3 ms | 3.4 ms | 94.7% |
| XLS | 494 | 2.8 ms | 75 ms | 99.2% |
| PPT | 176 | 0.7 ms | 6.6 ms | 100% |

See Performance for the full benchmark methodology.
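
To compare these figures with your own corpus, you can time the extractor per file and summarize the samples. The sketch below is not the benchmark's method: `summarize` and `time_extractions` are hypothetical helpers, and the nearest-rank p99 is one common percentile definition chosen here for simplicity.

```python
import time


def summarize(samples_ms):
    """Mean and nearest-rank p99 of a non-empty list of latencies in ms."""
    ordered = sorted(samples_ms)
    # Nearest rank, 1-based: ceil(0.99 * n), computed with integer arithmetic.
    rank = -(-99 * len(ordered) // 100)
    return {
        "mean_ms": sum(ordered) / len(ordered),
        "p99_ms": ordered[rank - 1],
    }


def time_extractions(paths, extractor):
    """Time extractor(path) for each path and summarize wall-clock latency."""
    samples = []
    for path in paths:
        start = time.perf_counter()
        extractor(path)
        samples.append((time.perf_counter() - start) * 1000.0)
    return summarize(samples)
```

Passing `office_oxide.extract_text` as `extractor` would time the one-shot helper, including file I/O, so results depend on disk caching as well as on the parser.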

See also