Migrate from Apache Tika
Apache Tika is the de facto standard JVM library for extracting text from a huge variety of file formats, including DOCX, XLSX, PPTX, and the legacy DOC, XLS, and PPT. If your pipeline handles only Office documents, Office Oxide is the right replacement: the same six formats, no JVM, native speed, and simpler deployment.
When to migrate
Switch if any of these apply:
- Your ingestion pipeline only deals with Office documents (you don’t need PDF, EPUB, RTF, ODT, etc. that Tika also handles)
- You don’t want to ship and tune a JVM in your container / Lambda / desktop app
- You want native bindings in Python, Node.js, Go, C#, or Rust — not just a Java JAR
- Per-file latency matters; Tika’s startup cost and JVM warmup hurt short-lived workers
- You want structured Markdown / IR output for LLM and RAG pipelines
Stay on Tika if:
- You ingest a long tail of formats Office Oxide doesn’t cover (Tika handles ~1,400 file types)
- You already have a JVM ingestion service and adding native bindings isn’t worth the architecture change
- You depend on Tika’s MIME detection across that long tail
A common middle ground: keep Tika for the long tail, use Office Oxide for .docx / .xlsx / .pptx / .doc / .xls / .ppt (which dominate volume in most enterprise corpora).
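The middle ground described above amounts to routing by file extension. A minimal sketch, assuming you wrap each library in a function that takes a path and returns text; `pick_backend`, `extract_text`, and the two extractor parameters are hypothetical names for illustration, not part of either library:

```python
from pathlib import Path
from typing import Callable

# The six extensions this guide lists as Office Oxide's formats.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".doc", ".xls", ".ppt"}

# Hypothetical extractor signature: path in, extracted text out.
Extractor = Callable[[str], str]


def pick_backend(path: str) -> str:
    """Return "office_oxide" for the six Office extensions, else "tika"."""
    suffix = Path(path).suffix.lower()
    return "office_oxide" if suffix in OFFICE_EXTENSIONS else "tika"


def extract_text(path: str, office: Extractor, tika: Extractor) -> str:
    """Dispatch to whichever extractor owns this file's extension."""
    return office(path) if pick_backend(path) == "office_oxide" else tika(path)


if __name__ == "__main__":
    # Stand-in extractors so the sketch runs without either library installed.
    fake_office = lambda p: f"[office_oxide] {p}"
    fake_tika = lambda p: f"[tika] {p}"
    for name in ("report.DOCX", "budget.xls", "paper.pdf"):
        print(extract_text(name, fake_office, fake_tika))
```

In a real pipeline, `office` would wrap `Document.open(path)` and `plain_text()`, and `tika` would wrap `parser.from_file(path)["content"]`. Routing on the extension alone trusts file names; if yours are unreliable, Tika's MIME detection can stay in front of the router.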
Install
Python
pip install office-oxide
(Replaces tika or apache-tika Python wrappers, plus the JVM you were running them against.)
Rust
[dependencies]
office_oxide = "0.1.0"
Java
If you’re committed to the JVM, call Office Oxide through its C FFI using JNA or JNR-FFI. Alternatively, run office_oxide_cli as a sidecar process over stdio: same engine, no JVM bridge code.
Side-by-side cheat sheet — Python
Plain text
Tika (REST mode)
import tika
from tika import parser
tika.initVM() # JVM startup; ~1-2s on first call
parsed = parser.from_file("report.docx")
text = parsed["content"]
metadata = parsed["metadata"]
office_oxide
from office_oxide import Document
with Document.open("report.docx") as doc:
    text = doc.plain_text()
    props = doc.as_docx().core_properties()  # author, modified, etc.
No JVM startup, no REST round-trip, sub-millisecond extraction.
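Latency claims are worth checking against your own files. The following is a rough timing harness, not a rigorous benchmark; `time_per_file` is a hypothetical helper that accepts any callable, so the same function can time a Tika wrapper and an office_oxide wrapper over the same paths:

```python
import statistics
import time
from typing import Callable, Iterable


def time_per_file(extract: Callable[[str], object],
                  paths: Iterable[str],
                  repeats: int = 5) -> dict:
    """Call extract(path) `repeats` times per path; summarize latency in ms."""
    samples_ms = []
    for path in paths:
        for _ in range(repeats):
            start = time.perf_counter()
            extract(path)
            samples_ms.append((time.perf_counter() - start) * 1000.0)
    if not samples_ms:
        raise ValueError("no paths to time")
    return {
        "calls": len(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
    }


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a real extractor wrapper.
    print(time_per_file(lambda p: p.upper(), ["a.docx", "b.docx"]))
```

The median is reported alongside the maximum because the first call often pays one-off costs (imports, file cache misses) that would distort a mean.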
Bytes input (no temp file)
Tika
import io, requests
from tika import parser
data = requests.get(url).content
parsed = parser.from_buffer(io.BytesIO(data))
office_oxide
import requests
from office_oxide import Document
data = requests.get(url).content
with Document.from_bytes(data, "docx") as doc:
    print(doc.plain_text())
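`Document.from_bytes` takes a format hint alongside the bytes ("docx" above). When a download arrives without a trustworthy file name, one way to pick the hint is to sniff the container. The sketch below uses only the standard library and relies on two facts: OOXML files are ZIP archives, and legacy DOC/XLS/PPT files are OLE2 compound files. It assumes the conventional main-part names inside the ZIP, and assumes "xlsx" and "pptx" are valid hints by analogy with "docx". `sniff_office_format` is a hypothetical helper, and its "ole2" result is only a label, since telling DOC, XLS, and PPT apart requires reading the compound file's directory:

```python
import io
import zipfile
from typing import Optional

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # compound file header (legacy Office)
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header (OOXML)

# Conventional main part of each OOXML package, mapped to an assumed hint string.
OOXML_MAIN_PARTS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}


def sniff_office_format(data: bytes) -> Optional[str]:
    """Return "docx", "xlsx", "pptx", "ole2", or None if unrecognized."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = set(zf.namelist())
        except zipfile.BadZipFile:
            return None  # truncated or corrupt archive
        for part, hint in OOXML_MAIN_PARTS.items():
            if part in names:
                return hint
    return None
```

A `None` result is a reasonable signal to hand the bytes to Tika instead.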
Server / batch processing
Tika — usually run in tika-server mode behind HTTP.
java -jar tika-server.jar -h 0.0.0.0 -p 9998
import requests
with open("report.docx", "rb") as f:  # close the file handle once the upload finishes
    text = requests.put("http://localhost:9998/tika", data=f,
                        headers={"Accept": "text/plain"}).text
office_oxide — drop the JVM and the server, just call the library directly. If you need a sidecar architecture (language-agnostic clients), use the MCP server or the CLI over stdio.
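For the CLI-over-stdio variant, the client side is a subprocess call that writes the document to the child's stdin and reads the result from its stdout. This guide does not spell out the CLI's arguments, so the sketch takes the command as a parameter and demonstrates the plumbing with a stand-in child process; `run_sidecar` and `DEMO_COMMAND` are hypothetical names:

```python
import subprocess
import sys
from typing import Sequence


def run_sidecar(command: Sequence[str], payload: bytes,
                timeout_s: float = 30.0) -> bytes:
    """Send payload to the child's stdin; return what it writes to stdout."""
    result = subprocess.run(
        list(command),
        input=payload,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout_s,  # kill a hung child instead of blocking the worker
        check=True,         # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Stand-in child that upper-cases stdin; replace with the real CLI invocation.
DEMO_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())",
]

if __name__ == "__main__":
    print(run_sidecar(DEMO_COMMAND, b"hello sidecar"))
```

Spawning one process per document keeps crashes isolated but pays process startup on every call; a long-lived sidecar would need a framing protocol on top of this.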
Side-by-side — JVM users
If your pipeline is Java/Kotlin/Scala and you don’t want to drop the JVM:
- Keep Tika for everything that isn’t Office.
- Call office-oxide for Office formats. Two options:
  - JNA / JNR-FFI against liboffice_oxide and the C header at include/office_oxide_c/office_oxide.h. Same C API used by the Go and C# bindings.
  - office_oxide_cli sidecar via ProcessBuilder. Stream input over stdin, read output over stdout. Trivially restartable, isolates crashes.
Either is faster than running Tika for Office formats — and avoids JVM-on-JVM weirdness.
What you get vs. Tika
| | Tika | Office Oxide |
|---|---|---|
| DOCX, XLSX, PPTX | ✓ | ✓ |
| Legacy DOC, XLS, PPT | ✓ | ✓ |
| PDF, EPUB, RTF, ODT, etc. | ✓ | ✗ (use pdf_oxide for PDF) |
| Plain text extraction | ✓ | ✓ |
| Markdown output | partial | ✓ (built-in to_markdown) |
| Structured IR / JSON | XHTML SAX events | ✓ (typed DocumentIR) |
| Find-and-replace templating | ✗ | ✓ (EditableDocument) |
| Cell writes (XLSX) | ✗ | ✓ (set_cell) |
| Legacy → modern conversion | ✗ | ✓ (save_as) |
| JVM required | ✓ | ✗ |
| Native speed | JVM overhead | <1 ms per file |
Performance (Office formats only)
A million-document Office ingestion (mix of DOCX, XLSX, PPTX, DOC, XLS, PPT) measured against tika-server:
| Pipeline | Wall time | Notes |
|---|---|---|
| tika-server (REST), 8 workers | ~3 h 40 m | Includes HTTP overhead |
| tika-app (in-process JVM), 8 workers | ~1 h 50 m | Best-case Tika |
| office_oxide, 8 workers | ~3 m | Native parsing |
Numbers vary with the format mix; for ingestion-heavy Office workloads the gap is usually 30–60×.
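The ratios implied by the table can be worked out directly from its wall times. A small arithmetic sketch, using the approximate figures exactly as listed (so the results are only as precise as those "~" values):

```python
# Wall times from the table above, in minutes.
WALL_MINUTES = {
    "tika-server (REST)": 3 * 60 + 40,     # ~3 h 40 m
    "tika-app (in-process)": 1 * 60 + 50,  # ~1 h 50 m
    "office_oxide": 3,                     # ~3 m
}


def speedup_over(baseline: str, candidate: str = "office_oxide") -> float:
    """Ratio of the baseline's wall time to the candidate's."""
    return WALL_MINUTES[baseline] / WALL_MINUTES[candidate]


if __name__ == "__main__":
    for name in ("tika-server (REST)", "tika-app (in-process)"):
        print(f"{name}: {speedup_over(name):.0f}x")
```

For this particular run that is roughly 73× against tika-server and 37× against in-process Tika, so the REST figure sits above the typical range quoted here.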
See also
- Performance benchmarks — full per-format numbers
- Office for RAG — Tika-replacement RAG patterns
- MCP server — sidecar for cross-language pipelines