
Migrate from Apache Tika

Apache Tika is the de facto standard JVM library for extracting text from a huge variety of file formats, including DOCX, XLSX, PPTX, and the legacy DOC, XLS, and PPT. If your pipeline handles only Office documents, Office Oxide is the right replacement: the same six formats, no JVM, native speed, and simpler deployment.

When to migrate

Switch if any of these apply:

  • Your ingestion pipeline only deals with Office documents (you don’t need PDF, EPUB, RTF, ODT, etc. that Tika also handles)
  • You don’t want to ship and tune a JVM in your container / Lambda / desktop app
  • You want native bindings in Python, Node.js, Go, C#, or Rust — not just a Java JAR
  • Per-file latency matters; Tika’s startup cost and JVM warmup hurt short-lived workers
  • You want structured Markdown / IR output for LLM and RAG pipelines

Stay on Tika if:

  • You ingest a long tail of formats Office Oxide doesn’t cover (Tika handles ~1,400 file types)
  • You already have a JVM ingestion service and adding native bindings isn’t worth the architecture change
  • You depend on Tika’s MIME detection across that long tail

A common middle ground: keep Tika for the long tail, use Office Oxide for .docx / .xlsx / .pptx / .doc / .xls / .ppt (which dominate volume in most enterprise corpora).
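
One way to picture this split is a small extension-based router. The sketch below is only an illustration and not part of either library: `pick_backend` and the two backend labels are hypothetical names, and the extension set is simply the six formats listed above.

```python
from pathlib import Path

# The six Office extensions named in this guide.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".doc", ".xls", ".ppt"}


def pick_backend(path: str) -> str:
    """Return "office_oxide" for Office files and "tika" for everything else.

    Both labels are hypothetical; map them to whatever extraction
    functions your pipeline actually uses.
    """
    suffix = Path(path).suffix.lower()  # case-insensitive match
    return "office_oxide" if suffix in OFFICE_EXTENSIONS else "tika"


if __name__ == "__main__":
    for name in ["report.DOCX", "budget.xls", "paper.pdf", "book.epub"]:
        print(f"{name} -> {pick_backend(name)}")
```

Routing on the extension keeps the decision cheap. If file names are unreliable in your corpus, Tika's MIME detection can still make the call before dispatch.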

Install

Python

pip install office-oxide

(Replaces tika or apache-tika Python wrappers, plus the JVM you were running them against.)

Rust

[dependencies]
office_oxide = "0.1.0"

Java

If you’re committed to JVM, use Office Oxide via its C FFI plus JNA / JNR-FFI. Or run office_oxide_cli as a side-car process called over stdio — same engine, no JVM bridge code.

Side-by-side cheat sheet — Python

Plain text

Tika (REST mode)

import tika
from tika import parser

tika.initVM()    # JVM startup; ~1-2s on first call
parsed = parser.from_file("report.docx")
text = parsed["content"]
metadata = parsed["metadata"]

office_oxide

from office_oxide import Document

with Document.open("report.docx") as doc:
    text = doc.plain_text()
    props = doc.as_docx().core_properties()    # author, modified, etc.

No JVM startup, no REST round-trip, sub-millisecond extraction.

Bytes input (no temp file)

Tika

import io, requests
from tika import parser

data = requests.get(url).content
parsed = parser.from_buffer(io.BytesIO(data))

office_oxide

import requests
from office_oxide import Document

data = requests.get(url).content
with Document.from_bytes(data, "docx") as doc:
    print(doc.plain_text())
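
Document.from_bytes takes a format hint, and downloaded bytes do not always come with a trustworthy extension. The sketch below shows one possible way to derive a hint from the content itself, using only the standard library. `sniff_office_format` is a hypothetical helper, not an office_oxide function. It relies on the conventional main-part names inside OOXML packages, and the "ole2" return value is a label chosen for this example rather than a value office_oxide is known to accept. Telling legacy DOC, XLS, and PPT apart needs other information, such as the URL's extension.

```python
import io
import zipfile
from typing import Optional

ZIP_MAGIC = b"PK\x03\x04"                          # OOXML files are ZIP archives
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # legacy DOC/XLS/PPT container

# Conventional main-part names inside an OOXML package (an assumption:
# unusual producers may name these parts differently).
OOXML_MARKERS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}


def sniff_office_format(data: bytes) -> Optional[str]:
    """Guess "docx", "xlsx", "pptx", or "ole2" from raw bytes, else None."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"  # legacy container; the exact format is not determined here
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = set(zf.namelist())
        except zipfile.BadZipFile:
            return None  # truncated or corrupt archive
        for marker, fmt in OOXML_MARKERS.items():
            if marker in names:
                return fmt
    return None
```

A None result is a reasonable signal to hand the payload to Tika instead of guessing.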

Server / batch processing

Tika — usually run in tika-server mode behind HTTP.

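java -jar tika-server.jar -h 0.0.0.0 -p 9998

Then, from Python:

import requests

# Open the file in a with-block so the handle is closed after the upload.
with open("report.docx", "rb") as f:
    text = requests.put("http://localhost:9998/tika",
                        data=f,
                        headers={"Accept": "text/plain"}).text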

office_oxide — drop the JVM and the server, just call the library directly. If you need a sidecar architecture (language-agnostic clients), use the MCP server or the CLI over stdio.
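
For the CLI-over-stdio option, the calling side can be as small as one subprocess call. The sketch below is a generic stdin/stdout wrapper and nothing more. This guide does not show the arguments office_oxide_cli expects, so the command line is left as a parameter and `run_sidecar` is a hypothetical helper name. The demo uses a stand-in child process so the example runs without the CLI installed.

```python
import subprocess
import sys
from typing import Sequence


def run_sidecar(argv: Sequence[str], payload: bytes, timeout: float = 30.0) -> bytes:
    """Send payload to a child process on stdin and return its stdout bytes."""
    result = subprocess.run(
        list(argv),
        input=payload,            # written to the child's stdin, then closed
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,          # raises TimeoutExpired if the child hangs
        check=True,               # raises CalledProcessError on non-zero exit
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in child that upper-cases stdin. Replace with the real CLI
    # invocation once you know its arguments.
    echo_upper = [
        sys.executable, "-c",
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())",
    ]
    print(run_sidecar(echo_upper, b"hello"))
```

Starting one process per document keeps crashes isolated, at the price of process startup on every call. A long-lived sidecar would need a framing protocol on top of this.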

Side-by-side — JVM users

If your pipeline is Java/Kotlin/Scala and you don’t want to drop the JVM:

  • Keep Tika for everything that isn’t Office.
  • Call office-oxide for Office formats. Two options:
    • JNA / JNR-FFI against liboffice_oxide and the C header at include/office_oxide_c/office_oxide.h. Same C API used by the Go and C# bindings.
    • office_oxide_cli sidecar via ProcessBuilder. Stream input over stdin, read output over stdout. Trivially restartable, isolates crashes.

Either is faster than running Tika for Office formats — and avoids JVM-on-JVM weirdness.

What you get vs. Tika

| Feature | Tika | Office Oxide |
| --- | --- | --- |
| DOCX, XLSX, PPTX | ✓ | ✓ |
| Legacy DOC, XLS, PPT | ✓ | ✓ |
| PDF, EPUB, RTF, ODT, etc. | ✓ | ✗ (use pdf_oxide for PDF) |
| Plain text extraction | ✓ | ✓ |
| Markdown output | partial | ✓ (built-in to_markdown) |
| Structured IR / JSON | XHTML SAX events | ✓ (typed DocumentIR) |
| Find-and-replace templating | ✗ | ✓ (EditableDocument) |
| Cell writes (XLSX) | ✗ | ✓ (set_cell) |
| Legacy → modern conversion | ✗ | ✓ (save_as) |
| JVM required | ✓ | ✗ |
| Native speed | JVM overhead | <1 ms per file |

Performance (Office formats only)

A million-document Office ingestion (mix of DOCX, XLSX, PPTX, DOC, XLS, PPT), measured against both tika-server and tika-app:

| Pipeline | Wall time | Notes |
| --- | --- | --- |
| tika-server (REST), 8 workers | ~3 h 40 m | Includes HTTP overhead |
| tika-app (in-process JVM), 8 workers | ~1 h 50 m | Best-case Tika |
| office_oxide, 8 workers | ~3 m | Native parsing |

Numbers vary with the format mix; for ingestion-heavy Office workloads the gap is usually 30–60×.
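
As a quick check on the table above, the ratios can be worked out directly from the listed wall times. The times are approximate, so the results are rough: about 37× against tika-app and about 73× against tika-server. The REST figure is higher because it includes HTTP overhead on top of parsing. The helper names below are illustrative only.

```python
def minutes(hours: int, mins: int) -> int:
    """Convert an hours-and-minutes wall time to total minutes."""
    return hours * 60 + mins


# Approximate wall times from the table, in minutes.
WALL_TIMES = {
    "tika-server (REST)": minutes(3, 40),         # 220
    "tika-app (in-process JVM)": minutes(1, 50),  # 110
    "office_oxide": minutes(0, 3),                # 3
}


def speedup(baseline: str, candidate: str = "office_oxide") -> float:
    """How many times faster the candidate finished than the baseline."""
    return WALL_TIMES[baseline] / WALL_TIMES[candidate]


if __name__ == "__main__":
    for name in ("tika-server (REST)", "tika-app (in-process JVM)"):
        print(f"{name}: {speedup(name):.1f}x")
```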

See also