Working with bytes and streams

Office Oxide accepts raw bytes as a first-class input. There is no need to write a temporary file before parsing, which is convenient for serverless handlers, multipart uploads, S3 objects, and database BLOBs.

From an HTTP response

Python

import requests
from office_oxide import Document

resp = requests.get("https://example.com/report.docx")
with Document.from_bytes(resp.content, "docx") as doc:
    print(doc.to_markdown())

JavaScript

import { Document } from 'office-oxide';

const res = await fetch('https://example.com/report.docx');
const data = new Uint8Array(await res.arrayBuffer());
using doc = Document.fromBytes(data, 'docx');
console.log(doc.toMarkdown());

Rust (reqwest)

use std::io::Cursor;
use office_oxide::{Document, DocumentFormat};

let bytes = reqwest::blocking::get(url)?.bytes()?;
let doc = Document::from_reader(Cursor::new(bytes.to_vec()), DocumentFormat::Docx)?;

Go

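// Errors are ignored here for brevity; check each one in real code,
// because resp is nil when http.Get fails.
resp, _ := http.Get(url)
defer resp.Body.Close()
data, _ := io.ReadAll(resp.Body)
doc, _ := officeoxide.OpenFromBytes(data, "docx")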

C#

using var http = new HttpClient();
byte[] data = await http.GetByteArrayAsync(url);
using var doc = Document.FromBytes(data, "docx");

From S3

Python (boto3)

import boto3
from office_oxide import Document

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="bucket", Key="reports/q4.xlsx")
data = obj["Body"].read()
with Document.from_bytes(data, "xlsx") as doc:
    print(doc.to_markdown())

Rust (aws-sdk-s3)

use aws_sdk_s3::Client;
use std::io::Cursor;
use office_oxide::{Document, DocumentFormat};

let client = Client::new(&aws_config::load_from_env().await);
let obj = client.get_object().bucket("bucket").key("reports/q4.xlsx").send().await?;
let bytes = obj.body.collect().await?.into_bytes();
let doc = Document::from_reader(Cursor::new(bytes.to_vec()), DocumentFormat::Xlsx)?;

From a multipart upload (web framework)

Python (FastAPI)

from fastapi import FastAPI, UploadFile
from office_oxide import Document

app = FastAPI()

@app.post("/extract")
async def extract(file: UploadFile):
    data = await file.read()
    fmt = file.filename.rsplit(".", 1)[-1].lower()
    with Document.from_bytes(data, fmt) as doc:
        return {"markdown": doc.to_markdown()}

JavaScript (Hono / Express)

import { Hono } from 'hono';
import { Document } from 'office-oxide';

const app = new Hono();

app.post('/extract', async (c) => {
  const body = await c.req.parseBody();
  const file = body.file;        // File
  const data = new Uint8Array(await file.arrayBuffer());
  const fmt = file.name.split('.').pop().toLowerCase();
  using doc = Document.fromBytes(data, fmt);
  return c.json({ markdown: doc.toMarkdown() });
});
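
Both handlers above pass the file extension straight to the parser, so an upload named notes.txt or one with no extension reaches from_bytes with an unsupported format string. A minimal sketch of a guard, assuming you prefer to reject such uploads before parsing; format_from_filename and ACCEPTED_FORMATS are hypothetical names, not part of office_oxide:

```python
# Hypothetical helper, not part of office_oxide.
# The six format strings that from_bytes accepts, per the format section of this page.
ACCEPTED_FORMATS = frozenset({"docx", "xlsx", "pptx", "doc", "xls", "ppt"})


def format_from_filename(filename):
    """Return the accepted format string for a filename, or None if there is none."""
    if not filename or "." not in filename:
        return None
    # Take the text after the last dot and normalize its case.
    ext = filename.rsplit(".", 1)[-1].lower()
    return ext if ext in ACCEPTED_FORMATS else None
```

In the FastAPI handler, a None result could be turned into an HTTP 400 response instead of calling Document.from_bytes.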

From a database BLOB

Python (SQLAlchemy)

from sqlalchemy import create_engine, text
from office_oxide import Document

engine = create_engine("postgresql://...")
with engine.begin() as conn:
    row = conn.execute(text("SELECT data, mime FROM uploads WHERE id = :id"),
                       {"id": 42}).one()
fmt = {"application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
       "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet":     "xlsx",
       "application/vnd.openxmlformats-officedocument.presentationml.presentation": "pptx"
       }[row.mime]
with Document.from_bytes(row.data, fmt) as doc:
    print(doc.plain_text())
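
The inline dictionary above covers only the three OOXML MIME types and raises KeyError for anything else. A sketch of a more forgiving lookup, assuming the mime column may also hold the legacy Office MIME types or carry parameters such as a charset; format_from_mime and MIME_TO_FORMAT are hypothetical names, not part of office_oxide:

```python
# Hypothetical helper, not part of office_oxide: map a stored MIME type
# to one of the format strings accepted by Document.from_bytes.
MIME_TO_FORMAT = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "pptx",
    "application/msword": "doc",
    "application/vnd.ms-excel": "xls",
    "application/vnd.ms-powerpoint": "ppt",
}


def format_from_mime(mime):
    """Return the format string for a MIME type, or None if it is not recognized."""
    if not mime:
        return None
    # Drop parameters such as "; charset=binary" and normalize case.
    base = mime.split(";", 1)[0].strip().lower()
    return MIME_TO_FORMAT.get(base)
```

Returning None rather than raising lets the caller decide whether to skip the row or log it.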

Saving back to bytes (for upload)

After editing, you can write the result to an in-memory buffer and stream it straight to the client.

Python

from office_oxide import EditableDocument

with EditableDocument.open("template.docx") as ed:
    ed.replace_text("{{name}}", "Alice")
    bytes_out = ed.save_to_bytes()

# Upload bytes_out to S3, return it as an HTTP response, and so on.

JavaScript

using ed = EditableDocument.open('template.docx');
ed.replaceText('{{name}}', 'Alice');
const bytes = ed.saveToBytes();   // Uint8Array
return new Response(bytes, {
  headers: { 'Content-Type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document' },
});

Rust

use office_oxide::edit::EditableDocument;

let mut ed = EditableDocument::open("template.docx")?;
ed.replace_text("{{name}}", "Alice");

let mut buf = std::io::Cursor::new(Vec::new());
ed.write_to(&mut buf)?;
let bytes: Vec<u8> = buf.into_inner();
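
For large documents it can help to hand the saved bytes to the client in pieces rather than as one body. A minimal Python sketch, assuming the bytes come from save_to_bytes as in the Python example above; iter_chunks is a hypothetical helper, not part of office_oxide:

```python
from typing import Iterator


def iter_chunks(data: bytes, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive slices of `data`, each at most `chunk_size` bytes long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]
```

Frameworks that accept an iterator as a response body, such as FastAPI's StreamingResponse, can take iter_chunks(bytes_out) directly.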

Choosing the format string

from_bytes requires you to state the format explicitly. Only these strings are accepted:

"docx" | "xlsx" | "pptx" | "doc" | "xls" | "ppt"

If the format of the source is unknown, call detect_format first:

import office_oxide
from office_oxide import Document

fmt = office_oxide.detect_format("payload.bin")  # → "docx" | None
if fmt:
    with open("payload.bin", "rb") as f:
        data = f.read()
    with Document.from_bytes(data, fmt) as doc:
        ...

The detector reads the magic bytes (ZIP for OOXML, the CFB signature D0 CF 11 E0 for legacy formats) and then quickly inspects the list of package parts to tell .docx, .xlsx, and .pptx apart.
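
The same idea can be sketched for bytes that are already in memory, using only the Python standard library. This is an illustration of the technique and not the library's actual detector: sniff_format is a hypothetical name, the sketch assumes each OOXML package uses its conventional main part name, and it does not try to tell .doc, .xls, and .ppt apart.

```python
import io
import zipfile

CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # full 8-byte CFB signature
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header

# Conventional main part for each OOXML package type (an assumption of this sketch).
OOXML_MAIN_PARTS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}


def sniff_format(data: bytes):
    """Return "docx", "xlsx", "pptx", "cfb", or None for unrecognized input."""
    if data.startswith(CFB_MAGIC):
        # Distinguishing .doc, .xls and .ppt needs the CFB directory,
        # which this sketch does not read.
        return "cfb"
    if not data.startswith(ZIP_MAGIC):
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return None
    for part, fmt in OOXML_MAIN_PARTS.items():
        if part in names:
            return fmt
    return None
```

A "cfb" result is not one of the strings from_bytes accepts, so a caller would still need another signal, such as the file name or stored MIME type, to choose between "doc", "xls", and "ppt".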

See also