
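Migrating from python-calamine (and calamine)

calamine is a well-regarded Rust reader for XLSX/XLS, and python-calamine is its Python binding. Both focus solely on spreadsheets.

Office Oxide reads XLSX 2.8x faster than python-calamine (1,802 files, mean 5.0 ms vs. 13.9 ms) and has the highest pass rate (97.8% vs. 96.6%). It also fully supports DOCX, PPTX, and legacy DOC/PPT, formats that calamine cannot read at all.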

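When to migrate

Switch if any of the following applies:

  • You also need .docx / .pptx / .doc / .ppt (calamine supports only XLSX/XLS)
  • You want a broader feature surface: Markdown / HTML output, a structured IR, and templating through EditableDocument
  • Pass rate matters more to you than the slight performance edge calamine has in some scenarios
  • You use the Python bindings and want fewer conversions across the FFI boundary

Stay with calamine if:

  • You only read .xlsx / .xls
  • You depend on calamine-specific APIs (Reader::with_header_row, worksheet_range_at, etc.)
  • You need formula expressions (calamine exposes them; Office Oxide's IR does not)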

Installation (Python)

pip uninstall python-calamine
pip install office-oxide
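
During a staged migration it can help to confirm which of the two packages is importable in the current environment. The following is a minimal sketch using only the standard library; `available_backends` is a hypothetical helper, and the module names `office_oxide` and `python_calamine` are taken from the import statements used elsewhere in this guide.

```python
from importlib.util import find_spec


def available_backends(candidates=("office_oxide", "python_calamine")):
    """Return the candidate module names that can be imported here."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    print("Importable spreadsheet backends:", available_backends())
```

An empty list means neither package is installed in the active interpreter, which usually points to a virtual-environment mix-up rather than a failed install.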

Installation (Rust)

# Cargo.toml
[dependencies]
# Replaces:
#   calamine = "0.30"
office_oxide = "0.1.0"

Side-by-side cheat sheet: Python

Opening a workbook

python-calamine

from python_calamine import CalamineWorkbook

wb = CalamineWorkbook.from_path("budget.xlsx")

office_oxide

from office_oxide import Document

with Document.open("budget.xlsx") as doc:
    ...

Iterating over worksheets

python-calamine

for name in wb.sheet_names:
    sheet = wb.get_sheet_by_name(name)
    for row in sheet.to_python():
        print(row)

office_oxide

with Document.open("budget.xlsx") as doc:
    ir = doc.to_ir()

for section in ir["sections"]:
    print(f"# {section.get('title')}")
    for el in section["elements"]:
        if el["kind"] == "Table":
            for row in el["rows"]:
                print(row)
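
The IR walk above can be wrapped in a small helper that groups table rows by worksheet. This is a minimal sketch, not part of office_oxide: `tables_by_sheet` and `sample_ir` are hypothetical names, and the dictionary layout (`sections`, `title`, `elements`, `kind`, `rows`) is assumed to match the keys used in the loop above. A real `doc.to_ir()` result may carry additional keys.

```python
from collections import defaultdict


def tables_by_sheet(ir):
    """Map each section title to the row lists of its Table elements."""
    grouped = defaultdict(list)
    for section in ir.get("sections", []):
        # Sections without a title are grouped under a placeholder key.
        title = section.get("title") or "(untitled)"
        for element in section.get("elements", []):
            if element.get("kind") == "Table":
                grouped[title].append(element["rows"])
    return dict(grouped)


# Hand-built stand-in for doc.to_ir(); the shape is an assumption.
sample_ir = {
    "sections": [
        {"title": "Q3", "elements": [{"kind": "Table", "rows": [["a", "1"]]}]},
        {"title": "Q4", "elements": [{"kind": "Table", "rows": [["b", "2"]]}]},
    ]
}

if __name__ == "__main__":
    for title, tables in tables_by_sheet(sample_ir).items():
        print(title, tables)
```

Because the helper only touches plain dictionaries, it can be unit-tested without opening a workbook.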

Reading a single worksheet as rows

python-calamine

sheet = wb.get_sheet_by_name("Q4")
rows = sheet.to_python()

office_oxide

with Document.open("budget.xlsx") as doc:
    table = next(
        el for section in doc.to_ir()["sections"]
        if section.get("title") == "Q4"
        for el in section["elements"] if el["kind"] == "Table"
    )
    rows = table["rows"]

A more direct way:

with Document.open("budget.xlsx") as doc:
    sheet = doc.as_xlsx().sheet("Q4")
    rows = sheet.rows()    # list[list[str]]
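
Code that relied on calamine's header-row handling often wants dictionaries keyed by column name rather than bare lists. The sketch below shows one way to get there from a `list[list[str]]` such as the `rows` value above. `rows_to_records` is a hypothetical helper, and it assumes the first row holds the column headers.

```python
def rows_to_records(rows):
    """Turn [[header...], [values...], ...] into a list of dicts."""
    if not rows:
        return []
    header, *body = rows
    records = []
    for row in body:
        # Pad short rows with empty strings; zip drops cells beyond the header.
        padded = list(row) + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records


if __name__ == "__main__":
    demo_rows = [["Item", "Amount"], ["Rent", "1200"], ["Power"]]
    for record in rows_to_records(demo_rows):
        print(record)
```

Values stay as strings here, so any numeric conversion is left to the caller.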

Worksheet names

python-calamine

print(wb.sheet_names)

office_oxide

with Document.open("budget.xlsx") as doc:
    print([s.name() for s in doc.as_xlsx().sheets()])

Side-by-side cheat sheet: Rust

Opening and iterating

calamine

use calamine::{open_workbook, Xlsx, Reader};

let mut wb: Xlsx<_> = open_workbook("budget.xlsx")?;
for sheet_name in wb.sheet_names() {
    if let Ok(range) = wb.worksheet_range(&sheet_name) {
        for row in range.rows() {
            println!("{row:?}");
        }
    }
}

office_oxide

use office_oxide::Document;

let doc = Document::open("budget.xlsx")?;
if let Some(xlsx) = doc.as_xlsx() {
    for sheet in xlsx.sheets() {
        for cell in sheet.cells() {
            println!("{}: {:?}", cell.address(), cell.value());
        }
    }
}

Format-agnostic IR (no calamine equivalent)

let doc = Document::open("budget.xlsx")?;
let ir = doc.to_ir();
serde_json::to_writer(std::io::stdout(), &ir)?;

This is the same shape you get from .docx / .pptx, which is useful when downstream consumers do not need to care about the source format.
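
To illustrate why a shared shape is convenient, here is a minimal Python sketch of a consumer that exports every table in an IR to CSV without knowing which file format produced it. `ir_tables_to_csv` and `sample_ir` are hypothetical names, and the `"Paragraph"` kind is illustrative only. The sketch assumes the IR is a dictionary with the `sections` / `elements` / `kind` / `rows` keys used in the Python examples earlier, and that each row is a list of plain strings.

```python
import csv
import io


def ir_tables_to_csv(ir):
    """Serialize the rows of every Table element in the IR as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for section in ir.get("sections", []):
        for element in section.get("elements", []):
            # Non-table elements are skipped regardless of source format.
            if element.get("kind") == "Table":
                writer.writerows(element["rows"])
    return buffer.getvalue()


# Hand-built stand-in for doc.to_ir(); the shape is an assumption.
sample_ir = {
    "sections": [
        {
            "title": "Q4",
            "elements": [
                {"kind": "Paragraph", "text": "Notes"},
                {"kind": "Table", "rows": [["Item", "Amount"], ["Rent", "1200"]]},
            ],
        }
    ]
}

if __name__ == "__main__":
    print(ir_tables_to_csv(sample_ir), end="")
```

The same function would accept an IR produced from a .docx or .pptx file, provided the table elements follow the assumed layout.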

Writing XLSX

calamine is read-only. Office Oxide writes XLSX cells through EditableDocument:

from office_oxide import EditableDocument

with EditableDocument.open("budget.xlsx") as ed:
    ed.set_cell(0, "B5", 42_000)
    ed.save("budget.xlsx")

To build an XLSX file from scratch, drop down to xlsx::create::XlsxBuilder, or use umya-spreadsheet / rust_xlsxwriter.

Performance

XLSX              Mean      p99      Pass rate
office_oxide      5.0 ms    40 ms    97.8%
python-calamine   13.9 ms   183 ms   96.6%
openpyxl          94.5 ms   698 ms   96.2%

XLS               Mean      p99      Pass rate
office_oxide      2.8 ms    75 ms    99.2%
python-calamine   9.0 ms    96 ms    90.7%
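
The speedup figures follow directly from the mean timings in the tables. A short sketch of the arithmetic, using only the numbers reported above (`mean_ms` and `speedup` are illustrative names):

```python
# Mean per-file read times in milliseconds, copied from the tables above.
mean_ms = {
    "xlsx": {"office_oxide": 5.0, "python-calamine": 13.9, "openpyxl": 94.5},
    "xls": {"office_oxide": 2.8, "python-calamine": 9.0},
}


def speedup(fmt, baseline, candidate="office_oxide"):
    """Ratio of the baseline's mean time to the candidate's mean time."""
    return mean_ms[fmt][baseline] / mean_ms[fmt][candidate]


if __name__ == "__main__":
    for fmt, timings in mean_ms.items():
        for library in timings:
            if library != "office_oxide":
                print(f"{fmt}: {speedup(fmt, library):.1f}x vs {library}")
    # xlsx: 2.8x vs python-calamine
    # xlsx: 18.9x vs openpyxl
    # xls: 3.2x vs python-calamine
```

These ratios compare means only; the p99 column shows that tail latency differs by a different factor.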

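Differences

calamine returns a typed Data enum for every cell (Int, Float, String, Bool, DateTime, Empty, Error). Office Oxide's IR collapses values to strings; for typed cell access, use the format-specific accessors: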

with Document.open("budget.xlsx") as doc:
    for sheet in doc.as_xlsx().sheets():
        for cell in sheet.cells():
            print(cell.value(), cell.value_type())   # value_type: "string" | "number" | "boolean" | "empty"
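
If downstream code expects native Python values, the value and type tag can be combined in a small conversion step. The following is a sketch under assumptions: `to_python` is a hypothetical helper, the four type tags are the ones listed in the comment above, and the function accepts either strings or already-typed values because the exact return type of `cell.value()` is not specified here.

```python
def to_python(value, value_type):
    """Convert a (value, value_type) pair into a native Python value."""
    if value_type == "empty" or value is None:
        return None
    if value_type == "number":
        number = float(value)
        # Whole numbers become int, loosely mirroring an Int/Float split.
        return int(number) if number.is_integer() else number
    if value_type == "boolean":
        if isinstance(value, bool):
            return value
        return str(value).strip().lower() in ("true", "1")
    # "string" and any unrecognized tag fall back to text.
    return str(value)


if __name__ == "__main__":
    samples = [("42000", "number"), ("3.5", "number"), ("TRUE", "boolean"),
               ("", "empty"), ("Rent", "string")]
    for raw, tag in samples:
        print(repr(raw), tag, "->", repr(to_python(raw, tag)))
```

Dates are not covered: the type tags above do not include a date kind, so date cells would need separate handling.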
