SumTablets
A Transliteration Dataset of Sumerian Tablets
ACL 2024 · ML4AL Workshop
overview
SumTablets turns Sumerian transliteration into a supervised sequence-to-sequence task over Unicode glyphs, with open data and reproducible baselines.
The dataset pairs source glyph sequences with transliterations for 91,606 tablets (about 6.97M glyphs), preserving structural context via special tokens for surfaces, line breaks, rulings, columns, blank space, and breakage.
On a stratified 90/5/5 split by historical period, a weighted dictionary sampler reaches 61.22 chrF, while an XLM-R-initialized encoder-decoder reaches 97.54 chrF. The objective is practical: speed philologist review, target uncertain readings, and make downstream restoration/translation pipelines tractable.
data construction
The bottleneck was not model choice. It was producing a stable, machine-usable pairing of glyph strings with transliteration strings that preserves tablet structure and Assyriological conventions.
Input glyphs 𒀀𒂗𒆤 𒈦𒆳𒆳𒊏 𒀊𒁀𒀀𒀀𒌷𒉈𒆤 ...
Output transliteration {d}en-lil2 lugal kur-kur-ra ab-ba dingir-dingir-re2-ne-ke4 ...
Step 02 / Reading → Glyph Mapping
Recover source glyphs from transliterated readings
Step 03 / Structural Fidelity
Encode tablet layout as aligned special tokens
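The structural-fidelity step can be sketched as follows. This is a minimal illustration, not the dataset's actual pipeline: the special-token names (`<LINE>`, etc.) are hypothetical stand-ins for the paper's token inventory, and the glyph/reading pairs are taken from the example above.

```python
# Sketch of Step 03: tablet layout encoded as special tokens that occur at
# aligned positions in both the glyph and the transliteration sequence.
# Token names are illustrative, not the dataset's real inventory.
SPECIAL = {
    "surface": "<SURFACE>",
    "line": "<LINE>",
    "ruling": "<RULING>",
    "column": "<COLUMN>",
    "blank": "<BLANK>",
    "broken": "<BROKEN>",
}

def encode_tablet(lines):
    """Flatten a list of (glyphs, transliteration) physical lines into one
    aligned sequence pair, inserting a line-break token after each line."""
    glyph_seq, translit_seq = [], []
    for glyphs, translit in lines:
        glyph_seq.append(glyphs)
        translit_seq.append(translit)
        # The same structural token lands at the same position on both sides,
        # so the model can learn to copy layout through unchanged.
        glyph_seq.append(SPECIAL["line"])
        translit_seq.append(SPECIAL["line"])
    return " ".join(glyph_seq), " ".join(translit_seq)

src, tgt = encode_tablet([
    ("𒀀𒂗𒆤", "{d}en-lil2"),
    ("𒈦𒆳𒆳𒊏", "lugal kur-kur-ra"),
])
```

Because layout tokens are shared between the two vocabularies, surfaces, rulings, and breakage survive the seq2seq round trip instead of being flattened away.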
modeling approach
Two baselines establish the floor and ceiling: a weighted reading sampler and a multilingual transformer encoder-decoder with task-specific tokenization.
Baseline A / Non-neural
Weighted dictionary sampler
For each glyph, a reading is sampled from its known readings in proportion to observed frequency; the weighted mean number of readings per glyph is 6.75.
This baseline establishes how far frequency-only disambiguation can go without context modeling.
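A minimal sketch of this baseline, assuming per-glyph reading tables keyed by attestation counts. The reading tables and counts below are toy values for illustration, not the dataset's real frequencies.

```python
import random

# Non-neural baseline sketch: for each glyph, sample a reading in proportion
# to how often that reading is attested. Counts here are hypothetical.
READINGS = {
    "𒆳": {"kur": 95, "lad": 5},
    "𒆤": {"ke4": 70, "le": 20, "gi": 10},
}

def sample_transliteration(glyphs, readings, rng):
    """Map each glyph independently to a frequency-weighted random reading."""
    out = []
    for g in glyphs:
        table = readings.get(g)
        if table is None:
            out.append("<UNK>")  # glyph with no known readings
            continue
        options, weights = zip(*table.items())
        out.append(rng.choices(options, weights=weights, k=1)[0])
    return "-".join(out)

rng = random.Random(0)  # seeded for reproducibility
result = sample_transliteration("𒆳𒆳", READINGS, rng)
```

Because each glyph is resolved independently, the sampler captures frequency priors but no context, which is exactly the gap the neural baseline is meant to close.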
Baseline B / Neural
XLM-R initialized encoder-decoder
Both the encoder and the decoder are initialized from a 279M-parameter XLM-R checkpoint, then adapted for glyph-to-transliteration generation.
Custom SentencePiece vocabularies: 632 glyph tokens and 1024 transliteration tokens, each including 11 shared special tokens.
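Since each cuneiform sign is a single Unicode codepoint, the glyph-side vocabulary is close to character-level: the distinct signs in the corpus plus a shared block of special tokens. The sketch below is a simplified stand-in for the actual SentencePiece training; the special-token names are hypothetical.

```python
from collections import Counter

# Illustrative special-token block (11 tokens, matching the shared count in
# the text); the actual token names in the dataset may differ.
SPECIAL_TOKENS = ["<pad>", "<s>", "</s>", "<unk>", "<mask>", "<SURFACE>",
                  "<LINE>", "<RULING>", "<COLUMN>", "<BLANK>", "<BROKEN>"]

def build_glyph_vocab(corpus, max_size):
    """Assemble a glyph vocabulary: special tokens first, then the most
    frequent signs, truncated to max_size entries overall."""
    counts = Counter(g for text in corpus for g in text if not g.isspace())
    keep = max_size - len(SPECIAL_TOKENS)
    signs = [g for g, _ in counts.most_common(keep)]
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + signs)}

vocab = build_glyph_vocab(["𒀀𒂗𒆤 𒈦𒆳𒆳𒊏"], max_size=632)
```

On the real corpus the same recipe, run through SentencePiece, yields the 632-token glyph vocabulary; the transliteration side works analogously over readings rather than signs.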
results
The neural baseline closes most of the transliteration gap, but performance remains uneven across historically and stylistically distinct genres.
| Category | Dictionary chrF | Neural chrF |
|---|---|---|
| Overall | 61.22 | 97.54 |
| Administrative | 63.15 | 98.14 |
| Royal inscription | 54.58 | 95.15 |
| Literary | 37.73 | 90.67 |
| Liturgy | 55.92 | 77.68 |
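To make the metric in the table concrete, here is a minimal chrF (character n-gram F-score) implementation: precision and recall over character n-grams for n = 1..6, combined with beta = 2 and averaged across orders. Whitespace handling and smoothing follow common defaults and may differ in detail from sacreBLEU's reference implementation.

```python
from collections import Counter

def ngrams(text, n):
    """Multiset of character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hyp, ref, max_n=6, beta=2.0):
    """chrF: F-beta over character n-gram precision/recall, averaged over n."""
    scores = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        if not h or not r:
            continue  # string shorter than n: skip this order
        overlap = sum((h & r).values())
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

Because chrF scores partial character overlap rather than exact token matches, it credits near-miss readings (e.g. a wrong sign index on an otherwise correct reading), which suits expert-in-the-loop review.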
Interpretation
Genre imbalance still matters
Operational Value
Useful now as expert-in-the-loop tooling
why this matters
SumTablets is infrastructure: it turns transliteration from a one-off manual act into a reproducible ML problem with public benchmarks and reusable assets.
Research trajectory
From transliteration to restoration and translation
Scholarly unlock
Scale philological workflows without flattening nuance
My contribution
End-to-end technical ownership
artifacts
Every major component is public and reusable.