Step 01 / Ingest + Normalize
SumTablets
SumTablets turns Sumerian transliteration into a supervised sequence-to-sequence task over Unicode glyphs, with open data and reproducible baselines.
ACL 2024 · ML4AL Workshop
- Corpus size: 91,606 tablets / 6,970,407 glyphs
- Source coverage: readings to glyph 99.93%; glyph to Unicode 99.96%
- Evaluation: character-level chrF, averaged by tablet on held-out test data
- Model delta: weighted dictionary 61.22 → neural baseline 97.54 chrF
The dataset pairs source glyph sequences with transliterations for 91,606 tablets (about 6.97M glyphs), preserving structural context via special tokens for surfaces, line breaks, rulings, columns, blank space, and breakage.
On a stratified 90/5/5 split by historical period, a weighted dictionary sampler reaches 61.22 chrF, while an XLM-R-initialized encoder-decoder reaches 97.54 chrF. The objective is practical: speed philologist review, target uncertain readings, and make downstream restoration/translation pipelines tractable.
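The metric can be illustrated with a minimal sketch of character n-gram F-score, averaged per tablet. It follows the general chrF recipe (average precision and recall over n-gram orders, then an F-beta score) with assumed defaults of 6-gram order and beta = 2. Whitespace is dropped for simplicity. This is not the evaluation code behind the reported numbers, and its scores may differ slightly from a reference implementation.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return 100 * (1 + b2) * p * r / (b2 * p + r)


def mean_chrf_by_tablet(pairs) -> float:
    """Average chrF over (hypothesis, reference) pairs, one pair per tablet."""
    scores = [chrf(hyp, ref) for hyp, ref in pairs]
    if not scores:
        raise ValueError("need at least one tablet")
    return sum(scores) / len(scores)
```

Averaging by tablet means a short receipt and a long literary text contribute equally to the final number, rather than long tablets dominating.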
The first bottleneck was not model choice. It was the lack of a stable, machine-usable pairing of glyph strings with transliteration strings that preserves tablet structure and Assyriological conventions.
Input glyphs 𒀀𒂗𒆤 𒈦𒆳𒆳𒊏 𒀊𒁀𒀀𒀀𒌷𒉈𒆤 ...
Output transliteration {d}en-lil2 lugal kur-kur-ra ab-ba dingir-dingir-re2-ne-ke4 ...
Step 02 / Reading → Glyph Mapping
Recover source glyphs from transliterated readings
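A minimal sketch of this step, assuming a plain lookup table from readings to glyphs. The table below is hypothetical and contains only pairs visible in the example above; the function name and the hyphen-splitting rule are illustrative assumptions, and determinatives such as {d} are left out.

```python
# Hypothetical, tiny reading -> glyph table built from the example line above.
# Note that two different readings can point at the same glyph.
READING_TO_GLYPH = {
    "kur": "𒆳",
    "ra": "𒊏",
    "lil2": "𒆤",
    "ke4": "𒆤",
}


def readings_to_glyphs(word: str, table: dict) -> tuple[str, list[str]]:
    """Map a hyphen-separated transliterated word to a glyph string.

    Returns the glyph string plus any readings missing from the table,
    so gaps are reported instead of silently dropped.
    """
    glyphs, unmapped = [], []
    for reading in word.split("-"):
        glyph = table.get(reading)
        if glyph is None:
            unmapped.append(reading)
        else:
            glyphs.append(glyph)
    return "".join(glyphs), unmapped
```

Because several readings collapse onto one glyph, this direction is a lookup, while the inverse direction (glyph to reading) is the ambiguous one the models have to solve.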
Step 03 / Structural Fidelity
Encode tablet layout as aligned special tokens
[<SURFACE>] [ ] [...] [<RULING>] [<COLUMN>] [<BLANK_SPACE>]
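One way to picture the alignment is a serializer that writes the same structural token at the same position in both the glyph sequence and the transliteration. The token spellings come from the list above; the event format, the function name, and the use of a newline as the line separator are assumptions for illustration, not the dataset's actual serialization.

```python
# Structural tokens as spelled above; the keys are hypothetical event kinds.
STRUCTURAL_TOKENS = {
    "surface": "<SURFACE>",
    "column": "<COLUMN>",
    "ruling": "<RULING>",
    "blank": "<BLANK_SPACE>",
}


def serialize(events):
    """Turn tablet events into aligned (source, target) strings.

    events: list of (kind, glyphs, transliteration) tuples. For "line"
    events the two text fields are used; for structural events they are
    ignored and the same special token is written to both sides.
    """
    source, target = [], []
    for kind, glyphs, transliteration in events:
        if kind == "line":
            source.append(glyphs)
            target.append(transliteration)
        else:
            token = STRUCTURAL_TOKENS[kind]  # KeyError on unknown kinds
            source.append(token)
            target.append(token)
    # Assumption: one event per line, joined with newlines.
    return "\n".join(source), "\n".join(target)
```

With this layout, a model can copy structural tokens straight through and spend its capacity on the readings between them.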
- Periods: 10 labeled periods for temporally aware evaluation and transfer studies.
- Genres: 14 labeled genres exposing style/domain shift across administrative, literary, legal, and ritual texts.
- Partitioning: 90/5/5 train/val/test, stratified by period to reduce period leakage and stabilize genre mix.
- Lexical handling: lexical texts excluded from validation/test and added to training only.
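The partitioning rules above can be sketched as follows. The record keys (`period`, `genre`), the `"lexical"` label, and the rounding of per-period counts are assumptions for illustration; the released splits may have been produced differently.

```python
import random
from collections import defaultdict


def stratified_split(tablets, seed=0, val_frac=0.05, test_frac=0.05):
    """Split tablets into train/val/test, stratified by period.

    tablets: list of dicts with hypothetical keys "period" and "genre".
    Lexical texts are routed to training only.
    """
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    by_period = defaultdict(list)
    for tablet in tablets:
        if tablet["genre"] == "lexical":
            splits["train"].append(tablet)
        else:
            by_period[tablet["period"]].append(tablet)
    # Sample val/test separately inside each period so every period
    # keeps roughly the same 90/5/5 proportions.
    for period in sorted(by_period):
        group = by_period[period]
        rng.shuffle(group)
        n_val = round(len(group) * val_frac)
        n_test = round(len(group) * test_frac)
        splits["val"] += group[:n_val]
        splits["test"] += group[n_val:n_val + n_test]
        splits["train"] += group[n_val + n_test:]
    return splits
```

Fixing the seed keeps the split reproducible, which matters when later baselines are compared against the same held-out tablets.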
Two baselines bracket the task: a weighted reading sampler as the context-free floor, and a multilingual transformer encoder-decoder with task-specific tokenization as the strong reference point.
Baseline A / Non-neural
Weighted dictionary sampler
For each glyph, sample one of its known readings in proportion to that reading's observed frequency. The weighted mean number of readings per glyph is 6.75.
This baseline establishes how far frequency-only disambiguation can go without context modeling.
[lookup sampling] [frequency prior] [no sequence context]
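A minimal sketch of the sampler described above, using Python's `random.choices` for frequency-weighted draws. The counts are invented for illustration, and the `<unk>` fallback for unseen glyphs is an assumption rather than documented behavior.

```python
import random

# Hypothetical glyph -> {reading: count} table; the counts are made up.
READING_COUNTS = {
    "𒆤": {"ke4": 70, "lil2": 30},
    "𒆳": {"kur": 100},
}


def sample_reading(glyph, counts, rng):
    """Draw one reading for a glyph, weighted by observed frequency."""
    readings = counts.get(glyph)
    if not readings:
        return "<unk>"  # assumed fallback for glyphs with no known reading
    names = list(readings)
    weights = [readings[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def transliterate(glyphs, counts, seed=0):
    """Sample a reading for each glyph independently (no sequence context)."""
    rng = random.Random(seed)
    return [sample_reading(glyph, counts, rng) for glyph in glyphs]
```

Each glyph is sampled on its own, so an ambiguous sign gets the same reading distribution regardless of its neighbors. That independence is exactly what the neural baseline removes.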
Baseline B / Neural
XLM-R initialized encoder-decoder
The encoder and decoder are both initialized from a 279M-parameter XLM-R checkpoint and then adapted for glyph-to-transliteration generation.
Custom SentencePiece vocabularies: 632 glyph tokens and 1024 transliteration tokens, each including 11 shared special tokens.
[XLM-R] [SentencePiece] [seq2seq] [beam search]
- Stage 1: encoder MLM pretraining: 50 epochs, seq len 64, LR 5e-5, batch 2048, mask prob 0.10, warmup 200.
- Stage 2: joint model with frozen encoder: LR 1e-4, 2 epochs, batch 128, warmup 100.
- Stage 3: joint model with unfrozen encoder: LR 5e-5, 4 epochs, batch 128, warmup 100.
- Inference + Compute: beam size 5, AdamW optimization, trained on a single A100 SXM 80GB.
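The three-stage schedule above can be written down as plain data, which makes the freeze-then-unfreeze pattern explicit. The class and field names are hypothetical, the values are copied from the list above, and the unit of `warmup` is left as stated in the source (it is not specified there).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    objective: str                 # "mlm" or "seq2seq" (labels are assumptions)
    epochs: int
    learning_rate: float
    batch_size: int
    warmup: int                    # unit not specified in the source
    encoder_frozen: bool = False
    seq_len: Optional[int] = None  # only given for stage 1
    mask_prob: Optional[float] = None


SCHEDULE = (
    Stage("stage1", "mlm", epochs=50, learning_rate=5e-5, batch_size=2048,
          warmup=200, seq_len=64, mask_prob=0.10),
    Stage("stage2", "seq2seq", epochs=2, learning_rate=1e-4, batch_size=128,
          warmup=100, encoder_frozen=True),
    Stage("stage3", "seq2seq", epochs=4, learning_rate=5e-5, batch_size=128,
          warmup=100, encoder_frozen=False),
)


def describe(stage: Stage) -> str:
    """One-line summary of a stage for logs or run names."""
    state = "frozen" if stage.encoder_frozen else "trainable"
    return (f"{stage.name}: {stage.objective}, {stage.epochs} epochs, "
            f"lr={stage.learning_rate:g}, batch={stage.batch_size}, encoder {state}")
```

Stage 2 trains the new decoder against a fixed encoder at the higher learning rate, and stage 3 then fine-tunes everything at the lower one.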
The neural baseline closes most of the transliteration gap, but performance remains uneven across historically and stylistically distinct genres.
chrF by genre
- Overall: Dictionary 61.22, Neural 97.54
- Administrative: Dictionary 63.15, Neural 98.14
- Royal inscription: Dictionary 54.58, Neural 95.15
- Literary: Dictionary 37.73, Neural 90.67
- Liturgy: Dictionary 55.92, Neural 77.68
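The per-genre gap between the two baselines follows directly from the scores above. A short sketch of that arithmetic, with illustrative variable names:

```python
# (dictionary chrF, neural chrF) per genre, copied from the results above.
SCORES = {
    "Overall": (61.22, 97.54),
    "Administrative": (63.15, 98.14),
    "Royal inscription": (54.58, 95.15),
    "Literary": (37.73, 90.67),
    "Liturgy": (55.92, 77.68),
}


def gains(scores):
    """Neural minus dictionary chrF for each genre, rounded to 2 places."""
    return {
        genre: round(neural - dictionary, 2)
        for genre, (dictionary, neural) in scores.items()
    }


GAINS = gains(SCORES)
largest_gain = max(GAINS, key=GAINS.get)
lowest_neural = min(SCORES, key=lambda genre: SCORES[genre][1])
```

Literary texts show the largest gain (52.94 points), while Liturgy has both the smallest gain (21.76 points) and the lowest neural score (77.68).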
Interpretation
Genre imbalance still matters
Operational Value
Useful now as expert-in-the-loop tooling
SumTablets is infrastructure: it turns transliteration from a one-off manual act into a reproducible ML problem with public benchmarks and reusable assets.
Research trajectory
From transliteration to restoration and translation
Scholarly unlock
Scale philological workflows without flattening nuance
My contribution
End-to-end technical ownership
Every major component is public and reusable.