SumTablets | Cole Simmons

Corpus size: 91,606 tablets / 6,970,407 glyphs
Source coverage: Readings to glyph 99.93%; glyph to Unicode 99.96%
Evaluation: Character-level chrF, averaged by tablet on held-out test data
Model delta: Weighted dictionary 61.22 → neural baseline 97.54 chrF

The dataset pairs source glyph sequences with transliterations for 91,606 tablets (about 6.97M glyphs), preserving structural context via special tokens for surfaces, line breaks, rulings, columns, blank space, and breakage.

On a stratified 90/5/5 split by historical period, a weighted dictionary sampler reaches 61.22 chrF, while an XLM-R-initialized encoder-decoder reaches 97.54 chrF. The objective is practical: speed philologist review, target uncertain readings, and make downstream restoration/translation pipelines tractable.

The bottleneck was not model choice first. It was a stable, machine-usable pairing of glyph strings with transliteration strings that preserves tablet structure and Assyriological conventions.

Input glyphs 𒀀𒂗𒆤 𒈦𒆳𒆳𒊏 𒀊𒁀𒀀𒀀𒌷𒉈𒆤 ...

Output transliteration {d}en-lil2 lugal kur-kur-ra ab-ba dingir-dingir-re2-ne-ke4 ...

Step 01 / Ingest + Normalize

Standardize transliterations from ePSD2/Oracc pipelines

Source transliterations are cleaned and canonicalized into a consistent format suitable for glyph-to-text learning, while retaining period and genre metadata for each tablet ID.

Step 02 / Reading → Glyph Mapping

Recover source glyphs from transliterated readings

A two-stage mapping converts transliterated readings to glyph names and then glyph names to Unicode cuneiform signs. Successful conversion retained 6,724,498 readings (99.93%) and 6,638,081 Unicode glyph mappings (99.96%).

Step 03 / Structural Fidelity

Encode tablet layout as aligned special tokens

Structure tokens are inserted in both glyph and transliteration streams so models do not lose tablet-level formatting that affects reading constraints.

[<SURFACE>] [ ] [...] [<RULING>] [<COLUMN>] [<BLANK_SPACE>]

Periods: 10 labeled periods for temporally aware evaluation and transfer studies.
Genres: 14 labeled genres exposing style/domain shift across administrative, literary, legal, and ritual texts.
Partitioning: 90/5/5 train/val/test, stratified by period to reduce period leakage and stabilize genre mix.
Lexical handling: Lexical texts excluded from validation/test and added to training only.

Two baselines establish the floor and ceiling: a weighted reading sampler and a multilingual transformer encoder-decoder with task-specific tokenization.

Baseline A / Non-neural

Weighted dictionary sampler

For each glyph, sample from known readings proportional to observed frequency. The weighted mean number of readings per glyph is 6.75.

This baseline establishes how far frequency-only disambiguation can go without context modeling.

[lookup sampling] [frequency prior] [no sequence context]

Baseline B / Neural

XLM-R initialized encoder-decoder

Encoder and decoder both initialize from a 279M-parameter XLM-R checkpoint, then are adapted for glyph-to-transliteration generation.

Custom SentencePiece vocabularies: 632 glyph tokens and 1024 transliteration tokens, each including 11 shared special tokens.

[XLM-R] [SentencePiece] [seq2seq] [beam search]

Stage 1: Encoder MLM pretraining: 50 epochs, seq len 64, LR 5e-5, batch 2048, mask prob 0.10, warmup 200.
Stage 2: Joint model with frozen encoder: LR 1e-4, 2 epochs, batch 128, warmup 100.
Stage 3: Joint model with unfrozen encoder: LR 5e-5, 4 epochs, batch 128, warmup 100.
Inference + Compute: Beam size 5, AdamW optimization, trained on single A100 SXM 80GB.

The neural baseline closes most of the transliteration gap, but performance remains uneven across historically and stylistically distinct genres.

Overall

Dictionary: 61.22
Neural: 97.54

Administrative

Dictionary: 63.15
Neural: 98.14

Royal inscription

Dictionary: 54.58
Neural: 95.15

Literary

Dictionary: 37.73
Neural: 90.67

Liturgy

Dictionary: 55.92
Neural: 77.68

Category	Dictionary chrF	Neural chrF
Overall	61.22	97.54
Administrative	63.15	98.14
Royal inscription	54.58	95.15
Literary	37.73	90.67
Liturgy	55.92	77.68

Interpretation

Genre imbalance still matters

Administrative texts dominate training distribution and also yield the highest scores. Lower-resource genres (especially liturgical/literary) remain a meaningful error source, suggesting gains from balanced curriculum sampling and uncertainty-aware decoding.

Operational Value

Useful now as expert-in-the-loop tooling

Even with residual errors, high chrF at corpus scale makes the model practical for draft transliteration generation and triaging legacy transliterations for specialist review rather than full manual first-pass reading.

SumTablets is infrastructure: it turns transliteration from a one-off manual act into a reproducible ML problem with public benchmarks and reusable assets.

Research trajectory

From transliteration to restoration and translation

A strong glyph-to-transliteration model is a prerequisite for damaged-tablet restoration, uncertainty estimation over readings, and low-resource Akkadian/Sumerian translation pipelines.

Scholarly unlock

Scale philological workflows without flattening nuance

By preserving layout and metadata, the dataset supports models that can respect period, genre, and document structure rather than collapsing tablets into context-free token streams.

My contribution

End-to-end technical ownership

Led data design, preprocessing, Unicode mapping, split strategy, tokenizer retraining, baseline implementation, training, evaluation, and release packaging across ACL paper, repository, and Hugging Face assets.

Every major component is public and reusable.