
Saving a Dying Dictionary: Archiving the SEAlang Javanese Language Resources

I wrote a structure-aware Python scraper that archives the surviving SEAlang resources into a local SQLite database.


Background

I have been studying Javanese for a while, not at an academic level, but seriously enough to work through kakawin literature and to want proper tools for doing so. Javanology, the academic study of Javanese language, literature, and culture, is a field dominated by Dutch and Indonesian scholarship, where even the most authoritative references (Zoetmulder's Old Javanese-English Dictionary, Horne's A Javanese-English Dictionary) are written in English or Dutch rather than the language they document. The people whose heritage is being studied have largely been written about rather than written to.

For the past decade, one of my primary reference tools has been the SEAlang Library Javanese resources at sealang.net. Built with US Department of Education TICFIA grant funding, SEAlang provided something rare: a searchable, structured interface to multiple Javanese lexicographic and corpus resources in one place. It was not flashy. It ran on CGI Perl scripts that were old when I first used it. But it worked, and for a language as underserved by digital infrastructure as Javanese, "it works" is not a low bar.

Recently I noticed the SSL certificate had lapsed. The site had gotten slower. These are not good signs, and they are the kind of signs that tend to precede a resource quietly disappearing from the web.

Today I decided to find out how bad it was, and to do something about it before it was too late. This work connects directly to a larger project I have been building: aksara.ts, a TypeScript library for bidirectional transliteration between Latin-script Javanese and Aksara Jawa (Hanacaraka), paired with a neural word segmenter. The SEAlang resources are training data infrastructure for that pipeline.


What SEAlang Was

The SEAlang Library Javanese section contains several distinct resources:

SEAlang was funded by TICFIA, a US Department of Education program specifically targeting less commonly taught languages. According to the SEAlang project, the work was largely completed before the funding ended; the site has continued running since. It is running now. But running is not the same as maintained, and a lapsed SSL certificate is the kind of deferred maintenance that precedes larger failures.


What I Found

I wrote a Python scraper to probe the live site and understand its structure before committing to any archival strategy. The picture is mixed.

Some endpoints are responding cleanly. Others are returning errors that suggest missing backend data files, though I am still working through whether those failures are permanent or intermittent server instability. I am not going to call anything definitively gone until I have run more complete scraping attempts and confirmed the pattern holds.
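One way to separate permanent failures from intermittent ones is to probe each endpoint several times and look at the pattern of outcomes. The sketch below is illustrative rather than the scraper's actual code: the function names, the attempt count, and the three labels are all assumptions. It also only looks at HTTP status codes, so a CGI script that returns 200 with an error message in the body would need an additional check on the response text.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable


def http_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status for `url`; dropped connections raise OSError."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status code worth recording.
        return exc.code


def classify_endpoint(
    fetch: Callable[[str], int],
    url: str,
    attempts: int = 5,
    delay: float = 2.0,
) -> str:
    """Probe `url` repeatedly and summarise what happened.

    Returns "healthy" (every attempt succeeded), "failing" (every attempt
    failed, so possibly permanent), or "intermittent" (mixed results).
    """
    outcomes = Counter()
    for attempt in range(attempts):
        try:
            status = fetch(url)
            outcomes["ok" if 200 <= status < 300 else "error"] += 1
        except OSError:
            # Timeouts, resets, and DNS failures all land here.
            outcomes["error"] += 1
        if attempt < attempts - 1:
            time.sleep(delay)  # be gentle with an already fragile server
    if outcomes["error"] == 0:
        return "healthy"
    if outcomes["ok"] == 0:
        return "failing"
    return "intermittent"
```

In use, `classify_endpoint(http_status, url)` would be run on each endpoint across several sessions; only an endpoint that stays "failing" over days is a reasonable candidate for "gone".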

What I can say: the OJED is fully intact and queryable. 25,568 entries of Zoetmulder's Old Javanese-English Dictionary, still being served by a working Perl CGI script. I have also located a corpus search endpoint at a URL different from the one documented; it is serving KWIC (Key Word in Context) concordance results for the full Javanese Wikipedia corpus. Both of these are the priority targets for archival, and both are actively being scraped.

The scraping work is ongoing. I will have cleaner numbers on coverage once the runs complete.


What I Built

The Scraper

I wrote a structure-aware Python scraper that archives the surviving SEAlang resources into a local SQLite database. The scraper had to be built iteratively: the site's actual URL structure and data formats were only discoverable by probing the live server, since the SSL certificate lapse meant HTTPS was unavailable and the documentation was minimal.

Key design decisions:

OJED Results So Far

The OJED scraper queries the dictionary alphabetically, including Sanskrit-derived diacritic initials (ā, ī, ś, ṣ, ṭ, ḍ, ṅ etc.) that standard ASCII enumeration would miss. Preliminary results:

The ~880 missing entries are attributable to server instability during queries for specific diacritic initials; the server dropped connections on several requests. These should be recoverable in a future resume run.
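To make the resume idea concrete, here is a minimal sketch of per-initial progress tracking in SQLite. It is not the scraper's real schema: the table names, the `fetch_entries` callable, and the exact list of initials are assumptions for illustration (the list covers only ASCII a-z plus the diacritic letters named above, and the real dictionary may need more).

```python
import sqlite3
import string

# Assumed initials: ASCII letters plus the diacritic letters mentioned above.
DIACRITIC_INITIALS = ["ā", "ī", "ś", "ṣ", "ṭ", "ḍ", "ṅ"]
INITIALS = list(string.ascii_lowercase) + DIACRITIC_INITIALS


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) tables and register every initial as pending."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress ("
        "initial TEXT PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "headword TEXT, body TEXT, UNIQUE(headword, body))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO progress (initial) VALUES (?)",
        [(i,) for i in INITIALS],
    )
    conn.commit()


def pending(conn: sqlite3.Connection) -> list:
    """Initials that have not yet been fetched successfully."""
    rows = conn.execute(
        "SELECT initial FROM progress WHERE done = 0 ORDER BY rowid"
    )
    return [r[0] for r in rows]


def run(conn: sqlite3.Connection, fetch_entries) -> None:
    """Fetch every pending initial; failed ones stay pending for the next run.

    `fetch_entries(initial)` is a stand-in for the real HTTP query and must
    return a list of (headword, body) tuples.
    """
    for initial in pending(conn):
        try:
            rows = fetch_entries(initial)
        except OSError:
            continue  # dropped connection: retry on a later run
        with conn:  # store entries and mark the initial done atomically
            conn.executemany(
                "INSERT OR IGNORE INTO entries VALUES (?, ?)", rows
            )
            conn.execute(
                "UPDATE progress SET done = 1 WHERE initial = ?", (initial,)
            )
```

Because an initial is only marked done in the same transaction that stores its entries, re-running `run` after a crash or a dropped connection picks up exactly the initials that are still missing, and `INSERT OR IGNORE` keeps repeated rows out of the archive.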

Corpus Results So Far

The corpus required a different approach. Rather than a finite enumerable dataset like the dictionary, the corpus is only accessible via concordance search. You query a word and receive KWIC context windows showing that word in use. There is no bulk download endpoint.

I implemented a breadth-first search crawler that treats this constraint as a graph problem:

The crawl seeds from an initial word list, then expands autonomously; the corpus maps its own vocabulary. As of this writing, the crawl has discovered over 64,000 unique Javanese word forms and collected over 130,000 KWIC concordance rows: segmented Javanese sentence fragments with known word boundaries.
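The breadth-first expansion can be sketched in a few lines. This is an illustration under stated assumptions, not the crawler itself: `fetch_kwic` is a hypothetical callable that returns the concordance lines for one query word, the tokenizer is a simple letter-based regex, and a real crawl would also need rate limiting and persistent state.

```python
import re
from collections import deque

# Assumed tokenizer: runs of Unicode letters, allowing internal - or '.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def tokenize(text: str) -> list:
    """Split a concordance line into lowercase word forms."""
    return [w.lower() for w in WORD_RE.findall(text)]


def crawl(seeds, fetch_kwic, max_queries: int = 1000):
    """Breadth-first crawl of a concordance-only corpus.

    Each word is a node; every new word seen in its concordance lines
    becomes a node to query later. Returns (seen_words, rows), where rows
    are (query_word, concordance_line) pairs.
    """
    seen = set()
    queue = deque()
    for seed in seeds:
        word = seed.lower()
        if word not in seen:
            seen.add(word)
            queue.append(word)

    rows = []
    queries = 0
    while queue and queries < max_queries:
        word = queue.popleft()  # FIFO order is what makes this breadth-first
        queries += 1
        for line in fetch_kwic(word):
            rows.append((word, line))
            for token in tokenize(line):
                if token not in seen:
                    seen.add(token)
                    queue.append(token)
    return seen, rows
```

The `seen` set is what keeps the crawl finite: each word form is queried at most once, so the crawl ends when the queue drains or the query budget runs out.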

The crawl is still running.


How This Connects to aksara.ts

This archival work is not just preservation for its own sake. It feeds directly into aksara.ts, the transliteration library and neural word segmenter introduced above.

The core problem aksara.ts addresses: Aksara Jawa is essentially invisible to AI systems. Large language models were trained on internet-scale text; Aksara Jawa has almost none of it. When a model encounters ꦲꦤꦕꦫꦏ, it sees a sequence of rare Unicode codepoints with no semantic weight attached. Thousands of Javanese manuscripts exist only in physical form. Digitising them with OCR produces Unicode output that no downstream AI tool can use without a transliteration layer sitting in between.

aksara.ts is that layer. The intended pipeline:

manuscript image → OCR → Aksara Unicode → fromAksara() → Segmenter → readable Javanese → LLM

The segmenter, a BiLSTM model trained on character sequences, is the component that requires the most data. Its current training corpus is software UI translation strings from TranslateWiki via OPUS: clean, well-segmented modern Javanese, but formulaic and narrow in vocabulary. Words common in manuscript sources, such as lamun, yén, and hutama, appear rarely or not at all.

The 130,000+ KWIC concordance rows from the SEAlang corpus crawl are the foundation for changing that. Segmented Javanese Wikipedia text represents a substantially broader register than software localisation strings, and the OJED entries, with their textual citations from classical kakawin sources, push further toward the literary vocabulary the segmenter needs to handle manuscript material reliably.
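For a character-level segmenter, text with known word boundaries is typically turned into an unspaced character string plus one label per character. The sketch below shows one common encoding (1 for the first character of a word, 0 otherwise); the label scheme and function name are assumptions for illustration, not necessarily what the aksara.ts segmenter uses.

```python
import unicodedata


def boundary_labels(tokens):
    """Turn a list of words into (unspaced_text, labels).

    labels[i] is 1 if character i starts a word, else 0. Tokens are
    NFC-normalised so that letters like é count as a single character.
    """
    chars = []
    labels = []
    for token in tokens:
        token = unicodedata.normalize("NFC", token)
        for i, ch in enumerate(token):
            chars.append(ch)
            labels.append(1 if i == 0 else 0)
    return "".join(chars), labels


text, labels = boundary_labels(["lamun", "hutama"])
print(text)    # lamunhutama
print(labels)  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
```

The spaces are dropped on purpose: the model has to learn to put them back, which is the task it faces on unspaced transliterated text.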


Why This Matters

Javanese is spoken by approximately 80 million people. It has one of the longest continuous written traditions of any language on earth, from the 8th-century Canggal inscription through the great kakawin period of the 10th-14th centuries to the present day. The Nagarakretagama, composed by Mpu Prapanca in 1365, is a work of literary ambition that rewards the same close reading applied to Dante's Divine Comedy.

Yet digital infrastructure for Javanese NLP is badly underdeveloped. The tools that exist for high-resource languages (robust corpora, segmentation models, named entity recognizers, morphological analyzers) are largely absent for Javanese. SEAlang was part of the thin layer of infrastructure that existed, and it is not being actively maintained.

The archive I am building preserves what survives and generates training data for downstream NLP applications, specifically the word segmentation model at the heart of the aksara.ts pipeline.


What Comes Next

This is part one of what will be an ongoing series. In subsequent posts I will cover more findings as I make them.

The scraper will be open source. When the corpus crawl completes and the codebase is clean, I will publish it on GitHub. The code contains no dictionary data, only the tooling to retrieve and archive it. Anyone who runs it should do so for personal study only; the underlying dictionary content remains the intellectual property of its respective copyright holders.

If you are a researcher, linguist, or language enthusiast working with Javanese or other underserved Austronesian languages, I hope this work is useful to you. If you know of other Javanese digital resources that are at risk, please reach out.


The scraper will be published at sealang_jv when ready. aksara.ts is available now at aksara.ts.