Stamps Scholar Research
Building the first neural machine translation system for Adja, a language spoken by over a million people with zero digital infrastructure.
“A language spoken by over a million people shouldn't be invisible to every machine on earth.”
The Problem
Adja is a Gbe language spoken across southern Benin and Togo. Over a million people use it daily. Yet until this project, there was no dictionary, no translation tool, no dataset, and no NLP system for it. Not one.
This is not unusual. The vast majority of the world's 7,000+ languages have no digital presence. But Adja is my family's language, the one I grew up hearing but never learned, and the absence felt personal before it became a research question.
The Davis Peace Project was where this started: fieldwork in Benin creating the first French-Adja sentence corpus from scratch. This research is where it becomes a system.
The Insight
You cannot train a modern NMT system on 10,000 sentences. Large models need millions. But what if you do not start from zero?
Adja belongs to the Gbe language family, which includes Fon and Ewe. Fon has more digital resources and a closer linguistic relationship to Adja than, say, Swahili or Yoruba. The core insight of this work is that a model pretrained on Fon-French translation can transfer that knowledge to Adja-French, even with a small fine-tuning dataset.
This is not just a trick. It reflects how these languages actually relate to each other: shared grammar, overlapping vocabulary, similar tonal patterns. The model learns the “shape” of Gbe languages first, then specializes.
The Method
The pipeline has three stages:
- Corpus creation. We built a 10,000+ sentence parallel corpus through community translation workshops in Benin. French sentences were read aloud, translated into Adja by elders, transcribed phonetically, then digitized. Every sentence passed through four people before it entered the corpus.
- Transfer learning. We pretrain a Transformer-based model on the larger Fon-French corpus from the FFR dataset, then fine-tune on our Adja-French data. The model inherits Gbe-family linguistic patterns before ever seeing Adja.
- Few-shot adaptation. We apply few-shot techniques to maximize performance from limited data: curriculum learning, back-translation for data augmentation, and careful decontamination to ensure honest evaluation.
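Two of the steps above, decontamination and curriculum ordering, can be sketched in a few lines. The corpus contents, sizes, and helper names below are illustrative placeholders, not the project's actual data or code:

```python
# Sketch of the fine-tuning data preparation. All pairs and helper
# names are hypothetical stand-ins for the real Adja-French corpus.

def decontaminate(train_pairs, eval_pairs):
    """Drop any training pair whose French side also appears in the
    held-out evaluation set, so reported scores stay honest."""
    eval_sources = {src.strip().lower() for src, _ in eval_pairs}
    return [(src, tgt) for src, tgt in train_pairs
            if src.strip().lower() not in eval_sources]

def curriculum_order(pairs):
    """Simple curriculum: fine-tune on short sentences first,
    then progressively longer ones."""
    return sorted(pairs, key=lambda p: len(p[0].split()))

# Toy stand-ins; "[adja N]" marks a placeholder translation.
adja_train = [
    ("bonjour", "[adja 1]"),
    ("comment allez-vous ce matin", "[adja 2]"),
    ("merci", "[adja 3]"),
]
adja_eval = [("merci", "[adja 3]")]

clean = decontaminate(adja_train, adja_eval)    # "merci" pair removed
ordered = curriculum_order(clean)               # shortest sentence first
```

The same ordering would feed the fine-tuning loop after pretraining on the larger Fon-French corpus; the pretraining stage itself is standard Transformer training and is omitted here.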
Results
- 10,000+ French-Adja sentence pairs, the largest corpus for this language
- BLEU improvement over the baseline with the transfer learning approach
- Achieved with zero prior digital language technology
[Chart: BLEU Score Progression. Illustrative of the trend; final decontaminated BLEU scores will be reported in the paper.]
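For readers unfamiliar with the metric: BLEU scores a system by n-gram overlap between its output and human reference translations, discounted by a brevity penalty. A minimal sketch for intuition only (this is not the project's evaluation code; real evaluations should use standard tooling such as sacreBLEU):

```python
# Minimal corpus-level BLEU: uniform weights over 1- to 4-grams,
# one reference per hypothesis, whitespace tokenization.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # total hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            # Counter intersection clips each n-gram count at the reference's
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches:
        return 0.0  # an empty n-gram level zeroes the geometric mean
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

A perfect match scores 100, a translation sharing no words scores 0, and partial overlap lands in between, which is why even single-digit BLEU gains are meaningful at this resource level.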
A Framework, Not Just a Model
The point of this work is not one translation system for one language. It is a replicable framework that any language community can follow: build a small corpus through community workshops, identify a related higher-resource language for transfer learning, and fine-tune with few-shot techniques.
The Gbe language family alone has dozens of members. West Africa has hundreds of languages in similar positions: spoken by millions, invisible to machines. If this pipeline works for Adja, it can work for them too.
This is what I mean when I say I am building infrastructure, not just models. The goal is to lower the barrier so that the next researcher or community does not have to start from zero the way I did.
Team
Josue Godeme
Lead Researcher
Dartmouth College
Community Translators
Corpus Development
Mono & Couffo regions, Benin
What's Next
The paper, “Adja-French Neural Machine Translation: A Few-Shot Transfer Learning Approach,” is currently under review. I am also preparing a second paper evaluating LLM performance on low-resource West African languages more broadly.
On the product side, the long-term goal remains what it has always been: build the tools I could not find. A translation app. A dictionary. Resources that will help the next person who, like me, wants to learn their own language. The research makes the tools possible, and the tools make the research matter.
“The research makes the tools possible, and the tools make the research matter.”