Stamps Scholar Research
Building the first neural machine translation system for Adja, a language spoken by over a million people with zero digital infrastructure.
“A language spoken by over a million people shouldn't be invisible to every machine on earth.”
The Problem
Adja is a Gbe language spoken across southern Benin and Togo. Over a million people use it daily. Yet until this project, there was no dictionary, no translation tool, no dataset, and no NLP system for it. Not one.
This is not unusual. The vast majority of the world's 7,000+ languages have no digital presence. But Adja is my family's language, the one I grew up hearing but never learned, and the absence felt personal before it became a research question.
The Davis Peace Project was where this started: fieldwork in Benin creating the first French-Adja sentence corpus from scratch. This research is where it becomes a system.
The Insight
You cannot train a modern NMT system on 10,000 sentences. Large models need millions. But what if you do not start from zero?
Adja belongs to the Gbe language family, which includes Fon and Ewe. Fon has more digital resources and a closer linguistic relationship to Adja than, say, Swahili or Yoruba. The core insight of this work is that a model pretrained on Fon-French translation can transfer that knowledge to Adja-French, even with a small fine-tuning dataset.
This is not just a trick. It reflects how these languages actually relate to each other: shared grammar, overlapping vocabulary, similar tonal patterns. The model learns the “shape” of Gbe languages first, then specializes.
The Method
The pipeline has three stages:
- Corpus creation. We built a 10,000+ sentence parallel corpus through community translation workshops in Benin. French sentences were read aloud, translated into Adja by elders, transcribed phonetically, then digitized. Every sentence passed through four people before it entered the corpus.
- Transfer learning. We pretrain a Transformer-based model on the larger Fon-French corpus from the FFR dataset, then fine-tune on our Adja-French data. The model inherits Gbe-family linguistic patterns before ever seeing Adja.
- Few-shot adaptation. We apply few-shot techniques to maximize performance from limited data: curriculum learning, back-translation for data augmentation, and careful decontamination to ensure honest evaluation.
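Two of the steps above, decontamination and curriculum ordering, can be sketched in a few lines. The corpus contents, sizes, and helper names below are illustrative placeholders, not the project's actual data or code:

```python
# Sketch of the fine-tuning data preparation. All pairs and helper
# names are hypothetical stand-ins for the real Adja-French corpus.

def decontaminate(train_pairs, eval_pairs):
    """Drop any training pair whose French side also appears in the
    held-out evaluation set, so reported scores stay honest."""
    eval_sources = {src.strip().lower() for src, _ in eval_pairs}
    return [(src, tgt) for src, tgt in train_pairs
            if src.strip().lower() not in eval_sources]

def curriculum_order(pairs):
    """Simple curriculum: fine-tune on short sentences first,
    then progressively longer ones."""
    return sorted(pairs, key=lambda p: len(p[0].split()))

# Toy stand-ins; "[adja N]" marks a placeholder translation.
adja_train = [
    ("bonjour", "[adja 1]"),
    ("comment allez-vous ce matin", "[adja 2]"),
    ("merci", "[adja 3]"),
]
adja_eval = [("merci", "[adja 3]")]

clean = decontaminate(adja_train, adja_eval)    # "merci" pair removed
ordered = curriculum_order(clean)               # shortest sentence first
```

The same ordering would feed the fine-tuning loop after pretraining on the larger Fon-French corpus; the pretraining stage itself is standard Transformer training and is omitted here.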
Results
- 10,000+ French-Adja sentence pairs, the largest corpus for this language
- BLEU improvement over the baseline with the transfer learning approach
- Achieved with zero prior digital language technology
[Chart: BLEU Score Progression. Illustrative of the trend; final decontaminated BLEU scores will be reported in the paper.]
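For readers unfamiliar with the metric: BLEU scores a system by n-gram overlap between its output and human reference translations, discounted by a brevity penalty. A minimal sketch for intuition only (this is not the project's evaluation code; real evaluations should use standard tooling such as sacreBLEU):

```python
# Minimal corpus-level BLEU: uniform weights over 1- to 4-grams,
# one reference per hypothesis, whitespace tokenization.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # total hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            # Counter intersection clips each n-gram count at the reference's
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches:
        return 0.0  # an empty n-gram level zeroes the geometric mean
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

A perfect match scores 100, a translation sharing no words scores 0, and partial overlap lands in between, which is why even single-digit BLEU gains are meaningful at this resource level.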
A Framework, Not Just a Model
The point of this work is not one translation system for one language. It is a replicable framework that any language community can follow: build a small corpus through community workshops, identify a related higher-resource language for transfer learning, and fine-tune with few-shot techniques.
The Gbe language family alone has dozens of members. West Africa has hundreds of languages in similar positions: spoken by millions, invisible to machines. If this pipeline works for Adja, it can work for them too.
This is what I mean when I say I am building infrastructure, not just models. The goal is to lower the barrier so that the next researcher or community does not have to start from zero the way I did.
Team
Josue Godeme
Lead Researcher
Dartmouth College
Community Translators
Corpus Development
Mono & Couffo regions, Benin
What's Next
The paper, “Adja-French Neural Machine Translation: A Few-Shot Transfer Learning Approach,” is currently under review. I am also preparing a second paper evaluating LLM performance on low-resource West African languages more broadly.
On the product side, the long-term goal remains what it has always been: build the tools I could not find. A translation app. A dictionary. Resources that will help the next person who, like me, wants to learn their own language. The research makes the tools possible, and the tools make the research matter.
“The research makes the tools possible, and the tools make the research matter.”