About Wikilangs
Making multilingual NLP accessible to everyone, one language at a time.
Mission
Wikilangs is an open-source initiative to democratize access to natural language processing models for every language represented on Wikipedia. We believe that language technology should be accessible to all communities, not just those with well-resourced languages.
By providing pre-trained tokenizers, language models, vocabularies, and embeddings for 358+ languages, we enable researchers, educators, and developers to work with languages that are often underserved by mainstream NLP tools.
Why Traditional Models?
While large language models have made impressive strides, they come with significant computational requirements and are predominantly trained on high-resource languages. Wikilangs takes a different approach:
- Lightweight: Models run on any hardware; no GPU required
- Interpretable: N-grams, Markov chains, and traditional embeddings are well-understood
- Extensible: Use our models to extend LLMs with new language support
- Research-ready: Every model includes comprehensive evaluation metrics
What We Provide
For each of the 358+ Wikipedia languages, we train and evaluate:
- BPE tokenizers in four vocabulary sizes (8k, 16k, 32k, 64k)
- N-gram language models (orders 2-5, word- and subword-level)
- Markov chain text generators (depth 1-5; see the sketch after this list)
- Vocabulary dictionaries with term frequency and inverse document frequency (IDF) scores
- Position-aware word embeddings via BabelVec
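
An order-d Markov chain conditions each next token on the previous d tokens, the same statistics that back a (d+1)-gram language model. The sketch below is a minimal, self-contained illustration of the idea; the function names and toy Esperanto corpus are ours for demonstration, not the Wikilangs API.

```python
import random
from collections import defaultdict

def train_markov(tokens, depth=2):
    """Map each length-`depth` context to the tokens observed right after it."""
    model = defaultdict(list)
    for i in range(len(tokens) - depth):
        context = tuple(tokens[i:i + depth])
        model[context].append(tokens[i + depth])
    return model

def generate(model, depth=2, length=12, seed=None):
    """Walk the chain: repeatedly sample a follower of the last `depth` tokens."""
    rng = random.Random(seed)
    out = list(rng.choice(list(model)))   # random starting context
    while len(out) < length:
        followers = model.get(tuple(out[-depth:]))
        if not followers:                 # dead end: context never observed
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "la kato sidas sur la mato kaj la kato dormas sur la lito".split()
model = train_markov(corpus, depth=2)
print(generate(model, depth=2, length=10, seed=0))
```

Because the model is just a context-to-followers table, it trains in a single pass over the corpus and runs comfortably on a CPU, which is exactly the lightweight, interpretable behavior highlighted above.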
Each language repository includes a comprehensive research report with ablation studies, visualizations, and recommendations for production use.
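
Since artifacts are published as repositories on HuggingFace, a tokenizer can plausibly be fetched with the standard `huggingface_hub` and `tokenizers` libraries. The repo id and filename below are hypothetical; check the model cards at https://huggingface.co/wikilangs for the actual per-language naming scheme.

```python
from huggingface_hub import hf_hub_download  # pip install huggingface_hub tokenizers
from tokenizers import Tokenizer

# Hypothetical repo id and filename: consult the wikilangs model cards
# for the real repository layout.
path = hf_hub_download(repo_id="wikilangs/eo-tokenizer-32k", filename="tokenizer.json")
tok = Tokenizer.from_file(path)

enc = tok.encode("Saluton, mondo!")
print(enc.tokens)  # subword pieces
print(enc.ids)     # vocabulary ids
```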
Data Source
All models are trained on wikipedia-monthly, a regularly updated dataset of Wikipedia articles across all languages. This keeps our models current with the evolving content of Wikipedia.
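
For reference, a snapshot can typically be loaded with the `datasets` library. The dataset path, config name, and column names below are assumptions; consult the wikipedia-monthly dataset card for the exact repo id and snapshot/language naming.

```python
from datasets import load_dataset  # pip install datasets

# Hypothetical path and config: check the wikipedia-monthly dataset card
# for the real repo id and the snapshot/language naming scheme.
wiki = load_dataset("omarkamali/wikipedia-monthly", "latest.eo", split="train")
print(len(wiki))            # number of articles in the snapshot
print(wiki[0]["title"])     # assuming a "title" column, as is typical for Wikipedia datasets
```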
Related Projects
BabelVec
Position-aware, word-based sentence embeddings, designed for CPU training and inference and for small training corpora.
Wikisets
Flexible Wikipedia dataset builder for composing your own combination of languages, dates, and sizes.
Wikipedia Monthly
Monthly snapshots of Wikipedia articles across 300+ languages on HuggingFace.
Citation
If you use Wikilangs in your research, please cite:
@misc{wikilangs2025,
  title       = {Wikilangs: Open NLP Models for Wikipedia Languages},
  author      = {Kamali, Omar},
  year        = {2025},
  publisher   = {HuggingFace},
  url         = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}