About Wikilangs

Making multilingual NLP accessible to everyone, one language at a time.

Mission

Wikilangs is an open-source initiative to democratize access to natural language processing models for every language represented on Wikipedia. We believe that language technology should be accessible to all communities, not just those with well-resourced languages.

By providing pre-trained tokenizers, language models, vocabularies, and embeddings for 358+ languages, we enable researchers, educators, and developers to work with languages that are often underserved by mainstream NLP tools.

Why Traditional Models?

While large language models have made impressive strides, they come with significant computational requirements and are predominantly trained on high-resource languages. Wikilangs takes a different approach:

  • Lightweight: Models run on commodity CPUs; no GPU required
  • Interpretable: N-grams, Markov chains, and traditional embeddings are well-understood (see the sketch after this list)
  • Extensible: Use our models to extend LLMs with new language support
  • Research-ready: Every model includes comprehensive evaluation metrics
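
These components are simple enough to sketch in a few lines. Below is a minimal, illustrative depth-2 Markov chain text generator in plain Python. The function names and toy corpus are ours, not the Wikilangs API; the sketch only shows the kind of model being described.

import random
from collections import defaultdict

def train_markov(tokens, depth=2):
    # Map every context of `depth` consecutive tokens to the tokens
    # observed immediately after it in the training text.
    chain = defaultdict(list)
    for i in range(len(tokens) - depth):
        context = tuple(tokens[i:i + depth])
        chain[context].append(tokens[i + depth])
    return chain

def generate(chain, seed, depth=2, length=20):
    # Extend the seed by repeatedly sampling a successor of the most
    # recent `depth` tokens; stop if the context was never seen.
    out = list(seed)
    for _ in range(length):
        successors = chain.get(tuple(out[-depth:]))
        if not successors:
            break
        out.append(random.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug".split()
chain = train_markov(corpus, depth=2)
print(generate(chain, seed=("the", "cat")))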

What We Provide

For each of the 358+ Wikipedia languages, we train and evaluate:

  • BPE tokenizers in 4 vocabulary sizes (8k, 16k, 32k, 64k)
  • N-gram language models (orders 2-5, word- and subword-level)
  • Markov chain text generators (depth 1-5)
  • Vocabulary dictionaries with frequency and IDF scores (illustrated after this list)
  • Position-aware word embeddings via BabelVec
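
As a reminder of what the vocabulary statistics represent, here is a minimal sketch of corpus-wide frequency and the standard idf(t) = log(N / df(t)) formulation, computed over a toy corpus. The exact smoothing or normalization Wikilangs applies may differ; consult a language repository's report for the precise definition.

import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "birds sing in the morning".split(),
]

freq = Counter(tok for doc in docs for tok in doc)      # corpus-wide counts
df = Counter(tok for doc in docs for tok in set(doc))   # documents containing each token
idf = {tok: math.log(len(docs) / df[tok]) for tok in df}

print(freq["the"], round(idf["the"], 3))   # frequent word, low IDF
print(freq["dog"], round(idf["dog"], 3))   # rare word, higher IDF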

Each language repository includes a comprehensive research report with ablation studies, visualizations, and recommendations for production use.
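
As an example of how these artifacts can be consumed, the sketch below loads a BPE tokenizer from the Hugging Face Hub with the huggingface_hub and tokenizers libraries. The repository id and file name are assumptions for illustration only; check the individual language repository for the actual layout.

from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

path = hf_hub_download(
    repo_id="wikilangs/sw-tokenizer-32k",  # hypothetical repo id
    filename="tokenizer.json",             # hypothetical file name
)
tok = Tokenizer.from_file(path)
print(tok.encode("Habari ya dunia").tokens)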

Data Source

All models are trained on wikipedia-monthly, a regularly updated dataset of Wikipedia articles across all languages. This ensures our models stay current with the evolving content of Wikipedia.
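
To reproduce or extend the training, the data can be pulled with the datasets library. The dataset id, configuration name, and field name below are assumptions for illustration; see the wikipedia-monthly dataset card for the exact values.

from datasets import load_dataset

ds = load_dataset(
    "omarkamali/wikipedia-monthly",  # hypothetical Hub id
    "latest.sw",                     # hypothetical config (snapshot + language)
    split="train",
    streaming=True,                  # avoid downloading the full dump
)
for article in ds.take(3):
    print(article["title"])          # field name assumed from typical Wikipedia dumps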

Citation

If you use Wikilangs in your research, please cite:

@misc{wikilangs2025,
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  author = {Kamali, Omar},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}

Get in Touch

Do you have questions or suggestions, or would you like to collaborate?