About Wikilangs

Making multilingual NLP accessible to everyone, one language at a time.

Mission

Wikilangs is an open-source initiative to democratize access to natural language processing models for every language represented on Wikipedia. We believe that language technology should be accessible to all communities, not just those with well-resourced languages.

By providing pre-trained tokenizers, language models, vocabularies, and embeddings for 358+ languages, we enable researchers, educators, and developers to work with languages that are often underserved by mainstream NLP tools.

Why Traditional Models?

While large language models have made impressive strides, they come with significant computational requirements and are predominantly trained on high-resource languages. Wikilangs takes a different approach:

  • Lightweight: Models run on commodity CPUs; no GPU required
  • Interpretable: N-grams, Markov chains, and traditional embeddings are well-understood (see the sketch after this list)
  • Extensible: Use our models to extend LLMs with new language support
  • Research-ready: Every model includes comprehensive evaluation metrics
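
These components are simple enough to sketch in a few lines. Below is a minimal, illustrative depth-2 Markov chain text generator in plain Python. The function names and toy corpus are ours, not the Wikilangs API; the sketch only shows the kind of model being described.

import random
from collections import defaultdict

def train_markov(tokens, depth=2):
    # Map every context of `depth` consecutive tokens to the tokens
    # observed immediately after it in the training text.
    chain = defaultdict(list)
    for i in range(len(tokens) - depth):
        context = tuple(tokens[i:i + depth])
        chain[context].append(tokens[i + depth])
    return chain

def generate(chain, seed, depth=2, length=20):
    # Extend the seed by repeatedly sampling a successor of the most
    # recent `depth` tokens; stop if the context was never seen.
    out = list(seed)
    for _ in range(length):
        successors = chain.get(tuple(out[-depth:]))
        if not successors:
            break
        out.append(random.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug".split()
chain = train_markov(corpus, depth=2)
print(generate(chain, seed=("the", "cat")))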

What We Provide

For each of the 358+ Wikipedia languages, we train and evaluate:

  • BPE tokenizers in 4 vocabulary sizes (8k, 16k, 32k, 64k)
  • N-gram language models (orders 2-5, word- and subword-level)
  • Markov chain text generators (depth 1-5)
  • Vocabulary dictionaries with frequency and IDF scores (illustrated after this list)
  • Position-aware word embeddings via BabelVec
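
As a reminder of what the vocabulary statistics represent, here is a minimal sketch of corpus-wide frequency and the standard idf(t) = log(N / df(t)) formulation, computed over a toy corpus. The exact smoothing or normalization Wikilangs applies may differ; consult a language repository's report for the precise definition.

import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "birds sing in the morning".split(),
]

freq = Counter(tok for doc in docs for tok in doc)      # corpus-wide counts
df = Counter(tok for doc in docs for tok in set(doc))   # documents containing each token
idf = {tok: math.log(len(docs) / df[tok]) for tok in df}

print(freq["the"], round(idf["the"], 3))   # frequent word, low IDF
print(freq["dog"], round(idf["dog"], 3))   # rare word, higher IDF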

Each language repository includes a comprehensive research report with ablation studies, visualizations, and recommendations for production use.
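
As an example of how these artifacts can be consumed, the sketch below loads a BPE tokenizer from the Hugging Face Hub with the huggingface_hub and tokenizers libraries. The repository id and file name are assumptions for illustration only; check the individual language repository for the actual layout.

from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

path = hf_hub_download(
    repo_id="wikilangs/sw-tokenizer-32k",  # hypothetical repo id
    filename="tokenizer.json",             # hypothetical file name
)
tok = Tokenizer.from_file(path)
print(tok.encode("Habari ya dunia").tokens)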

Data Source

All models are trained on wikipedia-monthly, a regularly updated dataset of Wikipedia articles across all languages. This ensures our models stay current with the evolving content of Wikipedia.
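
To reproduce or extend the training, the data can be pulled with the datasets library. The dataset id, configuration name, and field name below are assumptions for illustration; see the wikipedia-monthly dataset card for the exact values.

from datasets import load_dataset

ds = load_dataset(
    "omarkamali/wikipedia-monthly",  # hypothetical Hub id
    "latest.sw",                     # hypothetical config (snapshot + language)
    split="train",
    streaming=True,                  # avoid downloading the full dump
)
for article in ds.take(3):
    print(article["title"])          # field name assumed from typical Wikipedia dumps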

Citation

If you use Wikilangs in your research, please cite:

@misc{wikilangs2025,
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  author = {Kamali, Omar},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}

Get in Touch

Do you have questions or suggestions, or would you like to collaborate?