360+ languages · 318 with trained models
An open atlas of
language models
Pre-trained tokenizers, n-gram models, Markov chains, vocabularies, and embeddings for every Wikipedia language. Lightweight, research-ready, and built to run anywhere.
pip install wikilangs

Why traditional models still matter
Large language models are powerful but expensive, opaque, and skewed toward well-resourced languages. Wikilangs takes a different path: lightweight models you can inspect, extend, and run on a laptop — for every language with a Wikipedia.
No GPU required
N-grams, Markov chains, and BPE tokenizers run instantly on CPU. Deploy anywhere — a server, a classroom, a Raspberry Pi.
Interpretable
Every prediction has a traceable path. Compression ratios, perplexity, entropy — metrics you can reason about and publish.
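These metrics are ordinary information-theoretic quantities you can compute by hand. As a minimal sketch (pure Python, independent of the wikilangs API), entropy and perplexity fall out of a token probability distribution in a few lines:

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(token_probs):
    """Perplexity = 2 ** (average negative log2-probability per token)."""
    avg_nll = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_nll

# A uniform 4-way distribution carries exactly 2 bits of entropy,
# and a model that assigns p=0.5 to every token has perplexity 2.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(perplexity([0.5, 0.5]))             # 2.0
```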
Extensible
Merge our vocabularies into LLMs to add language support. Use embeddings as features. Combine models for novel tasks.
Research-ready
Every language ships with ablation studies, visualizations, and a model card. Cite with a DOI. Reproduce with a date stamp.
Five model families, one API
Every language gets the same comprehensive treatment. Install once, switch languages with a single parameter.
BPE tokenizers in SentencePiece format. HuggingFace compatible.
from wikilangs import tokenizer
tok = tokenizer('latest', 'en', 32000)
tok.tokenize("Hello, world!")

Language models for text scoring and next-token prediction.
from wikilangs import ngram
ng = ngram('latest', 'en', gram_size=3)
ng.score("Natural language processing")

Text generation with configurable context windows.
from wikilangs import markov
mc = markov('latest', 'en', depth=3)
mc.generate(length=50)

Word dictionaries with corpus statistics and prefix search.
from wikilangs import vocabulary
vocab = vocabulary('latest', 'en')
vocab.lookup("language")

Position-aware word embeddings via BabelVec.
from wikilangs import embeddings
emb = embeddings('latest', 'en', dimension=64)
emb.embed_word("language")

Extend large language models with new language support.
from wikilangs.llm import add_language_tokens
add_language_tokens(model, 'ary', 32000)

Explore languages
A sample from across the atlas. Each language has its own model suite, evaluation report, and interactive playground.
Amharic
Basque
Russia Buriat
Gun
Latin
Kalaallisut
Komering
Southern Sotho
What you can build
Practical applications — from classroom demos to production systems.
Language Detection
Score text against multiple language models to identify the source language.
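The idea can be sketched without any library at all: build a character n-gram profile per language and pick the profile that best matches the input. Everything below (the corpora, the scoring function) is an illustrative toy, not the wikilangs scoring API:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(corpus, n=3):
    """Character n-gram frequency profile from a sample text."""
    return Counter(char_ngrams(corpus.lower(), n))

def score(text, profile, n=3):
    """Sum of profile counts for the text's n-grams; higher = better match."""
    return sum(profile[g] for g in char_ngrams(text.lower(), n))

# Tiny stand-in corpora; real profiles would come from full Wikipedia dumps.
profiles = {
    "en": train("the quick brown fox jumps over the lazy dog the end"),
    "es": train("el rapido zorro marron salta sobre el perro perezoso"),
}

def detect(text):
    """Return the language whose profile scores the text highest."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))

print(detect("the dog"))   # en
print(detect("el perro"))  # es
```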
Autocomplete
Build text completion with prefix matching and n-gram predictions.
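A rough shape of both halves, using only the standard library (the word list and bigram counts here are made-up placeholders for a real vocabulary and n-gram model): binary search over a sorted vocabulary handles prefix matching, and bigram counts handle next-word prediction.

```python
from bisect import bisect_left
from collections import Counter, defaultdict

# Hypothetical mini-vocabulary, kept sorted for prefix search.
words = sorted(["language", "languages", "large", "model", "models", "modern"])

def complete(prefix, limit=3):
    """Up to `limit` vocabulary words starting with `prefix`."""
    i = bisect_left(words, prefix)
    out = []
    while i < len(words) and words[i].startswith(prefix) and len(out) < limit:
        out.append(words[i])
        i += 1
    return out

# Hypothetical bigram counts standing in for a trained n-gram model.
bigrams = defaultdict(Counter)
for prev, nxt in [("language", "model"), ("language", "models"), ("large", "model")]:
    bigrams[prev][nxt] += 1

def next_word(prev):
    """Most frequent continuation of `prev`, or None if unseen."""
    counts = bigrams[prev]
    return counts.most_common(1)[0][0] if counts else None

print(complete("lang"))      # ['language', 'languages']
print(next_word("large"))    # model
```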
Text Similarity
Measure semantic similarity for search, deduplication, and FAQ matching.
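With word embeddings in hand, similarity typically reduces to cosine similarity between vectors. A self-contained sketch (the vectors here are invented for illustration; real ones would come from an embedding model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0  (identical direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0  (orthogonal)
```

For deduplication or FAQ matching, sentence vectors are often built by averaging word vectors before comparing them the same way.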
Extend LLM Vocabulary
Add language support to models like LLaMA for low-resource languages.
Code-Switching Detection
Detect when text switches between languages — Spanglish, Hinglish, and more.
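One simple framing: tag each word with the language whose vocabulary contains it, then flag texts where more than one language appears. The word lists below are tiny hand-picked stand-ins for real per-language vocabularies:

```python
# Toy vocabularies; real ones would come from per-language word lists.
EN = set("the and is of to you what doing".split())
ES = set("el la es de que tu estas haciendo".split())

def tag_words(text):
    """Tag each word 'en', 'es', or '?' based on vocabulary membership."""
    tags = []
    for w in text.lower().split():
        if w in EN:
            tags.append("en")
        elif w in ES:
            tags.append("es")
        else:
            tags.append("?")
    return tags

def has_switch(text):
    """True if recognizable words come from more than one language."""
    known = [t for t in tag_words(text) if t != "?"]
    return len(set(known)) > 1

print(has_switch("what estas haciendo"))  # True
print(has_switch("the dog"))              # False
```

A production version would score each word (or sliding window) against full language models rather than membership in a word set, but the per-token tagging structure is the same.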
Anomaly Detection
Find gibberish, spam, and out-of-domain content using perplexity scoring.
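The mechanism is that in-domain text gets low perplexity under a model trained on normal text, while gibberish gets high perplexity. A minimal character-bigram version with add-alpha smoothing (all names and the toy corpus are illustrative, not the wikilangs API):

```python
import math
from collections import Counter

def train_bigram(text):
    """Character-bigram counts plus left-character counts for conditionals."""
    pairs = Counter(zip(text, text[1:]))
    chars = Counter(text[:-1])
    return pairs, chars

def perplexity(text, model, alpha=1.0, vocab=27):
    """Add-alpha smoothed bigram perplexity; higher = more anomalous."""
    pairs, chars = model
    nll = 0.0
    for a, b in zip(text, text[1:]):
        p = (pairs[(a, b)] + alpha) / (chars[a] + alpha * vocab)
        nll -= math.log2(p)
    return 2 ** (nll / max(len(text) - 1, 1))

model = train_bigram("the cat sat on the mat and the dog ran to the cat")
# Familiar text scores far lower than keyboard-mash gibberish.
print(perplexity("the cat", model) < perplexity("zqxvkj", model))  # True
```

Thresholding the score (e.g. flagging anything above a percentile measured on clean text) turns this into a spam or out-of-domain filter.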
Start exploring
Install the package and load your first model in under a minute.