360+ languages · 318 with trained models
An open atlas of language models
Pre-trained tokenizers, n-gram models, Markov chains, vocabularies, and embeddings for every Wikipedia language. Lightweight, research-ready, and built to run anywhere.
pip install wikilangs

Why traditional models still matter
Large language models are powerful but expensive, opaque, and skewed toward well-resourced languages. Wikilangs takes a different path: lightweight models you can inspect, extend, and run on a laptop — for every language with a Wikipedia.
No GPU required
N-grams, Markov chains, and BPE tokenizers run instantly on CPU. Deploy anywhere — a server, a classroom, a Raspberry Pi.
Interpretable
Every prediction has a traceable path. Compression ratios, perplexity, entropy — metrics you can reason about and publish.
Extensible
Merge our vocabularies into LLMs to add language support. Use embeddings as features. Combine models for novel tasks.
Research-ready
Every language ships with ablation studies, visualizations, and a model card. Cite with a DOI. Reproduce with a date stamp.
Five model families, one API
Every language gets the same comprehensive treatment. Install once, switch languages with a single parameter.
BPE tokenizers in SentencePiece format. HuggingFace compatible.
from wikilangs import tokenizer
tok = tokenizer('latest', 'en', 32000)
tok.tokenize("Hello, world!")

Language models for text scoring and next-token prediction.
from wikilangs import ngram
ng = ngram('latest', 'en', gram_size=3)
ng.score("Natural language processing")

Text generation with configurable context windows.
from wikilangs import markov
mc = markov('latest', 'en', depth=3)
mc.generate(length=50)

Word dictionaries with corpus statistics and prefix search.
from wikilangs import vocabulary
vocab = vocabulary('latest', 'en')
vocab.lookup("language")

Position-aware word embeddings via BabelVec.
from wikilangs import embeddings
emb = embeddings('latest', 'en', dimension=64)
emb.embed_word("language")

Extend large language models with new language support.
from wikilangs.llm import add_language_tokens
add_language_tokens(model, 'ary', 32000)

Explore languages
A sample from across the atlas. Each language has its own model suite, evaluation report, and interactive playground.
What you can build
Practical applications — from classroom demos to production systems.
Language Detection
Score text against multiple language models to identify the source language.
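The idea can be sketched with plain character-trigram models. Everything below is a minimal, self-contained illustration: the tiny corpora and the helper names (`train`, `score`, `detect`) are ours, not the wikilangs API, and real models would be trained on full Wikipedia text.

```python
import math
from collections import Counter

def char_trigrams(text):
    """Overlapping character trigrams of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def train(corpus):
    """Trigram relative frequencies for one language."""
    counts = Counter(char_trigrams(corpus))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def score(text, model, floor=1e-6):
    """Average log-probability of the text under a trigram model."""
    grams = char_trigrams(text)
    return sum(math.log(model.get(g, floor)) for g in grams) / len(grams)

def detect(text, models):
    """Return the language whose model scores the text highest."""
    return max(models, key=lambda lang: score(text, models[lang]))

models = {
    "en": train("the quick brown fox jumps over the lazy dog and the cat"),
    "de": train("der schnelle braune fuchs springt über den faulen hund"),
}
print(detect("the dog and the fox", models))  # likely "en"
```

The same shape works with wikilangs n-gram scoring in place of the toy `score` function: score the input under each language's model and take the argmax.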
Autocomplete
Build text completion with prefix matching and n-gram predictions.
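A rough sketch of the mechanics, assuming nothing beyond the standard library: rank candidate words by how often they follow the previous word, then filter by prefix. The `Autocomplete` class and its corpus are illustrative, not part of wikilangs.

```python
from collections import Counter, defaultdict

class Autocomplete:
    """Prefix completion backed by unigram counts and bigram context."""

    def __init__(self, corpus):
        words = corpus.split()
        self.unigrams = Counter(words)
        self.bigrams = defaultdict(Counter)
        for prev, cur in zip(words, words[1:]):
            self.bigrams[prev][cur] += 1

    def complete(self, prev_word, prefix, k=3):
        """Rank completions of `prefix`, preferring words seen after `prev_word`."""
        candidates = self.bigrams.get(prev_word, self.unigrams)
        ranked = [w for w, _ in candidates.most_common() if w.startswith(prefix)]
        if not ranked:  # fall back to corpus-wide word frequency
            ranked = [w for w, _ in self.unigrams.most_common() if w.startswith(prefix)]
        return ranked[:k]

ac = Autocomplete("natural language processing makes language models useful")
print(ac.complete("language", "p"))  # ['processing']
```

With wikilangs, the vocabulary's prefix search would supply the candidates and the n-gram model would supply the ranking.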
Text Similarity
Measure semantic similarity for search, deduplication, and FAQ matching.
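One common recipe: average the word embeddings of each text and compare the results with cosine similarity. The vectors below are made-up toy values standing in for real embeddings; only the cosine arithmetic is the point.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sentence_vector(text, vecs):
    """Mean of the word vectors: a simple sentence representation."""
    dim = len(next(iter(vecs.values())))
    total, n = [0.0] * dim, 0
    for w in text.split():
        if w in vecs:
            total = [t + x for t, x in zip(total, vecs[w])]
            n += 1
    return [t / n for t in total]

# Hypothetical 3-d vectors; real embeddings would come from a trained model.
vecs = {
    "dog": [0.9, 0.1, 0.0], "puppy": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9], "engine": [0.1, 0.0, 0.8],
}
a = sentence_vector("dog puppy", vecs)
b = sentence_vector("car engine", vecs)
print(cosine(a, b))  # low: the two texts are about different things
```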
Extend LLM Vocabulary
Add language support to models like LLaMA for low-resource languages.
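Conceptually, extending a model's vocabulary means appending new token ids and growing the embedding table in lockstep. The schematic below uses plain Python lists and a dict; the function name and shapes are hypothetical and stand in for what a real tokenizer/model pair would do.

```python
import random

def extend_vocab(vocab, embedding_table, new_tokens, dim, seed=0):
    """Append unseen tokens to the vocab and grow the embedding table to match.

    New rows start as small random vectors; in practice they would be
    fine-tuned on target-language text afterwards.
    """
    rng = random.Random(seed)
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            embedding_table.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embedding_table

vocab = {"<s>": 0, "hello": 1}
table = [[0.0] * 4 for _ in vocab]
extend_vocab(vocab, table, ["salam", "hello", "dar"], dim=4)
print(len(vocab), len(table))  # vocab and embedding table stay aligned
```

The key invariant is that `len(vocab) == len(embedding_table)` before and after the merge; duplicates are skipped so existing ids never shift.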
Code-Switching Detection
Detect when text switches between languages — Spanglish, Hinglish, and more.
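A per-token variant of language detection gets at the idea: score each word under a model per language and label it with the winner. The two-sentence corpora here are illustrative; long-enough tokens like "quiero" and "tonight" are labeled reliably, while very short ones are genuinely ambiguous.

```python
import math
from collections import Counter

def trigram_model(corpus):
    """Character-trigram relative frequencies for one language."""
    counts = Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def token_lang(token, models, floor=1e-6):
    """Label one token with the language whose model scores it highest."""
    grams = [token[i:i + 3] for i in range(len(token) - 2)] or [token]
    def score(model):
        return sum(math.log(model.get(g, floor)) for g in grams) / len(grams)
    return max(models, key=lambda lang: score(models[lang]))

# Tiny illustrative corpora; real models would be trained on Wikipedia text.
models = {
    "en": trigram_model("i want to go to the party with my friends tonight"),
    "es": trigram_model("quiero ir a la fiesta con mis amigos esta noche"),
}
sentence = "quiero go to the fiesta tonight"
print([(w, token_lang(w, models)) for w in sentence.split()])
```

Runs of same-label tokens then mark the monolingual stretches, and label changes mark the switch points.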
Anomaly Detection
Find gibberish, spam, and out-of-domain content using perplexity scoring.
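The mechanism can be shown with a character-bigram model: text that resembles the training corpus gets low per-character perplexity, gibberish gets high. The smoothing constant and the assumed 27-symbol alphabet (26 letters plus space) are our choices for this sketch.

```python
import math
from collections import Counter, defaultdict

def train(text):
    """Conditional character-bigram counts from a reference corpus."""
    model = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        model[a][b] += 1
    return model

def perplexity(text, model, alpha=1.0, vocab=27):
    """Per-character perplexity with add-alpha smoothing; high means anomalous."""
    logp = 0.0
    for a, b in zip(text, text[1:]):
        counts = model[a]
        p = (counts[b] + alpha) / (sum(counts.values()) + alpha * vocab)
        logp += math.log(p)
    return math.exp(-logp / (len(text) - 1))

ref = "the quick brown fox jumps over the lazy dog " * 20
model = train(ref)
print(perplexity("the lazy dog", model))   # low: fits the corpus
print(perplexity("xqzv jkqx wz", model))   # high: flagged as anomalous
```

In production, a threshold on this score (calibrated on held-out clean text) separates in-domain content from spam and gibberish.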
Start exploring
Install the package and load your first model in under a minute.