360+ languages · 318 with trained models

An open atlas of
language models

Pre-trained tokenizers, n-gram models, Markov chains, vocabularies, and embeddings for every Wikipedia language. Lightweight, research-ready, and built to run anywhere.

pip install wikilangs
68,936,615 Total Words
4 Tokenizer Sizes
5 N-gram Depths
5 Markov Depths
3 Embedding Dims

Why traditional models still matter

Large language models are powerful but expensive, opaque, and skewed toward well-resourced languages. Wikilangs takes a different path: lightweight models you can inspect, extend, and run on a laptop — for every language with a Wikipedia.

No GPU required

N-grams, Markov chains, and BPE tokenizers run instantly on CPU. Deploy anywhere — a server, a classroom, a Raspberry Pi.

Interpretable

Every prediction has a traceable path. Compression ratios, perplexity, entropy — metrics you can reason about and publish.

Extensible

Merge our vocabularies into LLMs to add language support. Use embeddings as features. Combine models for novel tasks.

Research-ready

Every language ships with ablation studies, visualizations, and a model card. Cite with a DOI. Reproduce with a date stamp.

Five model families, one API

Every language gets the same comprehensive treatment. Install once, switch languages with a single parameter.

Tokenizers 8k · 16k · 32k · 64k

BPE tokenizers in SentencePiece format. HuggingFace compatible.

from wikilangs import tokenizer
tok = tokenizer('latest', 'en', 32000)
tok.tokenize("Hello, world!")
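For intuition about what these tokenizers learn, here is a toy sketch of a single byte-pair-encoding merge step in plain Python. This is illustrative only, not the wikilangs implementation; the corpus and symbols are invented:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words as character tuples, with frequencies.
words = {('l', 'o', 'w'): 5, ('l', 'o', 'g'): 4, ('n', 'e', 'w'): 3}
pair = most_frequent_pair(words)   # ('l', 'o') is the most frequent pair
words = merge_pair(words, pair)    # 'l' + 'o' becomes one symbol 'lo'
```

A real BPE tokenizer simply repeats this step until the vocabulary reaches the target size (8k, 16k, 32k, or 64k here).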

N-gram Models 2 · 3 · 4 · 5-gram

Language models for text scoring and next-token prediction.

from wikilangs import ngram
ng = ngram('latest', 'en', gram_size=3)
ng.score("Natural language processing")
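Under the hood, n-gram scoring is just a sum of conditional log-probabilities. A toy bigram scorer with add-one smoothing, illustrative only and not the wikilangs implementation:

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def bigram_logprob(sentence):
    """Sum of log P(w_i | w_{i-1}) with add-one (Laplace) smoothing."""
    words = sentence.split()
    score = 0.0
    for prev, cur in zip(words, words[1:]):
        score += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return score

likely = bigram_logprob("the cat sat")    # word pairs seen in the corpus
unlikely = bigram_logprob("mat the ran")  # mostly unseen pairs
```

Sentences whose word pairs were seen in training score higher, which is what makes these models useful for ranking and filtering text.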

Markov Chains depth 1 – 5

Text generation with configurable context windows.

from wikilangs import markov
mc = markov('latest', 'en', depth=3)
mc.generate(length=50)
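A Markov chain of depth d samples each next word from the successors observed after the previous d words. A minimal depth-2 sketch in plain Python, illustrative only:

```python
import random
from collections import defaultdict

text = "the cat sat on the mat and the cat ran to the mat".split()
depth = 2

# Transition table: (w_{i-2}, w_{i-1}) -> list of observed next words.
table = defaultdict(list)
for i in range(len(text) - depth):
    table[tuple(text[i:i + depth])].append(text[i + depth])

def generate(seed, length, rng):
    """Walk the chain, sampling a successor of the current context each step."""
    out = list(seed)
    for _ in range(length):
        successors = table.get(tuple(out[-depth:]))
        if not successors:  # dead end: context never seen in training
            break
        out.append(rng.choice(successors))
    return " ".join(out)

sample = generate(("the", "cat"), 8, random.Random(0))
```

Larger depths produce more fluent text but need more training data, which is why the package ships depths 1 through 5.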

Vocabularies frequency · IDF · rank

Word dictionaries with corpus statistics and prefix search.

from wikilangs import vocabulary
vocab = vocabulary('latest', 'en')
vocab.lookup("language")
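Of the corpus statistics, IDF (inverse document frequency) rewards words that appear in few documents. A toy version of the computation, with an invented three-document corpus, not the shipped code:

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog ran",
    "a rare axolotl",
]

def idf(term):
    """Inverse document frequency: rarer words score higher."""
    df = sum(term in doc.split() for doc in docs)
    return math.log(len(docs) / df) if df else float("inf")

common = idf("the")      # appears in 2 of 3 documents -> low score
rare = idf("axolotl")    # appears in 1 of 3 documents -> high score
```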

Embeddings 32d · 64d · 128d

Position-aware word embeddings via BabelVec.

from wikilangs import embeddings
emb = embeddings('latest', 'en', dimension=64)
emb.embed_word("language")
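Once words are vectors, relatedness is typically measured with cosine similarity. A self-contained sketch using made-up 3-dimensional vectors (real wikilangs embeddings are 32, 64, or 128 dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Invented vectors: two related words and one unrelated word.
king = [0.9, 0.1, 0.4]
queen = [0.8, 0.2, 0.5]
banana = [0.1, 0.9, 0.0]

related = cosine(king, queen)
unrelated = cosine(king, banana)
```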

LLM Integration vocab merging

Extend large language models with new language support.

from wikilangs.llm import add_language_tokens
add_language_tokens(model, 'ary', 32000)
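Vocabulary merging amounts to appending unseen tokens at the end of a base tokenizer's vocabulary so existing token ids stay stable. A hypothetical sketch; the tokens, ids, and dict layout here are invented for illustration:

```python
# Base model's token -> id mapping (invented).
base_vocab = {"hello": 0, "world": 1, "##ing": 2}

# Candidate tokens for a new language (invented examples).
new_tokens = ["سلام", "world", "دنيا"]

# Append only tokens the base model is missing, assigning fresh ids
# at the end so existing ids (and embedding rows) remain valid.
for tok in new_tokens:
    if tok not in base_vocab:
        base_vocab[tok] = len(base_vocab)
```

With Hugging Face models, the matching final step is a call to `model.resize_token_embeddings(...)` so the embedding matrix gains rows for the newly assigned ids.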

Start exploring

Install the package and load your first model in under a minute.