360+ languages · 318 with trained models
An open atlas of
language models
Pre-trained tokenizers, n-gram models, Markov chains, vocabularies, and embeddings for every Wikipedia language. Lightweight, research-ready, and built to run anywhere.
pip install wikilangs

Why traditional models still matter
Large language models are powerful but expensive, opaque, and skewed toward well-resourced languages. Wikilangs takes a different path: lightweight models you can inspect, extend, and run on a laptop — for every language with a Wikipedia.
No GPU required
N-grams, Markov chains, and BPE tokenizers run instantly on CPU. Deploy anywhere — a server, a classroom, a Raspberry Pi.
Interpretable
Every prediction has a traceable path. Compression ratios, perplexity, entropy — metrics you can reason about and publish.
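These metrics are ordinary information-theoretic quantities you can compute by hand. As a minimal sketch (pure Python, independent of the wikilangs API), entropy and perplexity fall out of a token probability distribution in a few lines:

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(token_probs):
    """Perplexity = 2 ** (average negative log2-probability per token)."""
    avg_nll = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_nll

# A uniform 4-way distribution carries exactly 2 bits of entropy,
# and a model that assigns p=0.5 to every token has perplexity 2.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(perplexity([0.5, 0.5]))             # 2.0
```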
Extensible
Merge our vocabularies into LLMs to add language support. Use embeddings as features. Combine models for novel tasks.
Research-ready
Every language ships with ablation studies, visualizations, and a model card. Cite with a DOI. Reproduce with a date stamp.
Five model families, one API
Every language gets the same comprehensive treatment. Install once, switch languages with a single parameter.
BPE tokenizers in SentencePiece format. HuggingFace compatible.
from wikilangs import tokenizer
tok = tokenizer('latest', 'en', 32000)
tok.tokenize("Hello, world!")

Language models for text scoring and next-token prediction.
from wikilangs import ngram
ng = ngram('latest', 'en', gram_size=3)
ng.score("Natural language processing")

Text generation with configurable context windows.
from wikilangs import markov
mc = markov('latest', 'en', depth=3)
mc.generate(length=50)

Word dictionaries with corpus statistics and prefix search.
from wikilangs import vocabulary
vocab = vocabulary('latest', 'en')
vocab.lookup("language")

Position-aware word embeddings via BabelVec.
from wikilangs import embeddings
emb = embeddings('latest', 'en', dimension=64)
emb.embed_word("language")

Extend large language models with new language support.
from wikilangs.llm import add_language_tokens
add_language_tokens(model, 'ary', 32000)

Explore languages
A sample from across the atlas. Each language has its own model suite, evaluation report, and interactive playground.
Amharic
Basque
Russia Buriat
Gun
Latin
Kalaallisut
Komering
Southern Sotho
What you can build
Practical applications — from classroom demos to production systems.
Language Detection
Score text against multiple language models to identify the source language.
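The idea can be sketched without any library at all: build a character n-gram profile per language and pick the profile that best matches the input. Everything below (the corpora, the scoring function) is an illustrative toy, not the wikilangs scoring API:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(corpus, n=3):
    """Character n-gram frequency profile from a sample text."""
    return Counter(char_ngrams(corpus.lower(), n))

def score(text, profile, n=3):
    """Sum of profile counts for the text's n-grams; higher = better match."""
    return sum(profile[g] for g in char_ngrams(text.lower(), n))

# Tiny stand-in corpora; real profiles would come from full Wikipedia dumps.
profiles = {
    "en": train("the quick brown fox jumps over the lazy dog the end"),
    "es": train("el rapido zorro marron salta sobre el perro perezoso"),
}

def detect(text):
    """Return the language whose profile scores the text highest."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))

print(detect("the dog"))   # en
print(detect("el perro"))  # es
```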
Autocomplete
Build text completion with prefix matching and n-gram predictions.
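A rough shape of both halves, using only the standard library (the word list and bigram counts here are made-up placeholders for a real vocabulary and n-gram model): binary search over a sorted vocabulary handles prefix matching, and bigram counts handle next-word prediction.

```python
from bisect import bisect_left
from collections import Counter, defaultdict

# Hypothetical mini-vocabulary, kept sorted for prefix search.
words = sorted(["language", "languages", "large", "model", "models", "modern"])

def complete(prefix, limit=3):
    """Up to `limit` vocabulary words starting with `prefix`."""
    i = bisect_left(words, prefix)
    out = []
    while i < len(words) and words[i].startswith(prefix) and len(out) < limit:
        out.append(words[i])
        i += 1
    return out

# Hypothetical bigram counts standing in for a trained n-gram model.
bigrams = defaultdict(Counter)
for prev, nxt in [("language", "model"), ("language", "models"), ("large", "model")]:
    bigrams[prev][nxt] += 1

def next_word(prev):
    """Most frequent continuation of `prev`, or None if unseen."""
    counts = bigrams[prev]
    return counts.most_common(1)[0][0] if counts else None

print(complete("lang"))      # ['language', 'languages']
print(next_word("large"))    # model
```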
Text Similarity
Measure semantic similarity for search, deduplication, and FAQ matching.
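With word embeddings in hand, similarity typically reduces to cosine similarity between vectors. A self-contained sketch (the vectors here are invented for illustration; real ones would come from an embedding model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0  (identical direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0  (orthogonal)
```

For deduplication or FAQ matching, sentence vectors are often built by averaging word vectors before comparing them the same way.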
Extend LLM Vocabulary
Add language support to models like LLaMA for low-resource languages.
Code-Switching Detection
Detect when text switches between languages — Spanglish, Hinglish, and more.
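One simple framing: tag each word with the language whose vocabulary contains it, then flag texts where more than one language appears. The word lists below are tiny hand-picked stand-ins for real per-language vocabularies:

```python
# Toy vocabularies; real ones would come from per-language word lists.
EN = set("the and is of to you what doing".split())
ES = set("el la es de que tu estas haciendo".split())

def tag_words(text):
    """Tag each word 'en', 'es', or '?' based on vocabulary membership."""
    tags = []
    for w in text.lower().split():
        if w in EN:
            tags.append("en")
        elif w in ES:
            tags.append("es")
        else:
            tags.append("?")
    return tags

def has_switch(text):
    """True if recognizable words come from more than one language."""
    known = [t for t in tag_words(text) if t != "?"]
    return len(set(known)) > 1

print(has_switch("what estas haciendo"))  # True
print(has_switch("the dog"))              # False
```

A production version would score each word (or sliding window) against full language models rather than membership in a word set, but the per-token tagging structure is the same.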
Anomaly Detection
Find gibberish, spam, and out-of-domain content using perplexity scoring.
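The mechanism is that in-domain text gets low perplexity under a model trained on normal text, while gibberish gets high perplexity. A minimal character-bigram version with add-alpha smoothing (all names and the toy corpus are illustrative, not the wikilangs API):

```python
import math
from collections import Counter

def train_bigram(text):
    """Character-bigram counts plus left-character counts for conditionals."""
    pairs = Counter(zip(text, text[1:]))
    chars = Counter(text[:-1])
    return pairs, chars

def perplexity(text, model, alpha=1.0, vocab=27):
    """Add-alpha smoothed bigram perplexity; higher = more anomalous."""
    pairs, chars = model
    nll = 0.0
    for a, b in zip(text, text[1:]):
        p = (pairs[(a, b)] + alpha) / (chars[a] + alpha * vocab)
        nll -= math.log2(p)
    return 2 ** (nll / max(len(text) - 1, 1))

model = train_bigram("the cat sat on the mat and the dog ran to the cat")
# Familiar text scores far lower than keyboard-mash gibberish.
print(perplexity("the cat", model) < perplexity("zqxvkj", model))  # True
```

Thresholding the score (e.g. flagging anything above a percentile measured on clean text) turns this into a spam or out-of-domain filter.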
Start exploring
Install the package and load your first model in under a minute.