Installation
Install Wikilangs from PyPI:
pip install wikilangs
For embedding support, also install BabelVec:
pip install babelvec
Basic Usage
Tokenization
Create a BPE tokenizer and tokenize text:
from wikilangs import tokenizer
# Create a tokenizer for English with 32k vocabulary
tok = tokenizer(date='latest', lang='en', vocab_size=32000)
# Tokenize text
text = "Natural language processing is fascinating."
tokens = tok.tokenize(text)
print(tokens) # ['_Natural', '_language', '_processing', '_is', '_fasci', 'nating', '.']
# Encode to IDs
ids = tok.encode(text)
print(ids) # [1234, 567, 890, 12, 345, 678, 9]
# Decode back to text
decoded = tok.decode(ids)
print(decoded) # "Natural language processing is fascinating."
Text Scoring with N-grams
Score text probability using n-gram models:
from wikilangs import ngram
# Create a 3-gram model for English
ng = ngram(date='latest', lang='en', gram_size=3)
# Score text (returns log probability)
score = ng.score("This is a natural sentence.")
print(f"Log probability: {score}")
# Predict next tokens
predictions = ng.predict_next("The quick brown", top_k=5)
for token, prob in predictions:
print(f" {token}: {prob:.4f}") Text Generation with Markov Chains
Generate text using Markov chain models:
from wikilangs import markov
# Create a Markov chain with context depth 3
mc = markov(date='latest', lang='en', depth=3)
# Generate random text
text = mc.generate(length=50)
print(text)
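Generation is stochastic, so drawing several samples gives a feel for the output variety. A minimal sketch using only the generate() call shown above:
# Draw a few independent samples from the same model
for i in range(3):
    print(f"Sample {i + 1}: {mc.generate(length=20)}")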
# Generate with a seed phrase
text = mc.generate(length=50, seed=["The", "history", "of"])
print(text)
Vocabulary Lookup
Look up word frequency and statistics:
from wikilangs import vocabulary
# Create a vocabulary instance
vocab = vocabulary(date='latest', lang='en')
# Look up a word
info = vocab.lookup("language")
print(info)
# {'token': 'language', 'frequency': 123456, 'idf_score': 5.67, 'rank': 234}
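The idf_score field can be used to pick out distinctive words. A minimal sketch (not part of the library's examples) that ranks a few candidates by IDF, assuming each candidate is present in the vocabulary:
# Higher IDF means rarer, more distinctive words
candidates = ["the", "language", "phonology"]
ranked = sorted(candidates, key=lambda w: vocab.lookup(w)['idf_score'], reverse=True)
print(ranked) # rarest word first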
# Get word frequency
freq = vocab.get_frequency("language")
print(f"Frequency: {freq}")
# Find words with a prefix
words = vocab.get_words_with_prefix("lang", top_k=10)
print(words)
Word Embeddings
Get word and sentence embeddings:
from wikilangs import embeddings
# Create embeddings (requires babelvec)
emb = embeddings(date='latest', lang='en', dimension=64)
# Get word vector
word_vec = emb.embed_word("language")
print(word_vec.shape) # (64,)
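Word vectors can be compared directly. A minimal sketch using NumPy for cosine similarity, assuming embed_word() returns a NumPy array (the shape check above suggests it does):
import numpy as np

# Cosine similarity between two word vectors
v1 = emb.embed_word("language")
v2 = emb.embed_word("speech")
cosine = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(f"Cosine similarity: {cosine:.3f}")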
# Get sentence vector with position encoding
sent_vec = emb.embed_sentence(
"Language is a beautiful thing.",
method='rope' # or 'average', 'decay', 'sinusoidal'
)
print(sent_vec.shape) # (64,)
Discovering Languages
Find available languages and their metadata:
from wikilangs import languages, languages_with_metadata, get_language_info
# Get list of available language codes
langs = languages(date='latest')
print(f"Available: {len(langs)} languages")
print(langs[:10]) # ['aa', 'ab', 'ace', 'ady', 'af', ...]
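Discovered codes plug straight into the other constructors. A minimal sketch (not part of the library's examples) that builds a small tokenizer for the first few languages, assuming the 8k variant is requested as vocab_size=8000:
from wikilangs import tokenizer

# Build an 8k tokenizer for a few discovered languages
for code in langs[:3]:
    tok = tokenizer(date='latest', lang=code, vocab_size=8000)
    print(code, len(tok.encode("Wikipedia")))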
# Get languages with ISO 639 metadata
lang_infos = languages_with_metadata(date='latest')
for info in lang_infos[:5]:
print(f"{info.code}: {info.name}")
# Get info for a specific language
info = get_language_info('ary')
print(f"{info.name} ({info.code})")
print(f" ISO 639-3: {info.alpha_3}")
print(f" Type: {info.type}") Model Variants
Vocabulary Sizes
Tokenizers come in 4 sizes (compared in the sketch after this list):
- 8k - Smallest, lowest compression
- 16k - Balanced for general use
- 32k - Recommended for production
- 64k - Highest compression, largest model
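A minimal sketch comparing compression across the four sizes, assuming each size is requested through the vocab_size argument shown earlier:
from wikilangs import tokenizer

# Fewer tokens for the same text means higher compression
sample = "Natural language processing is fascinating."
for size in (8000, 16000, 32000, 64000):
    tok = tokenizer(date='latest', lang='en', vocab_size=size)
    print(size, len(tok.encode(sample)))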
N-gram Sizes
Choose context length based on your needs (see the sketch after this list):
- 2-gram - Fast, less context
- 3-gram - Good balance (default)
- 4-gram - More context, larger model
- 5-gram - Most context, largest model
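A minimal sketch scoring the same sentence at each context length, using the ngram constructor and score() call shown earlier:
from wikilangs import ngram

# Larger n captures more context but loads a larger model
sentence = "This is a natural sentence."
for n in (2, 3, 4, 5):
    model = ngram(date='latest', lang='en', gram_size=n)
    print(n, model.score(sentence))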
Word vs Subword
Both n-grams and Markov chains have word and subword variants:
# Word-level (default)
ng_word = ngram('latest', 'en', gram_size=3, variant='word')
# Subword-level (uses tokenizer vocabulary)
ng_subword = ngram('latest', 'en', gram_size=3, variant='subword')
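For Markov chains, the same keyword is assumed to select the variant; a minimal, unverified sketch:
from wikilangs import markov

# Assumed: the 'variant' keyword also applies to the markov constructor
mc_subword = markov(date='latest', lang='en', depth=3, variant='subword')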