Quick Start

Get up and running with Wikilangs in minutes.

Installation

Install Wikilangs from PyPI:

pip install wikilangs

For embedding support, also install BabelVec:

pip install babelvec
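
To confirm the installation, a quick import check is enough (babelvec is only needed if you installed embedding support):

import wikilangs
import babelvec  # only required for the embeddings examples below

print("Imports succeeded")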

Basic Usage

Tokenization

Create a BPE tokenizer and tokenize text:

from wikilangs import tokenizer

# Create a tokenizer for English with 32k vocabulary
tok = tokenizer(date='latest', lang='en', vocab_size=32000)

# Tokenize text
text = "Natural language processing is fascinating."
tokens = tok.tokenize(text)
print(tokens)  # ['_Natural', '_language', '_processing', '_is', '_fasci', 'nating', '.']

# Encode to IDs
ids = tok.encode(text)
print(ids)  # [1234, 567, 890, 12, 345, 678, 9]

# Decode back to text
decoded = tok.decode(ids)
print(decoded)  # "Natural language processing is fascinating."
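
A quick way to see the compression a tokenizer gives you is to compare the number of token IDs with the number of characters. This is a small sketch built only on the calls above; the exact counts depend on the model:

text = "Natural language processing is fascinating."
ids = tok.encode(text)
print(f"{len(text)} characters -> {len(ids)} token IDs")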

Text Scoring with N-grams

Score how probable a piece of text is under an n-gram language model:

from wikilangs import ngram

# Create a 3-gram model for English
ng = ngram(date='latest', lang='en', gram_size=3)

# Score text (returns log probability)
score = ng.score("This is a natural sentence.")
print(f"Log probability: {score}")

# Predict next tokens
predictions = ng.predict_next("The quick brown", top_k=5)
for token, prob in predictions:
    print(f"  {token}: {prob:.4f}")

Text Generation with Markov Chains

Generate text using Markov chain models:

from wikilangs import markov

# Create a Markov chain with context depth 3
mc = markov(date='latest', lang='en', depth=3)

# Generate random text
text = mc.generate(length=50)
print(text)

# Generate with a seed phrase
text = mc.generate(length=50, seed=["The", "history", "of"])
print(text)

Vocabulary Lookup

Look up word frequency and statistics:

from wikilangs import vocabulary

# Create a vocabulary instance
vocab = vocabulary(date='latest', lang='en')

# Look up a word
info = vocab.lookup("language")
print(info)
# {'token': 'language', 'frequency': 123456, 'idf_score': 5.67, 'rank': 234}

# Get word frequency
freq = vocab.get_frequency("language")
print(f"Frequency: {freq}")

# Find words with a prefix
words = vocab.get_words_with_prefix("lang", top_k=10)
print(words)

Word Embeddings

Get word and sentence embeddings:

from wikilangs import embeddings

# Create embeddings (requires babelvec)
emb = embeddings(date='latest', lang='en', dimension=64)

# Get word vector
word_vec = emb.embed_word("language")
print(word_vec.shape)  # (64,)

# Get sentence vector with position encoding
sent_vec = emb.embed_sentence(
    "Language is a beautiful thing.",
    method='rope'  # or 'average', 'decay', 'sinusoidal'
)
print(sent_vec.shape)  # (64,)
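
The vectors behave like NumPy arrays (implied by the .shape attribute above), so standard vector math works on them. A sketch of cosine similarity between two word vectors; "speech" is just an illustrative second word:

import numpy as np

a = emb.embed_word("language")
b = emb.embed_word("speech")

# Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {cosine:.3f}")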

Discovering Languages

Find available languages and their metadata:

from wikilangs import languages, languages_with_metadata, get_language_info

# Get list of available language codes
langs = languages(date='latest')
print(f"Available: {len(langs)} languages")
print(langs[:10])  # ['aa', 'ab', 'ace', 'ady', 'af', ...]

# Get languages with ISO 639 metadata
lang_infos = languages_with_metadata(date='latest')
for info in lang_infos[:5]:
    print(f"{info.code}: {info.name}")

# Get info for a specific language
info = get_language_info('ary')
print(f"{info.name} ({info.code})")
print(f"  ISO 639-3: {info.alpha_3}")
print(f"  Type: {info.type}")

Model Variants

Vocabulary Sizes

Tokenizers are available in four vocabulary sizes (see the example after this list):

  • 8k - Smallest, lowest compression
  • 16k - Balanced for general use
  • 32k - Recommended for production
  • 64k - Highest compression, largest model
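
The size is selected with the vocab_size argument used in the tokenization example above; this sketch assumes the 8k-64k labels map to 8000, 16000, 32000, and 64000:

from wikilangs import tokenizer

# Same call as in Basic Usage, varying only the vocabulary size
tok_small = tokenizer(date='latest', lang='en', vocab_size=8000)
tok_large = tokenizer(date='latest', lang='en', vocab_size=64000)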

N-gram Sizes

Choose the context length to fit your needs (see the example after this list):

  • 2-gram - Fast, less context
  • 3-gram - Good balance (default)
  • 4-gram - More context, larger model
  • 5-gram - Most context, largest model
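
The context length is set with the gram_size argument, exactly as in the scoring example above:

from wikilangs import ngram

# Smaller context: faster, smaller on disk
ng_fast = ngram(date='latest', lang='en', gram_size=2)

# Larger context: better modeling, larger model
ng_big = ngram(date='latest', lang='en', gram_size=5)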

Word vs Subword

Both n-gram and Markov chain models come in word-level and subword-level variants; the n-gram form is shown below, with a Markov sketch after it:

from wikilangs import ngram

# Word-level (default)
ng_word = ngram(date='latest', lang='en', gram_size=3, variant='word')

# Subword-level (uses the tokenizer vocabulary)
ng_subword = ngram(date='latest', lang='en', gram_size=3, variant='subword')
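
The calls above cover n-grams; for Markov chains the same variant keyword is assumed here (it is not shown elsewhere in this guide, so treat this as a sketch):

from wikilangs import markov

# Assumed API: variant keyword mirrors the ngram constructor
mc_word = markov(date='latest', lang='en', depth=3, variant='word')
mc_subword = markov(date='latest', lang='en', depth=3, variant='subword')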