Installation
Install Wikilangs from PyPI:
pip install wikilangs
For embedding support, also install BabelVec:
pip install babelvec
Basic Usage
Tokenization
Create a BPE tokenizer and tokenize text:
from wikilangs import tokenizer
# Create a tokenizer for English with 32k vocabulary
tok = tokenizer(date='latest', lang='en', vocab_size=32000)
# Tokenize text
text = "Natural language processing is fascinating."
tokens = tok.tokenize(text)
print(tokens) # ['_Natural', '_language', '_processing', '_is', '_fasci', 'nating', '.']
# Encode to IDs
ids = tok.encode(text)
print(ids) # [1234, 567, 890, 12, 345, 678, 9]
# Decode back to text
decoded = tok.decode(ids)
print(decoded) # "Natural language processing is fascinating."
Text Scoring with N-grams
Score text probability using n-gram models:
from wikilangs import ngram
# Create a 3-gram model for English
ng = ngram(date='latest', lang='en', gram_size=3)
# Score text (returns log probability)
score = ng.score("This is a natural sentence.")
print(f"Log probability: {score}")
# Predict next tokens
predictions = ng.predict_next("The quick brown", top_k=5)
for token, prob in predictions:
print(f" {token}: {prob:.4f}") Text Generation with Markov Chains
Generate text using Markov chain models:
from wikilangs import markov
# Create a Markov chain with context depth 3
mc = markov(date='latest', lang='en', depth=3)
# Generate random text
text = mc.generate(length=50)
print(text)
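Generation is stochastic, so drawing several samples gives a feel for the output variety. A minimal sketch using only the generate() call shown above:
# Draw a few independent samples from the same model
for i in range(3):
    print(f"Sample {i + 1}: {mc.generate(length=20)}")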
# Generate with a seed phrase
text = mc.generate(length=50, seed=["The", "history", "of"])
print(text)
Vocabulary Lookup
Look up word frequency and statistics:
from wikilangs import vocabulary
# Create a vocabulary instance
vocab = vocabulary(date='latest', lang='en')
# Look up a word
info = vocab.lookup("language")
print(info)
# {'token': 'language', 'frequency': 123456, 'idf_score': 5.67, 'rank': 234}
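The idf_score field can be used to pick out distinctive words. A minimal sketch (not part of the library's examples) that ranks a few candidates by IDF, assuming each candidate is present in the vocabulary:
# Higher IDF means rarer, more distinctive words
candidates = ["the", "language", "phonology"]
ranked = sorted(candidates, key=lambda w: vocab.lookup(w)['idf_score'], reverse=True)
print(ranked) # rarest word first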
# Get word frequency
freq = vocab.get_frequency("language")
print(f"Frequency: {freq}")
# Find words with a prefix
words = vocab.get_words_with_prefix("lang", top_k=10)
print(words)
Word Embeddings
Get word and sentence embeddings:
from wikilangs import embeddings
# Create embeddings (requires babelvec)
emb = embeddings(date='latest', lang='en', dimension=64)
# Get word vector
word_vec = emb.embed_word("language")
print(word_vec.shape) # (64,)
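Word vectors can be compared directly. A minimal sketch using NumPy for cosine similarity, assuming embed_word() returns a NumPy array (the shape check above suggests it does):
import numpy as np

# Cosine similarity between two word vectors
v1 = emb.embed_word("language")
v2 = emb.embed_word("speech")
cosine = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(f"Cosine similarity: {cosine:.3f}")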
# Get sentence vector with position encoding
sent_vec = emb.embed_sentence(
"Language is a beautiful thing.",
method='rope' # or 'average', 'decay', 'sinusoidal'
)
print(sent_vec.shape) # (64,)
Discovering Languages
Find available languages and their metadata:
from wikilangs import languages, languages_with_metadata, get_language_info
# Get list of available language codes
langs = languages(date='latest')
print(f"Available: {len(langs)} languages")
print(langs[:10]) # ['aa', 'ab', 'ace', 'ady', 'af', ...]
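Discovered codes plug straight into the other constructors. A minimal sketch (not part of the library's examples) that builds a small tokenizer for the first few languages, assuming the 8k variant is requested as vocab_size=8000:
from wikilangs import tokenizer

# Build an 8k tokenizer for a few discovered languages
for code in langs[:3]:
    tok = tokenizer(date='latest', lang=code, vocab_size=8000)
    print(code, len(tok.encode("Wikipedia")))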
# Get languages with ISO 639 metadata
lang_infos = languages_with_metadata(date='latest')
for info in lang_infos[:5]:
print(f"{info.code}: {info.name}")
# Get info for a specific language
info = get_language_info('ary')
print(f"{info.name} ({info.code})")
print(f" ISO 639-3: {info.alpha_3}")
print(f" Type: {info.type}") Model Variants
Vocabulary Sizes
Tokenizers come in 4 sizes (compared in the sketch after this list):
- 8k - Smallest, lowest compression
- 16k - Balanced for general use
- 32k - Recommended for production
- 64k - Highest compression, largest model
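A minimal sketch comparing compression across the four sizes, assuming each size is requested through the vocab_size argument shown earlier:
from wikilangs import tokenizer

# Fewer tokens for the same text means higher compression
sample = "Natural language processing is fascinating."
for size in (8000, 16000, 32000, 64000):
    tok = tokenizer(date='latest', lang='en', vocab_size=size)
    print(size, len(tok.encode(sample)))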
N-gram Sizes
Choose context length based on your needs (see the sketch after this list):
- 2-gram - Fast, less context
- 3-gram - Good balance (default)
- 4-gram - More context, larger model
- 5-gram - Most context, largest model
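A minimal sketch scoring the same sentence at each context length, using the ngram constructor and score() call shown earlier:
from wikilangs import ngram

# Larger n captures more context but loads a larger model
sentence = "This is a natural sentence."
for n in (2, 3, 4, 5):
    model = ngram(date='latest', lang='en', gram_size=n)
    print(n, model.score(sentence))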
Word vs Subword
Both n-grams and Markov chains have word and subword variants:
# Word-level (default)
ng_word = ngram('latest', 'en', gram_size=3, variant='word')
# Subword-level (uses tokenizer vocabulary)
ng_subword = ngram('latest', 'en', gram_size=3, variant='subword')
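For Markov chains, the same keyword is assumed to select the variant; a minimal, unverified sketch:
from wikilangs import markov

# Assumed: the 'variant' keyword also applies to the markov constructor
mc_subword = markov(date='latest', lang='en', depth=3, variant='subword')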