azb

South Azerbaijani تۆرکجه

ISO 639-3: azb I L

Open in Playground HuggingFace Wikipedia

0 Words in Vocabulary

Most Common Words

The 20 most frequently used words in South Azerbaijani Wikipedia.

Try in Playground

Explore South Azerbaijani interactively with our browser-based demos.

Generate Text Watch Markov chains weave words together Language DNA Explore character frequencies and word patterns Word Galaxy See how words connect in semantic space Zipf's Law Discover the power law of word frequencies Rare Words Explore the longest and rarest words All Demos Open the full playground for South Azerbaijani

Performance Dashboard

Key metrics for all model types at a glance.

Quick Start

Tokenizer from wikilangs import tokenizer

tok = tokenizer(date='latest', lang='azb', vocab_size=32000)
tokens = tok.tokenize("Your text here")
print(tokens) 

N-gram Model from wikilangs import ngram

ng = ngram(date='latest', lang='azb', gram_size=3)
score = ng.score("Your text here")
predictions = ng.predict_next("Start of", top_k=5) 

Markov Chain from wikilangs import markov

mc = markov(date='latest', lang='azb', depth=3)
text = mc.generate(length=50)
print(text) 

Vocabulary from wikilangs import vocabulary

vocab = vocabulary(date='latest', lang='azb')
info = vocab.lookup("word")
print(info)  # frequency, IDF, rank 

Embeddings from wikilangs import embeddings

emb = embeddings(date='latest', lang='azb', dimension=64)
vec = emb.embed_word("word")
sent_vec = emb.embed_sentence("A sentence", method='rope') 

Available Models

Model Type	Variants	Description
Tokenizers	8k, 16k, 32k, 64k	BPE tokenizers with different vocabulary sizes
N-gram (Word)	2, 3, 4, 5-gram	Word-level language models
N-gram (Subword)	2, 3, 4, 5-gram	Subword-level language models
Markov (Word)	Depth 1-5	Word-level text generation
Markov (Subword)	Depth 1-5	Subword-level text generation
Vocabulary	—	Word dictionary with frequency and IDF
Embeddings	32d, 64d, 128d	Position-aware word embeddings

Model Evaluation

Tokenizer Performance

Compression ratios and token statistics across vocabulary sizes. Higher compression means fewer tokens for the same text.

N-gram Model Evaluation

Perplexity and entropy metrics across n-gram sizes. Lower perplexity indicates better predictive performance.

Markov Chain Evaluation

Entropy and branching factor by context depth. Lower entropy means more predictable text generation.

Vocabulary Analysis

Word frequency distribution and Zipf's law analysis.

Embeddings Evaluation

Isotropy and vector space quality metrics. Higher isotropy indicates more uniformly distributed embeddings.

Key Metrics

Best Compression 4.15x Characters per token (higher is better)

Best Isotropy 0.8266 Embedding uniformity (higher is better)

Vocabulary Size 0 Unique words in training data

Full Research Report

Access the complete ablation study with all metrics, visualizations, and generated text samples on HuggingFace.

View Full Report on HuggingFace