Research & Methodology

Wikilangs produces reproducible NLP models for 340+ languages from Wikipedia data. This page documents the full pipeline, evaluation metrics, and interpretation guidelines.

Pipeline

Each language passes through five stages, from raw Wikipedia dump to published, documented model artifacts on HuggingFace.

Collection

Monthly Wikipedia snapshots via wikipedia-monthly across 340+ languages.


Processing

Markup removal and normalization, with scripts and diacritics preserved so that features important to each language are retained.


Training

BPE tokenizers (8k–64k vocab), n-gram models (n = 2–5), Markov chains (context 1–4), word embeddings (32–128 dimensions).


Evaluation

Comprehensive metrics on held-out test data with ablation studies comparing all hyperparameter variants.


Publishing

Models, vocabularies, and evaluation reports published to HuggingFace with model cards.

Reproducibility: Monthly snapshots, versioned datasets, public training scripts.
Train/Test Split: All metrics computed on held-out data, never training data.
Ablation Studies: Systematic hyperparameter variation per language per model family.
Open Access: All code, models, and results freely available for research use.

Metrics Reference

Select a model family to see the metrics used for evaluation and how to interpret them.

Compression Ratio

Definition
The ratio of characters to tokens (chars/token). Measures how efficiently the tokenizer represents text.
Intuition
Higher compression means fewer tokens needed to represent the same text, reducing sequence lengths for downstream models. A 3× compression means ~3 characters per token on average.
What to seek
Higher is generally better for efficiency, but extremely high compression may indicate overly aggressive merging that loses morphological information.
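
As a minimal sketch, compression ratio can be computed directly from a text and its tokenization (the function name and example tokens below are illustrative, not from the Wikilangs codebase):

```python
def compression_ratio(text: str, tokens: list[str]) -> float:
    """Characters per token: higher means a more compact tokenization."""
    if not tokens:
        return 0.0
    return len(text) / len(tokens)

# Hypothetical subword split of a 20-character word into 4 tokens.
ratio = compression_ratio("internationalization",
                          ["intern", "ation", "al", "ization"])  # 5.0
```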

Average Token Length (Fertility)

Definition
Mean number of characters per token produced by the tokenizer.
Intuition
Longer tokens capture more context but may struggle with rare words; shorter tokens are more flexible but increase sequence length.
What to seek
Aim for a balance between 2 and 5 characters for most languages. Arabic and other morphologically rich languages may benefit from slightly longer tokens.
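
A sketch of the computation, assuming a plain list of token strings (helper name is illustrative):

```python
def average_token_length(tokens: list[str]) -> float:
    """Mean number of characters per token."""
    return sum(len(t) for t in tokens) / len(tokens)

avg = average_token_length(["intern", "ation", "al", "ization"])  # 5.0
```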

Unknown Token Rate (OOV Rate)

Definition
Percentage of tokens that map to the unknown/UNK token.
Intuition
Lower OOV means better vocabulary coverage. High OOV indicates the tokenizer encounters many unseen character sequences.
What to seek
Below 1% is excellent; below 5% is acceptable. BPE tokenizers typically achieve very low OOV due to subword fallback.
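
A minimal sketch of how the rate could be measured against a vocabulary (names and example data are illustrative):

```python
def unk_rate(tokens: list[str], vocab: set[str]) -> float:
    """Percentage of tokens that fall outside the vocabulary."""
    unknown = sum(1 for t in tokens if t not in vocab)
    return 100.0 * unknown / len(tokens)

# One of four tokens is out-of-vocabulary, so the rate is 25%.
rate = unk_rate(["the", "cat", "sat", "purred"], {"the", "cat", "sat"})
```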

Perplexity

Definition
Measures how "surprised" the model is by test data. Mathematically: perplexity = 2^(cross-entropy). Lower values indicate better prediction.
Intuition
If perplexity is 100, the model is as uncertain as if choosing uniformly among 100 options at each step. A perplexity of 10 means effectively choosing among 10 equally likely options.
What to seek
Lower is better. Perplexity decreases with larger n-grams. Values vary widely by language and corpus size.
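
The definition above can be sketched from per-token log2 probabilities, whatever model produced them (the function and data here are illustrative, not the Wikilangs implementation):

```python
import math

def perplexity(log2_probs: list[float]) -> float:
    """Perplexity = 2 ** cross-entropy, where cross-entropy is the mean
    negative log2 probability the model assigned to each observed token."""
    cross_entropy = -sum(log2_probs) / len(log2_probs)
    return 2.0 ** cross_entropy

# A model assigning every token probability 0.1 is as uncertain as a
# uniform choice among 10 options, so its perplexity is 10.
pp = perplexity([math.log2(0.1)] * 5)
```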

Entropy

Definition
Average information content (in bits) needed to encode the next token given the context. Related to perplexity: perplexity = 2^entropy.
Intuition
High entropy means high uncertainty/randomness; low entropy means predictable patterns. Natural language typically has entropy between 1 and 4 bits per character.
What to seek
Lower entropy indicates more predictable text patterns. Entropy should decrease as n-gram size increases.
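
A sketch of Shannon entropy over an empirical unigram distribution (illustrative helper, not the project's code); note that 2^entropy recovers the perplexity of the same distribution:

```python
import math
from collections import Counter

def entropy_bits(tokens: list[str]) -> float:
    """Shannon entropy (bits/token) of the empirical token distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Two equally likely symbols carry exactly 1 bit each,
# corresponding to a perplexity of 2 ** 1 = 2.
h = entropy_bits(["a", "b", "a", "b"])
```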

Coverage (Top-K)

Definition
Percentage of corpus occurrences explained by the top K most frequent n-grams.
Intuition
High coverage with few patterns indicates repetitive/formulaic text; low coverage suggests diverse vocabulary usage.
What to seek
For language modeling, moderate coverage (40–60% with top-1000) is typical for natural text.
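
A minimal sketch over a flat list of n-gram occurrences (function name and data are illustrative):

```python
from collections import Counter

def topk_coverage(ngrams: list[str], k: int) -> float:
    """Share of all occurrences accounted for by the k most frequent n-grams."""
    counts = Counter(ngrams)
    top = sum(count for _, count in counts.most_common(k))
    return 100.0 * top / len(ngrams)

# The single most frequent bigram covers 5 of 10 occurrences: 50%.
cov = topk_coverage(["the cat"] * 5 + ["a dog"] * 3 + ["to be"] * 2, k=1)
```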

Average Entropy

Definition
Mean entropy across all contexts, measuring average uncertainty in next-word prediction.
Intuition
Lower entropy means the model is more confident about what comes next. A context of 1 token leaves high entropy; a context of 4 tokens leaves low entropy.
What to seek
Decreasing entropy with larger context sizes. Very low entropy (<0.1) indicates highly deterministic transitions.
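
A sketch of per-context entropy for a Markov chain built from a token stream (illustrative code, not the Wikilangs implementation):

```python
import math
from collections import Counter, defaultdict

def average_context_entropy(tokens: list[str], ctx: int = 1) -> float:
    """Mean Shannon entropy of the next-token distribution over all contexts."""
    nexts = defaultdict(Counter)
    for i in range(len(tokens) - ctx):
        nexts[tuple(tokens[i:i + ctx])][tokens[i + ctx]] += 1
    entropies = []
    for counter in nexts.values():
        total = sum(counter.values())
        entropies.append(-sum((c / total) * math.log2(c / total)
                              for c in counter.values()))
    return sum(entropies) / len(entropies)

# Context "a" is followed by "b" or "c" (1 bit); "b" is deterministic (0 bits).
avg_h = average_context_entropy(["a", "b", "a", "c"], ctx=1)
```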

Branching Factor

Definition
Average number of unique next tokens observed for each context.
Intuition
High branching = many possible continuations (flexible but uncertain); low branching = few options (predictable but potentially repetitive).
What to seek
Branching factor should decrease with context size. Values near 1.0 indicate nearly deterministic chains.
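
The same transition counts yield the branching factor; here is an illustrative sketch:

```python
from collections import defaultdict

def branching_factor(tokens: list[str], ctx: int = 1) -> float:
    """Mean number of distinct next tokens observed per context."""
    nexts = defaultdict(set)
    for i in range(len(tokens) - ctx):
        nexts[tuple(tokens[i:i + ctx])].add(tokens[i + ctx])
    return sum(len(s) for s in nexts.values()) / len(nexts)

# "a" has two observed continuations, "b" has one: (2 + 1) / 2 = 1.5.
bf = branching_factor(["a", "b", "a", "c"], ctx=1)
```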

Predictability

Definition
Derived metric: (1 − normalized_entropy) × 100%. Indicates how deterministic the model's predictions are.
Intuition
100% means the next word is always certain; 0% means completely random.
What to seek
Higher predictability for text generation quality, but too high (>98%) may produce repetitive output.
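
A sketch of the formula above, assuming entropy is normalized by its maximum possible value, log2 of the vocabulary size (that normalization scheme is an assumption, not confirmed by this page):

```python
import math

def predictability(entropy: float, vocab_size: int) -> float:
    """(1 - normalized_entropy) * 100, normalizing by log2(vocab_size)."""
    normalized = entropy / math.log2(vocab_size)
    return (1.0 - normalized) * 100.0

# 1 bit of entropy over a 4-symbol vocabulary (max 2 bits) -> 50%.
p = predictability(1.0, vocab_size=4)
```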

Zipf's Coefficient

Definition
The slope of the log-log plot of word frequency vs. rank. Zipf's law predicts this should be approximately −1.
Intuition
A coefficient near −1 indicates natural language patterns where a few words are very common and most words are rare.
What to seek
Values between −0.8 and −1.2 indicate healthy natural language distribution.

R² (Coefficient of Determination)

Definition
Measures how well the linear fit explains the frequency-rank relationship. Ranges from 0 to 1.
Intuition
R² near 1.0 means the data closely follows Zipf's law.
What to seek
R² > 0.95 is excellent; > 0.99 indicates near-perfect Zipf adherence.
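
Both the Zipf coefficient and R² come out of one least-squares fit of log frequency against log rank; a self-contained sketch (illustrative, not the project's code):

```python
import math

def zipf_fit(frequencies: list[float]) -> tuple[float, float]:
    """Least-squares fit of log10(frequency) vs. log10(rank).
    Returns (slope, r_squared); Zipf's law predicts a slope near -1."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot

# An ideal Zipfian corpus: frequency proportional to 1/rank.
slope, r2 = zipf_fit([1000 / rank for rank in range(1, 51)])
```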

Isotropy

Definition
Measures how uniformly distributed vectors are in the embedding space. Computed as the ratio of minimum to maximum singular values.
Intuition
High isotropy (near 1.0) means vectors spread evenly in all directions; low isotropy means vectors cluster in certain directions.
What to seek
Higher isotropy generally indicates better-quality embeddings. Values > 0.1 are reasonable; > 0.3 is good.
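
Following the definition above, a sketch of the singular-value ratio, assuming NumPy is available (the embedding matrix here is a toy example):

```python
import numpy as np

def isotropy(embeddings) -> float:
    """Ratio of the smallest to the largest singular value of the
    embedding matrix; 1.0 means all directions are used evenly."""
    matrix = np.asarray(embeddings, dtype=float)
    singular_values = np.linalg.svd(matrix, compute_uv=False)  # descending
    return float(singular_values[-1] / singular_values[0])

# Perfectly isotropic toy embeddings: one unit vector per axis.
iso = isotropy(np.eye(3))
```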

Cosine Similarity

Definition
Measures angular similarity between vectors, ranging from −1 (opposite) to 1 (identical direction).
Intuition
Words with similar meanings should have high cosine similarity.
What to seek
Semantically related words should score > 0.5; synonyms often score > 0.7.
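
A minimal sketch of the measure itself (the vectors are toy data, not real embeddings):

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Orthogonal vectors score 0; parallel vectors score 1.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```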

Interpretation Guidelines

Compare within model families

Metrics are most meaningful when comparing models of the same type: an 8k vs. a 64k tokenizer, for example, or a 2-gram vs. a 5-gram model.

Consider trade-offs

Better performance on one metric often comes at the cost of another. Higher compression may increase OOV rate; larger context reduces entropy but requires more data.

Context matters

Optimal values depend on downstream tasks. Text generation may prioritize different metrics than classification or search.

Corpus influence

All metrics are influenced by corpus characteristics. Wikipedia text differs from social media, literature, or conversational data.

Language-specific patterns

Morphologically rich languages (like Arabic or Turkish) may show different optimal ranges than analytic languages (like English or Chinese).