Research & Methodology
Wikilangs produces reproducible NLP models for 340+ languages from Wikipedia data. This page documents the full pipeline, evaluation metrics, and interpretation guidelines.
Pipeline
Each language passes through five stages, from raw Wikipedia dump to published, documented model artifacts on HuggingFace.
Collection
Monthly Wikipedia snapshots via wikipedia-monthly across 340+ languages.
Processing
Markup removal, normalization, and preservation of scripts and diacritics; features important to each language are retained.
Training
BPE tokenizers (8k–64k vocabulary sizes), n-gram models (n = 2–5), Markov chains (context sizes 1–4), word embeddings (32–128 dimensions).
Evaluation
Comprehensive metrics on held-out test data with ablation studies comparing all hyperparameter variants.
Publishing
Models, vocabularies, and evaluation reports published to HuggingFace with model cards.
Metrics Reference
Select a model family to see the metrics used for evaluation and how to interpret them.
Compression Ratio
- Definition
- The ratio of characters to tokens (chars/token). Measures how efficiently the tokenizer represents text.
- Intuition
- Higher compression means fewer tokens needed to represent the same text, reducing sequence lengths for downstream models. A 3× compression means ~3 characters per token on average.
- What to seek
- Higher is generally better for efficiency, but extremely high compression may indicate overly aggressive merging that loses morphological information.
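The computation is simply characters divided by tokens. A minimal sketch in Python (function and variable names here are illustrative, not Wikilangs APIs):

```python
def compression_ratio(text: str, tokens: list[str]) -> float:
    """Characters per token: higher means a more compact tokenization."""
    return len(text) / len(tokens)

# "hello world" is 11 characters; 3 tokens gives ~3.67 chars/token
ratio = compression_ratio("hello world", ["hel", "lo ", "world"])
```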
Average Token Length (Fertility)
- Definition
- Mean number of characters per token produced by the tokenizer.
- Intuition
- Longer tokens capture more context but may struggle with rare words; shorter tokens are more flexible but increase sequence length.
- What to seek
- Aim for a balance of 2–5 characters for most languages. Arabic and other morphologically rich languages may benefit from slightly longer tokens.
Unknown Token Rate (OOV Rate)
- Definition
- Percentage of tokens that map to the unknown/UNK token.
- Intuition
- Lower OOV means better vocabulary coverage. High OOV indicates the tokenizer encounters many unseen character sequences.
- What to seek
- Below 1% is excellent; below 5% is acceptable. BPE tokenizers typically achieve very low OOV due to subword fallback.
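The OOV rate is a straightforward count over the tokenized output. A sketch, assuming the UNK token is spelled `<unk>` (the actual symbol depends on the tokenizer configuration):

```python
def oov_rate(tokens: list[str], unk: str = "<unk>") -> float:
    """Percentage of tokens that mapped to the unknown token."""
    return 100.0 * sum(t == unk for t in tokens) / len(tokens)

rate = oov_rate(["the", "<unk>", "cat", "sat"])  # 1 of 4 tokens -> 25.0
```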
Perplexity
- Definition
- Measures how "surprised" the model is by test data. Mathematically: 2^(cross-entropy). Lower values indicate better prediction.
- Intuition
- If perplexity is 100, the model is as uncertain as if choosing uniformly among 100 options at each step. A perplexity of 10 means effectively choosing among 10 equally likely options.
- What to seek
- Lower is better. Perplexity decreases with larger n-grams. Values vary widely by language and corpus size.
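The definition above can be made concrete in a few lines. This sketch takes the probability the model assigned to each observed test token (names are illustrative):

```python
import math

def perplexity(probs: list[float]) -> float:
    """2^(cross-entropy), where cross-entropy is the mean negative log2 probability."""
    cross_entropy = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** cross_entropy

# Uniform 1/100 probability at every step yields perplexity 100,
# matching the "choosing among 100 options" intuition
pp = perplexity([0.01] * 5)
```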
Entropy
- Definition
- Average information content (in bits) needed to encode the next token given the context. Related to perplexity: perplexity = 2^entropy.
- Intuition
- High entropy means high uncertainty/randomness; low entropy means predictable patterns. Natural language typically has entropy between 1 and 4 bits per character.
- What to seek
- Lower entropy indicates more predictable text patterns. Entropy should decrease as n-gram size increases.
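For the simplest (unigram) case, entropy is computed from the empirical token distribution. A minimal sketch:

```python
import math
from collections import Counter

def entropy_bits(tokens: list[str]) -> float:
    """Average bits per token under the empirical unigram distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

h = entropy_bits(["a", "b", "a", "b"])  # two equally likely symbols -> 1.0 bit
```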
Coverage (Top-K)
- Definition
- Percentage of corpus occurrences explained by the top K most frequent n-grams.
- Intuition
- High coverage with few patterns indicates repetitive/formulaic text; low coverage suggests diverse vocabulary usage.
- What to seek
- For language modeling, moderate coverage (40–60% with top-1000) is typical for natural text.
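Top-K coverage follows directly from the frequency table. A sketch (illustrative names, not the Wikilangs implementation):

```python
from collections import Counter

def top_k_coverage(ngrams: list[str], k: int) -> float:
    """Percentage of corpus occurrences explained by the k most frequent n-grams."""
    counts = Counter(ngrams)
    top = sum(c for _, c in counts.most_common(k))
    return 100.0 * top / len(ngrams)

cov = top_k_coverage(["a", "a", "b", "c"], k=1)  # "a" covers 2 of 4 -> 50.0
```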
Average Entropy
- Definition
- Mean entropy across all contexts, measuring average uncertainty in next-word prediction.
- Intuition
- Lower entropy means the model is more confident about what comes next. Entropy is high at context size 1 and low at context size 4.
- What to seek
- Decreasing entropy with larger context sizes. Very low entropy (<0.1) indicates highly deterministic transitions.
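Given a Markov chain stored as context-to-counts mappings, the average can be sketched as follows. This version takes an unweighted mean over contexts; whether Wikilangs weights contexts by frequency is an assumption left open here:

```python
import math

def average_entropy(chain: dict) -> float:
    """chain maps context -> {next_token: count}; unweighted mean entropy over contexts."""
    def h(counts):
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())
    return sum(h(c) for c in chain.values()) / len(chain)

# One uncertain context (1 bit) and one deterministic context (0 bits) -> 0.5
avg = average_entropy({("the",): {"cat": 1, "dog": 1}, ("a",): {"cat": 1}})
```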
Branching Factor
- Definition
- Average number of unique next tokens observed for each context.
- Intuition
- High branching = many possible continuations (flexible but uncertain); low branching = few options (predictable but potentially repetitive).
- What to seek
- Branching factor should decrease with context size. Values near 1.0 indicate nearly deterministic chains.
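On the same context-to-counts representation, branching factor is just the mean number of distinct continuations (a sketch with illustrative names):

```python
def branching_factor(chain: dict) -> float:
    """Average number of distinct next tokens per context."""
    return sum(len(nexts) for nexts in chain.values()) / len(chain)

# (2 continuations + 1 continuation) / 2 contexts = 1.5
bf = branching_factor({("the",): {"cat": 3, "dog": 1}, ("a",): {"cat": 2}})
```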
Predictability
- Definition
- Derived metric: (1 − normalized_entropy) × 100%. Indicates how deterministic the model's predictions are.
- Intuition
- 100% means the next word is always certain; 0% means completely random.
- What to seek
- Higher predictability for text generation quality, but too high (>98%) may produce repetitive output.
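The exact normalization scheme is not spelled out above; this sketch assumes entropy is normalized by the maximum possible entropy for the observed continuations, log2 of the number of options (an assumption, not the documented Wikilangs formula):

```python
import math

def predictability(next_counts: dict) -> float:
    """(1 - normalized_entropy) * 100 for one context.
    Assumes normalization by log2(number of observed options)."""
    total = sum(next_counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in next_counts.values())
    h_max = math.log2(len(next_counts)) if len(next_counts) > 1 else 1.0
    return (1.0 - h / h_max) * 100.0

certain = predictability({"cat": 7})            # single continuation -> 100.0
uniform = predictability({"cat": 1, "dog": 1})  # maximally uncertain -> 0.0
```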
Zipf's Coefficient
- Definition
- The slope of the log-log plot of word frequency vs. rank. Zipf's law predicts this should be approximately −1.
- Intuition
- A coefficient near −1 indicates natural language patterns where a few words are very common and most words are rare.
- What to seek
- Values between −0.8 and −1.2 indicate a healthy natural language distribution.
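The coefficient is the least-squares slope in log-log space. A self-contained sketch (illustrative names):

```python
import math
from collections import Counter

def zipf_coefficient(tokens: list[str]) -> float:
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Frequencies 6, 3, 2 are exactly proportional to 1/rank -> slope -1
slope = zipf_coefficient(["a"] * 6 + ["b"] * 3 + ["c"] * 2)
```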
R² (Coefficient of Determination)
- Definition
- Measures how well the linear fit explains the frequency-rank relationship. Ranges from 0 to 1.
- Intuition
- R² near 1.0 means the data closely follows Zipf's law.
- What to seek
- R² > 0.95 is excellent; > 0.99 indicates near-perfect Zipf adherence.
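For a simple linear fit, R² equals the squared Pearson correlation between the fitted variables (here, log rank and log frequency). A generic sketch:

```python
def r_squared(xs: list[float], ys: list[float]) -> float:
    """Squared Pearson correlation: fraction of variance explained by a linear fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

fit = r_squared([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])  # perfectly linear -> 1.0
```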
Isotropy
- Definition
- Measures how uniformly distributed vectors are in the embedding space. Computed as the ratio of minimum to maximum singular values.
- Intuition
- High isotropy (near 1.0) means vectors spread evenly in all directions; low isotropy means vectors cluster in certain directions.
- What to seek
- Higher isotropy generally indicates better-quality embeddings. Values > 0.1 are reasonable; > 0.3 is good.
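Under the min/max singular value definition above, the metric can be sketched with NumPy (illustrative names, assuming rows of the matrix are embedding vectors):

```python
import numpy as np

def isotropy(embeddings: np.ndarray) -> float:
    """Ratio of smallest to largest singular value of the embedding matrix."""
    s = np.linalg.svd(embeddings, compute_uv=False)
    return float(s.min() / s.max())

iso = isotropy(np.eye(4))  # perfectly uniform spread of directions -> 1.0
```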
Cosine Similarity
- Definition
- Measures angular similarity between vectors, ranging from −1 (opposite) to 1 (identical direction).
- Intuition
- Words with similar meanings should have high cosine similarity.
- What to seek
- Semantically related words should score > 0.5; synonyms often score > 0.7.
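Cosine similarity is the dot product of the two vectors divided by the product of their lengths. A minimal sketch:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product normalized by vector magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # identical direction -> 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated -> 0.0
```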
Interpretation Guidelines
Compare within model families
Metrics are most meaningful when comparing models of the same type: for example, an 8k vs. 64k tokenizer, or a 2-gram vs. 5-gram model.
Consider trade-offs
Better performance on one metric often comes at the cost of another. Higher compression may increase OOV rate; larger context reduces entropy but requires more data.
Context matters
Optimal values depend on downstream tasks. Text generation may prioritize different metrics than classification or search.
Corpus influence
All metrics are influenced by corpus characteristics. Wikipedia text differs from social media, literature, or conversational data.
Language-specific patterns
Morphologically rich languages (like Arabic or Turkish) may show different optimal ranges than analytic languages (like English or Chinese).