Research & Methodology

Wikilangs produces reproducible NLP models for 340+ languages from Wikipedia data. This page documents the full pipeline, evaluation metrics, and interpretation guidelines.

Pipeline

Each language passes through five stages, from raw Wikipedia dump to published, documented model artifacts on HuggingFace.

Collection

Monthly Wikipedia snapshots via wikipedia-monthly across 340+ languages.


Processing

Markup removal and normalization, with scripts and diacritics preserved so that features important to each language are retained.


Training

BPE tokenizers (8k–64k vocab), n-gram models (n = 2–5), Markov chains (context 1–4), word embeddings (32–128 dimensions).


Evaluation

Comprehensive metrics on held-out test data with ablation studies comparing all hyperparameter variants.


Publishing

Models, vocabularies, and evaluation reports published to HuggingFace with model cards.

Reproducibility: Monthly snapshots, versioned datasets, public training scripts.
Train/Test Split: All metrics computed on held-out data, never training data.
Ablation Studies: Systematic hyperparameter variation per language per model family.
Open Access: All code, models, and results freely available for research use.

Metrics Reference

Select a model family to see the metrics used for evaluation and how to interpret them.

Compression Ratio

Definition
The ratio of characters to tokens (chars/token). Measures how efficiently the tokenizer represents text.
Intuition
Higher compression means fewer tokens needed to represent the same text, reducing sequence lengths for downstream models. A 3× compression means ~3 characters per token on average.
What to seek
Higher is generally better for efficiency, but extremely high compression may indicate overly aggressive merging that loses morphological information.
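
As a minimal sketch, compression ratio can be computed directly from a text and its tokenization (the function name and example tokens below are illustrative, not from the Wikilangs codebase):

```python
def compression_ratio(text: str, tokens: list[str]) -> float:
    """Characters per token: higher means a more compact tokenization."""
    if not tokens:
        return 0.0
    return len(text) / len(tokens)

# Hypothetical subword split of a 20-character word into 4 tokens.
ratio = compression_ratio("internationalization",
                          ["intern", "ation", "al", "ization"])  # 5.0
```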

Average Token Length (Fertility)

Definition
Mean number of characters per token produced by the tokenizer.
Intuition
Longer tokens capture more context but may struggle with rare words; shorter tokens are more flexible but increase sequence length.
What to seek
Aim for a balance between 2 and 5 characters for most languages. Arabic and other morphologically rich languages may benefit from slightly longer tokens.
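
A sketch of the computation, assuming a plain list of token strings (helper name is illustrative):

```python
def average_token_length(tokens: list[str]) -> float:
    """Mean number of characters per token."""
    return sum(len(t) for t in tokens) / len(tokens)

avg = average_token_length(["intern", "ation", "al", "ization"])  # 5.0
```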

Unknown Token Rate (OOV Rate)

Definition
Percentage of tokens that map to the unknown/UNK token.
Intuition
Lower OOV means better vocabulary coverage. High OOV indicates the tokenizer encounters many unseen character sequences.
What to seek
Below 1% is excellent; below 5% is acceptable. BPE tokenizers typically achieve very low OOV due to subword fallback.
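
A minimal sketch of how the rate could be measured against a vocabulary (names and example data are illustrative):

```python
def unk_rate(tokens: list[str], vocab: set[str]) -> float:
    """Percentage of tokens that fall outside the vocabulary."""
    unknown = sum(1 for t in tokens if t not in vocab)
    return 100.0 * unknown / len(tokens)

# One of four tokens is out-of-vocabulary, so the rate is 25%.
rate = unk_rate(["the", "cat", "sat", "purred"], {"the", "cat", "sat"})
```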

Perplexity

Definition
Measures how "surprised" the model is by test data. Mathematically: perplexity = 2^(cross-entropy). Lower values indicate better prediction.
Intuition
If perplexity is 100, the model is as uncertain as if choosing uniformly among 100 options at each step. A perplexity of 10 means effectively choosing among 10 equally likely options.
What to seek
Lower is better. Perplexity decreases with larger n-grams. Values vary widely by language and corpus size.
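
The definition above can be sketched from per-token log2 probabilities, whatever model produced them (the function and data here are illustrative, not the Wikilangs implementation):

```python
import math

def perplexity(log2_probs: list[float]) -> float:
    """Perplexity = 2 ** cross-entropy, where cross-entropy is the mean
    negative log2 probability the model assigned to each observed token."""
    cross_entropy = -sum(log2_probs) / len(log2_probs)
    return 2.0 ** cross_entropy

# A model assigning every token probability 0.1 is as uncertain as a
# uniform choice among 10 options, so its perplexity is 10.
pp = perplexity([math.log2(0.1)] * 5)
```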

Entropy

Definition
Average information content (in bits) needed to encode the next token given the context. Related to perplexity: perplexity = 2^entropy.
Intuition
High entropy means high uncertainty/randomness; low entropy means predictable patterns. Natural language typically has entropy between 1 and 4 bits per character.
What to seek
Lower entropy indicates more predictable text patterns. Entropy should decrease as n-gram size increases.
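
A sketch of Shannon entropy over an empirical unigram distribution (illustrative helper, not the project's code); note that 2^entropy recovers the perplexity of the same distribution:

```python
import math
from collections import Counter

def entropy_bits(tokens: list[str]) -> float:
    """Shannon entropy (bits/token) of the empirical token distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Two equally likely symbols carry exactly 1 bit each,
# corresponding to a perplexity of 2 ** 1 = 2.
h = entropy_bits(["a", "b", "a", "b"])
```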

Coverage (Top-K)

Definition
Percentage of corpus occurrences explained by the top K most frequent n-grams.
Intuition
High coverage with few patterns indicates repetitive/formulaic text; low coverage suggests diverse vocabulary usage.
What to seek
For language modeling, moderate coverage (40–60% with top-1000) is typical for natural text.
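
A minimal sketch over a flat list of n-gram occurrences (function name and data are illustrative):

```python
from collections import Counter

def topk_coverage(ngrams: list[str], k: int) -> float:
    """Share of all occurrences accounted for by the k most frequent n-grams."""
    counts = Counter(ngrams)
    top = sum(count for _, count in counts.most_common(k))
    return 100.0 * top / len(ngrams)

# The single most frequent bigram covers 5 of 10 occurrences: 50%.
cov = topk_coverage(["the cat"] * 5 + ["a dog"] * 3 + ["to be"] * 2, k=1)
```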

Average Entropy

Definition
Mean entropy across all contexts, measuring average uncertainty in next-word prediction.
Intuition
Lower entropy means the model is more confident about what comes next. A context of 1 token leaves high entropy; a context of 4 tokens leaves low entropy.
What to seek
Decreasing entropy with larger context sizes. Very low entropy (<0.1) indicates highly deterministic transitions.
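
A sketch of per-context entropy for a Markov chain built from a token stream (illustrative code, not the Wikilangs implementation):

```python
import math
from collections import Counter, defaultdict

def average_context_entropy(tokens: list[str], ctx: int = 1) -> float:
    """Mean Shannon entropy of the next-token distribution over all contexts."""
    nexts = defaultdict(Counter)
    for i in range(len(tokens) - ctx):
        nexts[tuple(tokens[i:i + ctx])][tokens[i + ctx]] += 1
    entropies = []
    for counter in nexts.values():
        total = sum(counter.values())
        entropies.append(-sum((c / total) * math.log2(c / total)
                              for c in counter.values()))
    return sum(entropies) / len(entropies)

# Context "a" is followed by "b" or "c" (1 bit); "b" is deterministic (0 bits).
avg_h = average_context_entropy(["a", "b", "a", "c"], ctx=1)
```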

Branching Factor

Definition
Average number of unique next tokens observed for each context.
Intuition
High branching = many possible continuations (flexible but uncertain); low branching = few options (predictable but potentially repetitive).
What to seek
Branching factor should decrease with context size. Values near 1.0 indicate nearly deterministic chains.
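
The same transition counts yield the branching factor; here is an illustrative sketch:

```python
from collections import defaultdict

def branching_factor(tokens: list[str], ctx: int = 1) -> float:
    """Mean number of distinct next tokens observed per context."""
    nexts = defaultdict(set)
    for i in range(len(tokens) - ctx):
        nexts[tuple(tokens[i:i + ctx])].add(tokens[i + ctx])
    return sum(len(s) for s in nexts.values()) / len(nexts)

# "a" has two observed continuations, "b" has one: (2 + 1) / 2 = 1.5.
bf = branching_factor(["a", "b", "a", "c"], ctx=1)
```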

Predictability

Definition
Derived metric: (1 − normalized_entropy) × 100%. Indicates how deterministic the model's predictions are.
Intuition
100% means the next word is always certain; 0% means completely random.
What to seek
Higher predictability for text generation quality, but too high (>98%) may produce repetitive output.
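
A sketch of the formula above, assuming entropy is normalized by its maximum possible value, log2 of the vocabulary size (that normalization scheme is an assumption, not confirmed by this page):

```python
import math

def predictability(entropy: float, vocab_size: int) -> float:
    """(1 - normalized_entropy) * 100, normalizing by log2(vocab_size)."""
    normalized = entropy / math.log2(vocab_size)
    return (1.0 - normalized) * 100.0

# 1 bit of entropy over a 4-symbol vocabulary (max 2 bits) -> 50%.
p = predictability(1.0, vocab_size=4)
```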

Zipf's Coefficient

Definition
The slope of the log-log plot of word frequency vs. rank. Zipf's law predicts this should be approximately −1.
Intuition
A coefficient near −1 indicates natural language patterns where a few words are very common and most words are rare.
What to seek
Values between −0.8 and −1.2 indicate healthy natural language distribution.

R² (Coefficient of Determination)

Definition
Measures how well the linear fit explains the frequency-rank relationship. Ranges from 0 to 1.
Intuition
R² near 1.0 means the data closely follows Zipf's law.
What to seek
R² > 0.95 is excellent; > 0.99 indicates near-perfect Zipf adherence.
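
Both the Zipf coefficient and R² come out of one least-squares fit of log frequency against log rank; a self-contained sketch (illustrative, not the project's code):

```python
import math

def zipf_fit(frequencies: list[float]) -> tuple[float, float]:
    """Least-squares fit of log10(frequency) vs. log10(rank).
    Returns (slope, r_squared); Zipf's law predicts a slope near -1."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot

# An ideal Zipfian corpus: frequency proportional to 1/rank.
slope, r2 = zipf_fit([1000 / rank for rank in range(1, 51)])
```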

Isotropy

Definition
Measures how uniformly distributed vectors are in the embedding space. Computed as the ratio of minimum to maximum singular values.
Intuition
High isotropy (near 1.0) means vectors spread evenly in all directions; low isotropy means vectors cluster in certain directions.
What to seek
Higher isotropy generally indicates better-quality embeddings. Values > 0.1 are reasonable; > 0.3 is good.
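
Following the definition above, a sketch of the singular-value ratio, assuming NumPy is available (the embedding matrix here is a toy example):

```python
import numpy as np

def isotropy(embeddings) -> float:
    """Ratio of the smallest to the largest singular value of the
    embedding matrix; 1.0 means all directions are used evenly."""
    matrix = np.asarray(embeddings, dtype=float)
    singular_values = np.linalg.svd(matrix, compute_uv=False)  # descending
    return float(singular_values[-1] / singular_values[0])

# Perfectly isotropic toy embeddings: one unit vector per axis.
iso = isotropy(np.eye(3))
```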

Cosine Similarity

Definition
Measures angular similarity between vectors, ranging from −1 (opposite) to 1 (identical direction).
Intuition
Words with similar meanings should have high cosine similarity.
What to seek
Semantically related words should score > 0.5; synonyms often score > 0.7.
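
A minimal sketch of the measure itself (the vectors are toy data, not real embeddings):

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Orthogonal vectors score 0; parallel vectors score 1.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```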

Interpretation Guidelines

Compare within model families

Metrics are most meaningful when comparing models of the same type: an 8k vs. a 64k tokenizer, for example, or a 2-gram vs. a 5-gram model.

Consider trade-offs

Better performance on one metric often comes at the cost of another. Higher compression may increase OOV rate; larger context reduces entropy but requires more data.

Context matters

Optimal values depend on downstream tasks. Text generation may prioritize different metrics than classification or search.

Corpus influence

All metrics are influenced by corpus characteristics. Wikipedia text differs from social media, literature, or conversational data.

Language-specific patterns

Morphologically rich languages (like Arabic or Turkish) may show different optimal ranges than analytic languages (like English or Chinese).