
Church Slavonic словѣньскъ / ⰔⰎⰑⰂⰡⰐⰠⰔⰍⰟ

ISO 639-1: cu · ISO 639-3: chu (scope: individual, type: ancient)
0 Words in Vocabulary

Sample Text

Sample excerpts from Church Slavonic Wikipedia articles.

Хрїстъ Вседержитель Сінайскїѧ обители и восковая икона срѣ́дꙋ VI вѣка въ Кѡнстантїнополѣ сотворенная и въ 1962 гѡдѣ ѡбновлєннаѧ ѥсть · Сїѧ же и древнѣйшїй самозрачный образъ Їиса Хрїста ѥстъ ⁙
Ри́га и латвїискꙑ Rīga · стольнъ градъ Латвїѩ ѥстъ · ꙁьдана ѥстъ рѣцѣ Ꙁападьнѣ Дьвинѣ ⁙ Людии 709.145 обитаѥтъ ⁙ Основана 1201 лѣта ѥстъ нѣмьцкомь єпископомь Албєртомь · а помѣновєна ꙁапрьва лѣтописи 1198 лѣта ѥстъ ⁙ Градъ положєниѥмь самостоꙗтєл҄ьнѫ властьнѫ ѥдиницѫ сѧ одѣлꙗѥтъ

Most Common Words

The 20 most frequently used words in Church Slavonic Wikipedia.

Top 20 words in Church Slavonic

Performance Dashboard

Key metrics for all model types at a glance.

Performance dashboard for Church Slavonic

Quick Start

Tokenizer

from wikilangs import tokenizer

tok = tokenizer(date='latest', lang='cu', vocab_size=32000)
tokens = tok.tokenize("Your text here")
print(tokens)

N-gram Model

from wikilangs import ngram

ng = ngram(date='latest', lang='cu', gram_size=3)
score = ng.score("Your text here")
predictions = ng.predict_next("Start of", top_k=5)

Markov Chain

from wikilangs import markov

mc = markov(date='latest', lang='cu', depth=3)
text = mc.generate(length=50)
print(text)

Vocabulary

from wikilangs import vocabulary

vocab = vocabulary(date='latest', lang='cu')
info = vocab.lookup("word")
print(info)  # frequency, IDF, rank

Embeddings

from wikilangs import embeddings

emb = embeddings(date='latest', lang='cu', dimension=64)
vec = emb.embed_word("word")
sent_vec = emb.embed_sentence("A sentence", method='rope')

Available Models

Model Type        | Variants           | Description
Tokenizers        | 8k, 16k, 32k, 64k  | BPE tokenizers with different vocabulary sizes
N-gram (Word)     | 2, 3, 4, 5-gram    | Word-level language models
N-gram (Subword)  | 2, 3, 4, 5-gram    | Subword-level language models
Markov (Word)     | Depth 1-5          | Word-level text generation
Markov (Subword)  | Depth 1-5          | Subword-level text generation
Vocabulary        | -                  | Word dictionary with frequency and IDF
Embeddings        | 32d, 64d, 128d     | Position-aware word embeddings
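
Each variant corresponds to a constructor argument in the Quick Start snippets above; a short sketch of how the rows of this table translate into code (the parameter names are the ones already shown on this page, and the specific values are illustrative):

from wikilangs import tokenizer, ngram, markov, embeddings

# Tokenizers row: vocabulary sizes 8k, 16k, 32k, 64k
tok_64k = tokenizer(date='latest', lang='cu', vocab_size=64000)

# N-gram rows: orders 2 through 5
ng_5 = ngram(date='latest', lang='cu', gram_size=5)

# Markov rows: context depths 1 through 5
mc_1 = markov(date='latest', lang='cu', depth=1)

# Embeddings row: dimensions 32, 64, 128
emb_128 = embeddings(date='latest', lang='cu', dimension=128)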

Model Evaluation

Tokenizer Performance

Compression ratios and token statistics across vocabulary sizes. Higher compression means fewer tokens for the same text.

Tokenizer compression ratios
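
Compression here is measured as characters per token, so a tokenizer that covers the same text with fewer tokens scores higher. A minimal sketch of the calculation, reusing the tokenizer call from Quick Start (the sample sentence is illustrative):

from wikilangs import tokenizer

tok = tokenizer(date='latest', lang='cu', vocab_size=32000)
text = "Ри́га стольнъ градъ Латвїѩ ѥстъ"
tokens = tok.tokenize(text)

# characters per token: higher means each token covers more of the text
print(f"compression: {len(text) / len(tokens):.2f} chars/token")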

N-gram Model Evaluation

Perplexity and entropy metrics across n-gram sizes. Lower perplexity indicates better predictive performance.

N-gram perplexity
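
Perplexity is the exponentiated average negative log-probability of each token, so a lower value means the model is less surprised by held-out text; entropy is the same quantity expressed in bits. A self-contained sketch of the relationship (the per-token probabilities are invented for illustration; the report's numbers come from the trained n-gram models):

import math

# probabilities an n-gram model might assign to each token of a held-out sentence (invented)
token_probs = [0.21, 0.05, 0.13, 0.02, 0.30]

# cross-entropy in bits per token
entropy = -sum(math.log2(p) for p in token_probs) / len(token_probs)

# perplexity = 2 ** entropy; lower is better
print(f"entropy: {entropy:.2f} bits/token, perplexity: {2 ** entropy:.1f}")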

Markov Chain Evaluation

Entropy and branching factor by context depth. Lower entropy means more predictable text generation.

Markov chain entropy
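
For a Markov chain, entropy measures how spread out the next-token distribution is for a given context, and the branching factor counts how many distinct continuations each context has. A self-contained sketch with a toy transition table (the contexts and counts are invented; the real models are estimated from the Wikipedia dump):

import math
from collections import Counter

# toy transition table: context -> counts of observed next tokens (invented)
transitions = {
    ('градъ',): Counter({'ѥстъ': 5, 'Латвїѩ': 2, 'великъ': 1}),
    ('лѣта',): Counter({'ѥстъ': 7, 'бысть': 1}),
}

def context_entropy(counts):
    # Shannon entropy (bits) of the next-token distribution for one context
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print('mean entropy:', sum(context_entropy(c) for c in transitions.values()) / len(transitions))
print('mean branching factor:', sum(len(c) for c in transitions.values()) / len(transitions))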

Vocabulary Analysis

Word frequency distribution and Zipf's law analysis.

Zipf's law distribution
Top 20 words
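
Zipf's law predicts that word frequency falls off roughly as a power of rank, f(r) ≈ C / r^s with s near 1 for natural language, which shows up as a straight line on a log-log plot. A sketch of estimating the exponent from a frequency list with a least-squares fit in log space (the counts are illustrative; numpy is assumed to be available):

import numpy as np

# word frequencies sorted from most to least common (illustrative counts)
freqs = np.array([1200, 640, 410, 300, 240, 200, 170, 150, 135, 120], dtype=float)
ranks = np.arange(1, len(freqs) + 1)

# fit log f = log C - s * log r; the magnitude of the slope is the Zipf exponent s
slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"estimated Zipf exponent: {-slope:.2f}")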

Embeddings Evaluation

Isotropy and vector space quality metrics. Higher isotropy indicates more uniformly distributed embeddings.

Embedding isotropy
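
Isotropy asks whether the embedding vectors spread evenly in all directions rather than collapsing into a narrow cone; scores closer to 1 mean a more uniform spread. The exact measure behind the isotropy figure reported in Key Metrics is not specified on this page; one common proxy (Mu and Viswanath, 2018) compares the partition function Z(c) = sum_i exp(c · v_i) across the principal directions of the embedding matrix, sketched below with random vectors standing in for the real embeddings:

import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 64))  # stand-in for the 64d word embeddings

# principal directions of the embedding space
_, _, components = np.linalg.svd(vectors, full_matrices=False)

# partition function Z(c) along each principal direction
z = np.exp(vectors @ components.T).sum(axis=0)

# isotropy proxy: min Z / max Z, in (0, 1]; higher means more uniform
print(f"isotropy: {z.min() / z.max():.4f}")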

Key Metrics

Best Compression | 4.94x  | Characters per token (higher is better)
Best Isotropy    | 0.2434 | Embedding uniformity (higher is better)
Vocabulary Size  | 0      | Unique words in training data

Full Research Report

Access the complete ablation study with all metrics, visualizations, and generated text samples on HuggingFace.

View Full Report on HuggingFace