omarkamali committed (verified)
Commit a6db3d4 · 1 Parent(s): f777518

Upload all models and assets for ch (20251001)

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full list.
Files changed (50)
  1. README.md +278 -121
  2. models/embeddings/monolingual/ch_128d.bin +2 -2
  3. models/embeddings/monolingual/ch_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/ch_32d.bin +2 -2
  5. models/embeddings/monolingual/ch_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/ch_64d.bin +2 -2
  7. models/embeddings/monolingual/ch_64d_metadata.json +5 -3
  8. models/subword_markov/ch_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/ch_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/ch_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/ch_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/ch_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/ch_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/ch_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/ch_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/ch_2gram_subword.parquet +2 -2
  17. models/subword_ngram/ch_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/ch_3gram_subword.parquet +2 -2
  19. models/subword_ngram/ch_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/ch_4gram_subword.parquet +2 -2
  21. models/subword_ngram/ch_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/ch_tokenizer_16k.model +2 -2
  23. models/tokenizer/ch_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/ch_tokenizer_8k.model +2 -2
  25. models/tokenizer/ch_tokenizer_8k.vocab +0 -0
  26. models/vocabulary/ch_vocabulary.parquet +2 -2
  27. models/vocabulary/ch_vocabulary_metadata.json +9 -8
  28. models/word_markov/ch_markov_ctx1_word.parquet +2 -2
  29. models/word_markov/ch_markov_ctx1_word_metadata.json +2 -2
  30. models/word_markov/ch_markov_ctx2_word.parquet +2 -2
  31. models/word_markov/ch_markov_ctx2_word_metadata.json +2 -2
  32. models/word_markov/ch_markov_ctx3_word.parquet +2 -2
  33. models/word_markov/ch_markov_ctx3_word_metadata.json +2 -2
  34. models/word_markov/ch_markov_ctx4_word.parquet +2 -2
  35. models/word_markov/ch_markov_ctx4_word_metadata.json +2 -2
  36. models/word_ngram/ch_2gram_word.parquet +2 -2
  37. models/word_ngram/ch_2gram_word_metadata.json +2 -2
  38. models/word_ngram/ch_3gram_word.parquet +2 -2
  39. models/word_ngram/ch_3gram_word_metadata.json +2 -2
  40. models/word_ngram/ch_4gram_word.parquet +2 -2
  41. models/word_ngram/ch_4gram_word_metadata.json +2 -2
  42. visualizations/embedding_isotropy.png +0 -0
  43. visualizations/embedding_norms.png +0 -0
  44. visualizations/embedding_similarity.png +2 -2
  45. visualizations/markov_branching.png +0 -0
  46. visualizations/markov_contexts.png +0 -0
  47. visualizations/markov_entropy.png +0 -0
  48. visualizations/model_sizes.png +0 -0
  49. visualizations/ngram_coverage.png +0 -0
  50. visualizations/ngram_entropy.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
  metrics:
  - name: best_compression_ratio
  type: compression
- value: 4.113
+ value: 4.243
  - name: best_isotropy
  type: isotropy
- value: 0.0314
+ value: 0.0518
  - name: vocabulary_size
  type: vocab
- value: 2085
- generated: 2025-12-28
+ value: 0
+ generated: 2026-01-03
  ---

  # CH - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  ### Models & Assets

  - Tokenizers (8k, 16k, 32k, 64k)
- - N-gram models (2, 3, 4-gram)
- - Markov chains (context of 1, 2, 3 and 4)
+ - N-gram models (2, 3, 4, 5-gram)
+ - Markov chains (context of 1, 2, 3, 4 and 5)
  - Subword N-gram and Markov chains
- - Embeddings in various sizes and dimensions
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
  - Language Vocabulary
  - Language Statistics
+
  ![Performance Dashboard](visualizations/performance_dashboard.png)

  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
- - [6. Summary & Recommendations](#6-summary--recommendations)
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+ - [7. Summary & Recommendations](#7-summary--recommendations)
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
  - [Visualizations Index](#visualizations-index)

@@ -68,44 +70,49 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and

  ![Tokenizer Compression](visualizations/tokenizer_compression.png)

+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
+
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
+
  ### Results

  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
  |------------|-------------|---------------|----------|--------------|
- | **8k** | 3.833x | 3.79 | 0.1107% | 46,087 |
- | **16k** | 4.113x 🏆 | 4.07 | 0.1187% | 42,948 |
+ | **8k** | 3.977x | 3.99 | 0.1019% | 38,272 |
+ | **16k** | 4.243x 🏆 | 4.26 | 0.1087% | 35,871 |

  ### Tokenization Examples

  Below are sample sentences tokenized with each vocabulary size:

- **Sample 1:** `+ 125px 125px 300px
- Bulgaria, capitat Sofia.`
+ **Sample 1:** `Doerun, nasong-song gi Estados Unidos. Guåha 774 na tataogues na populasion i se...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁+ 1 2 5 px1 2 5 ... (+12 more)` | 22 |
- | 16k | `▁+1 2 5 px1 2 5 ... (+11 more)` | 21 |
+ | 8k | `▁do er un , ▁nasong - song gi ▁estados ▁unidos ... (+16 more)` | 26 |
+ | 16k | `▁doerun , nasong - song ▁gi ▁estadosunidos . ▁guåha ... (+14 more)` | 24 |

- **Sample 2:** `Giacomo Puccini (1858, Lucca, Italia - 1924, Bruselas, Belgica), un danderu åtmo...`
+ **Sample 2:** `Newhalen, nasong-song gi Estados Unidos. Guåha 190 na tataogues na populasion i ...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁gi a com o pu ccini( 1 8 5 ... (+29 more)` | 39 |
- | 16k | `▁giacomo ▁puccini( 1 8 5 8 ,lucca , ... (+25 more)` | 35 |
+ | 8k | `▁newha len ,nasong - song gi ▁estados ▁unidos . ... (+15 more)` | 25 |
+ | 16k | `▁newhalen ,nasong - song ▁gi ▁estados ▁unidos . guåha ... (+14 more)` | 24 |

- **Sample 3:** `I Sisteman Laibirihan Pupbleko Guåhan este na ufsiåt na sistema para i islan Guå...`
+ **Sample 3:** `Larsen Bay, nasong-song gi Estados Unidos. Guåha 87 na tataogues na populasion i...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁isisteman ▁lai birihan pupble ko ▁guåhanestenaufsiåt ... (+7 more)` | 17 |
- | 16k | `▁isisteman ▁laibirihanpupbleko ▁guåhan ▁estenaufsiåtna ▁sistema ... (+5 more)` | 15 |
+ | 8k | `▁larsenbay ,nasong - songgiestadosunidos . ... (+14 more)` | 24 |
+ | 16k | `▁larsenbay ,nasong - songgiestadosunidos . ... (+14 more)` | 24 |

  ### Key Findings

- - **Best Compression:** 16k achieves 4.113x compression
- - **Lowest UNK Rate:** 8k with 0.1107% unknown tokens
+ - **Best Compression:** 16k achieves 4.243x compression
+ - **Lowest UNK Rate:** 8k with 0.1019% unknown tokens
  - **Trade-off:** Larger vocabularies improve compression but increase model size
  - **Recommendation:** 32k vocabulary provides optimal balance for production use

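To spot-check the new tokenizer numbers locally, here is a minimal sketch using the 16k model shipped in this commit. It assumes the standard `sentencepiece` Python API, and it reads "compression" as raw characters per emitted token, which is an assumption about the pipeline's metric rather than a documented definition:

```python
# Minimal sketch: tokenize a sample sentence with the committed 16k model.
# Assumes the standard sentencepiece API; the characters-per-token reading
# of "compression" is our assumption, not the pipeline's documented metric.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="models/tokenizer/ch_tokenizer_16k.model")

text = "Doerun, nasong-song gi Estados Unidos."
pieces = sp.encode(text, out_type=str)

print(pieces)
print(f"{len(pieces)} tokens, ~{len(text) / len(pieces):.2f} chars/token")
```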
@@ -114,57 +121,89 @@ Bulgaria, capitat Sofia.`

  ![N-gram Perplexity](visualizations/ngram_perplexity.png)

+ ![N-gram Unique](visualizations/ngram_unique.png)
+
  ![N-gram Coverage](visualizations/ngram_coverage.png)

  ### Results

- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
- |--------|------------|---------|----------------|------------------|-------------------|
- | **2-gram** | 304 🏆 | 8.25 | 836 | 61.4% | 100.0% |
- | **2-gram** | 262 🏆 | 8.03 | 1,028 | 68.2% | 99.9% |
- | **3-gram** | 367 | 8.52 | 1,276 | 58.0% | 93.7% |
- | **3-gram** | 1,384 | 10.43 | 5,255 | 35.7% | 78.3% |
- | **4-gram** | 472 | 8.88 | 1,937 | 54.6% | 83.8% |
- | **4-gram** | 3,626 | 11.82 | 13,909 | 27.2% | 58.1% |
+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
+ | **2-gram** | Word | 181 | 7.50 | 496 | 68.1% | 100.0% |
+ | **2-gram** | Subword | 228 | 7.83 | 869 | 71.1% | 100.0% |
+ | **3-gram** | Word | 134 🏆 | 7.07 | 582 | 70.7% | 100.0% |
+ | **3-gram** | Subword | 1,281 | 10.32 | 4,543 | 36.5% | 79.7% |
+ | **4-gram** | Word | 158 | 7.30 | 842 | 66.6% | 100.0% |
+ | **4-gram** | Subword | 3,667 | 11.84 | 12,416 | 26.2% | 57.0% |

  ### Top 5 N-grams by Size

- **2-grams:**
+ **2-grams (Word):**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `katigoria :` | 774 |
- | 2 | `estados unidos` | 443 |
- | 3 | `. katigoria` | 411 |
- | 4 | `i sengsong` | 364 |
- | 5 | `. guåha` | 328 |
+ | 1 | `i sengsong` | 364 |
+ | 2 | `nu i` | 329 |
+ | 3 | `i senso` | 310 |
+ | 4 | `na populasion` | 309 |
+ | 5 | `populasion i` | 308 |

- **3-grams:**
+ **3-grams (Word):**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `. katigoria :` | 411 |
- | 2 | `nu i senso` | 308 |
+ | 1 | `nu i senso` | 308 |
+ | 2 | `na tataogues na` | 304 |
  | 3 | `na populasion i` | 304 |
- | 4 | `na tataogues na` | 304 |
- | 5 | `tataogues na populasion` | 304 |
+ | 4 | `tataogues na populasion` | 304 |
+ | 5 | `i sengsong nu` | 299 |

- **4-grams:**
+ **4-grams (Word):**

  | Rank | N-gram | Count |
  |------|--------|-------|
  | 1 | `na tataogues na populasion` | 304 |
  | 2 | `tataogues na populasion i` | 303 |
- | 3 | `i sengsong nu i` | 299 |
- | 4 | `populasion i sengsong nu` | 299 |
- | 5 | `na populasion i sengsong` | 299 |
+ | 3 | `na populasion i sengsong` | 299 |
+ | 4 | `sengsong nu i senso` | 299 |
+ | 5 | `populasion i sengsong nu` | 299 |
+
+ **2-grams (Subword):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `a _` | 4,934 |
+ | 2 | `i _` | 4,206 |
+ | 3 | `n a` | 2,921 |
+ | 4 | `a n` | 2,812 |
+ | 5 | `_ i` | 2,769 |
+
+ **3-grams (Subword):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `_ i _` | 2,254 |
+ | 2 | `_ n a` | 1,827 |
+ | 3 | `n a _` | 1,562 |
+ | 4 | `_ g i` | 1,306 |
+ | 5 | `_ m a` | 1,153 |
+
+ **4-grams (Subword):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `_ n a _` | 1,359 |
+ | 2 | `_ g i _` | 957 |
+ | 3 | `s o n g` | 950 |
+ | 4 | `_ i _ s` | 792 |
+ | 5 | `o n g _` | 757 |

  ### Key Findings

- - **Best Perplexity:** 2-gram with 262
+ - **Best Perplexity:** 3-gram (word) with 134
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
- - **Coverage:** Top-1000 patterns cover ~58% of corpus
+ - **Coverage:** Top-1000 patterns cover ~57% of corpus
  - **Recommendation:** 4-gram or 5-gram for best predictive performance

  ---
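The updated Results table is internally consistent with perplexity = 2^entropy (for the word 2-gram row, 2^7.50 ≈ 181). A short sketch of recomputing entropy, perplexity, and top-k coverage from one of the count tables in this commit; the parquet column names (`ngram`, `count`) are assumptions about the schema:

```python
# Recompute entropy, perplexity and top-k coverage from an n-gram count table.
# The column names ("ngram", "count") are assumed, not verified against the file.
import numpy as np
import pandas as pd

df = pd.read_parquet("models/word_ngram/ch_2gram_word.parquet")
p = df["count"].to_numpy(dtype=float)
p /= p.sum()                               # counts -> relative frequencies

entropy = -(p * np.log2(p)).sum()          # Shannon entropy, bits per n-gram
perplexity = 2.0 ** entropy                # uniform-equivalent branching factor
print(f"unique={len(p)}  H={entropy:.2f} bits  PPL={perplexity:.0f}")

top = np.sort(p)[::-1]                     # coverage of the most frequent patterns
print(f"top-100 coverage: {top[:100].sum():.1%}")
```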
@@ -172,55 +211,86 @@ Bulgaria, capitat Sofia.`

  ![Markov Entropy](visualizations/markov_entropy.png)

+ ![Markov Contexts](visualizations/markov_contexts.png)
+
  ![Markov Branching](visualizations/markov_branching.png)

  ### Results

- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
- |---------|-------------|------------|------------------|-----------------|----------------|
- | **1** | 0.4437 | 1.360 | 2.83 | 5,950 | 55.6% |
- | **1** | 1.1631 | 2.239 | 8.71 | 235 | 0.0% |
- | **2** | 0.1938 | 1.144 | 1.42 | 16,706 | 80.6% |
- | **2** | 1.1662 | 2.244 | 5.45 | 2,042 | 0.0% |
- | **3** | 0.0836 | 1.060 | 1.14 | 23,561 | 91.6% |
- | **3** | 0.7209 | 1.648 | 2.73 | 11,105 | 27.9% |
- | **4** | 0.0382 🏆 | 1.027 | 1.06 | 26,653 | 96.2% |
- | **4** | 0.3731 🏆 | 1.295 | 1.67 | 30,233 | 62.7% |
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
+ | **1** | Word | 0.4921 | 1.406 | 2.63 | 5,466 | 50.8% |
+ | **1** | Subword | 1.0948 | 2.136 | 7.84 | 226 | 0.0% |
+ | **2** | Word | 0.1702 | 1.125 | 1.32 | 14,200 | 83.0% |
+ | **2** | Subword | 1.1295 | 2.188 | 5.29 | 1,769 | 0.0% |
+ | **3** | Word | 0.0593 | 1.042 | 1.09 | 18,551 | 94.1% |
+ | **3** | Subword | 0.7380 | 1.668 | 2.81 | 9,336 | 26.2% |
+ | **4** | Word | 0.0213 🏆 | 1.015 | 1.03 | 19,968 | 97.9% |
+ | **4** | Subword | 0.3911 | 1.311 | 1.72 | 26,134 | 60.9% |
+
+ ### Generated Text Samples (Word-based)
+
+ Below are text samples generated from each word-based Markov chain model:
+
+ **Context Size 1:**
+
+ 1. `i islan guåhan si nanå ña ti ha nå an i senso ine giya guåhan i`
+ 2. `na tataogues na petsona siha manma å ñao i kotturan ñiha gi i dos botkan ni`
+ 3. `gi sankattan na populasion i mina tres manggimen tuba ginen i taotao ya ma ganna i`
+
+ **Context Size 2:**
+
+ 1. `i sengsong nu i senso unidos`
+ 2. `nu i proteksion i tano jesukristo gi fecha ni 25 disiembre`
+ 3. `populasion i sengsong nu i senso unidos`
+
+ **Context Size 3:**
+
+ 1. `na populasion i sengsong nu i senso unidos`
+ 2. `na tataogues na populasion i sengsong nu i senso unidos`
+ 3. `tataogues na populasion i sengsong nu i senso unidos`
+
+ **Context Size 4:**
+
+ 1. `na tataogues na populasion i sengsong nu i senso unidos`
+ 2. `tataogues na populasion i sengsong nu i senso unidos`
+ 3. `i sengsong nu i senso unidos`

- ### Generated Text Samples
+ ### Generated Text Samples (Subword-based)

- Below are text samples generated from each Markov chain model:
+ Below are text samples generated from each subword-based Markov chain model:

  **Context Size 1:**

- 1. `i sengsong dong hoi . guåha 1 ] yan i isla ni mineddong - spike barker`
- 2. `. guåha 103 , un lepblo - ña , krasnodar ) minisengsóng / web . katigoria`
- 3. `' huyong as a ' gue ' aotoridat para otro na populasion i senso 2010 .`
+ 1. `_eyanso'_giterge`
+ 2. `asipa_dinakoso_m`
+ 3. `nsi_nandiki_u_pa`

  **Context Size 2:**

- 1. `katigoria : dangkulo katigoria : chile`
- 2. `. katigoria : dangkulo katigoria : china katigoria : yeogråfia katigoria : dangkulo katigoria : esta...`
- 3. `i sengsong nu i senso 2010 . katigoria : estados unidos . guåha 4 154 200 na`
+ 1. `a_åchokkas_na_kri`
+ 2. `i_ta_magu_gi_i_ha`
+ 3. `na'neho_"thunidos`

  **Context Size 3:**

- 1. `. katigoria : tiempo`
- 2. `nu i senso 2000 . katigoria : dangkulo katigoria : estados unidos`
- 3. `na tataogues na populasion i sengsong nu i senso 2016 ( ine ) . katigoria : dangkulo katigoria`
+ 1. `_i_caste_pies_dang`
+ 2. `_na_populasifiku)_`
+ 3. `na_tan_atten-ñiha_`

  **Context Size 4:**

- 1. `na tataogues na populasion i sengsong nu i senso 2010 . katigoria : dangkulo katigoria : estados uni...`
- 2. `tataogues na populasion i sengsong nu i senso 2012 . katigoria : chapan katigoria : dangkulo`
- 3. `i sengsong nu i senso 2014 . katigoria : dangkulo katigoria : estados unidos`
+ 1. `_na_po'lu_na_aterit`
+ 2. `_gi_estorio_ni'_kad`
+ 3. `song-song_nu_i_akti`

  ### Key Findings

- - **Best Predictability:** Context-4 with 96.2% predictability
+ - **Best Predictability:** Context-4 (word) with 97.9% predictability
  - **Branching Factor:** Decreases with context size (more deterministic)
- - **Memory Trade-off:** Larger contexts require more storage (30,233 contexts)
+ - **Memory Trade-off:** Larger contexts require more storage (26,134 contexts)
  - **Recommendation:** Context-3 or Context-4 for text generation

  ---
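The generated samples above are weighted random walks over transition tables like the ones updated in this commit. A minimal sketch of that sampling loop; the parquet schema (`context`, `next_token`, `count` columns) is an assumption:

```python
# Weighted sampling from a context-2 word Markov table (schema is assumed).
import random
import pandas as pd

df = pd.read_parquet("models/word_markov/ch_markov_ctx2_word.parquet")
table = {ctx: (g["next_token"].tolist(), g["count"].tolist())
         for ctx, g in df.groupby("context")}

def generate(seed, steps=15):
    words = seed.split()
    for _ in range(steps):
        ctx = " ".join(words[-2:])         # last two words form the context
        if ctx not in table:
            break                          # dead end: unseen context
        options, weights = table[ctx]
        words.append(random.choices(options, weights=weights)[0])
    return " ".join(words)

print(generate("i sengsong"))
```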
@@ -236,64 +306,64 @@ Below are text samples generated from each Markov chain model:

  | Metric | Value |
  |--------|-------|
- | Vocabulary Size | 2,085 |
- | Total Tokens | 25,270 |
- | Mean Frequency | 12.12 |
+ | Vocabulary Size | 1,918 |
+ | Total Tokens | 22,697 |
+ | Mean Frequency | 11.83 |
  | Median Frequency | 3 |
- | Frequency Std Dev | 73.61 |
+ | Frequency Std Dev | 73.74 |

  ### Most Common Words

  | Rank | Word | Frequency |
  |------|------|-----------|
- | 1 | i | 2,329 |
- | 2 | na | 1,512 |
- | 3 | gi | 982 |
- | 4 | katigoria | 774 |
- | 5 | estados | 459 |
- | 6 | unidos | 448 |
- | 7 | yan | 440 |
- | 8 | sengsong | 370 |
- | 9 | guåha | 356 |
- | 10 | ni | 339 |
+ | 1 | i | 2,327 |
+ | 2 | na | 1,513 |
+ | 3 | gi | 972 |
+ | 4 | unidos | 448 |
+ | 5 | yan | 436 |
+ | 6 | sengsong | 370 |
+ | 7 | guåha | 356 |
+ | 8 | ni | 339 |
+ | 9 | nu | 335 |
+ | 10 | populasion | 333 |

  ### Least Common Words (from vocabulary)

  | Rank | Word | Frequency |
  |------|------|-----------|
- | 1 | säger | 2 |
- | 2 | ett | 2 |
- | 3 | | 2 |
- | 4 | du | 2 |
- | 5 | skate | 2 |
- | 6 | med | 2 |
- | 7 | smaskiga | 2 |
- | 8 | löken | 2 |
- | 9 | tychy | 2 |
- | 10 | museon | 2 |
+ | 1 | av | 2 |
+ | 2 | berit | 2 |
+ | 3 | larsson | 2 |
+ | 4 | hemliga | 2 |
+ | 5 | tycker | 2 |
+ | 6 | att | 2 |
+ | 7 | var | 2 |
+ | 8 | rolig | 2 |
+ | 9 | ett | 2 |
+ | 10 | du | 2 |

  ### Zipf's Law Analysis

  | Metric | Value |
  |--------|-------|
- | Zipf Coefficient | 0.9652 |
- | R² (Goodness of Fit) | 0.986639 |
+ | Zipf Coefficient | 0.9581 |
+ | R² (Goodness of Fit) | 0.986461 |
  | Adherence Quality | **excellent** |

  ### Coverage Analysis

  | Top N Words | Coverage |
  |-------------|----------|
- | Top 100 | 62.8% |
- | Top 1,000 | 90.4% |
+ | Top 100 | 63.1% |
+ | Top 1,000 | 91.3% |
  | Top 5,000 | 0.0% |
  | Top 10,000 | 0.0% |

  ### Key Findings

- - **Zipf Compliance:** R²=0.9866 indicates excellent adherence to Zipf's law
- - **High Frequency Dominance:** Top 100 words cover 62.8% of corpus
- - **Long Tail:** -7,915 words needed for remaining 100.0% coverage
+ - **Zipf Compliance:** R²=0.9865 indicates excellent adherence to Zipf's law
+ - **High Frequency Dominance:** Top 100 words cover 63.1% of corpus
+ - **Long Tail:** -8,082 words needed for remaining 100.0% coverage

  ---
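The Zipf coefficient and R² above can be re-derived with an ordinary least-squares fit of log frequency against log rank. A sketch against the vocabulary parquet in this commit; the column names (`word`, `frequency`) are assumptions:

```python
# Fit Zipf's law: log f ≈ -s * log r + c, then report s and R².
# Column names are assumed; only the file path comes from this repo.
import numpy as np
import pandas as pd

df = pd.read_parquet("models/vocabulary/ch_vocabulary.parquet")
freq = np.sort(df["frequency"].to_numpy(dtype=float))[::-1]
rank = np.arange(1, len(freq) + 1)

x, y = np.log(rank), np.log(freq)
slope, intercept = np.polyfit(x, y, 1)     # slope is -s for a Zipfian corpus

residuals = y - (slope * x + intercept)
r2 = 1.0 - residuals.var() / y.var()
print(f"Zipf coefficient ≈ {-slope:.4f}, R² ≈ {r2:.6f}")
```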
  ## 5. Word Embeddings Evaluation
@@ -306,24 +376,108 @@ Below are text samples generated from each Markov chain model:

  ![t-SNE Sentences](visualizations/tsne_sentences.png)

- ### Model Comparison
+ ### 5.1 Cross-Lingual Alignment
+
+ > *Note: Multilingual alignment visualization not available for this language.*
+
+ ### 5.2 Model Comparison

- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
- |-------|------------|-----------|----------|----------|----------|
- | **mono_32d** | 546 | 32 | 1.807 | 0.436 | 0.0314 🏆 |
- | **mono_64d** | 546 | 64 | 1.773 | 0.400 | 0.0096 |
- | **mono_128d** | 546 | 128 | 1.779 | 0.401 | 0.0020 |
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
+ |-------|-----------|----------|------------------|---------------|----------------|
+ | **mono_32d** | 32 | 0.0518 🏆 | 0.6801 | N/A | N/A |
+ | **mono_64d** | 64 | 0.0071 | 0.8792 | N/A | N/A |
+ | **mono_128d** | 128 | 0.0017 | 0.8741 | N/A | N/A |

  ### Key Findings

- - **Best Isotropy:** mono_32d with 0.0314 (more uniform distribution)
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
- - **Vocabulary Coverage:** All models cover 546 words
- - **Recommendation:** 100d for balanced semantic capture and efficiency
+ - **Best Isotropy:** mono_32d with 0.0518 (more uniform distribution)
+ - **Semantic Density:** Average pairwise similarity of 0.8111. Lower values indicate better semantic separation.
+ - **Alignment Quality:** No aligned models evaluated in this run.
+ - **Recommendation:** 128d aligned for best cross-lingual performance
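Isotropy can be estimated several ways; one common proxy is the ratio of the smallest to the largest principal variance of the centered embedding matrix (1.0 would be perfectly isotropic). The pipeline's exact definition is not documented here, so treat this as illustrative; reading the `.bin` files with gensim's fastText loader is also an assumption about their format:

```python
# Isotropy proxy: min/max eigenvalue ratio of the embedding covariance.
# Both the metric definition and the fastText-format assumption are ours.
import numpy as np
from gensim.models.fasttext import load_facebook_vectors

vecs = load_facebook_vectors("models/embeddings/monolingual/ch_32d.bin")
W = vecs.vectors - vecs.vectors.mean(axis=0)   # center the matrix first

eigvals = np.linalg.eigvalsh(np.cov(W.T))      # principal variances (32 of them)
print(f"isotropy proxy: {eigvals.min() / eigvals.max():.4f}")
```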
+
+ ---
+ ## 6. Morphological Analysis (Experimental)
+
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
+
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
+
+ ### 6.1 Productivity & Complexity
+
+ | Metric | Value | Interpretation | Recommendation |
+ |--------|-------|----------------|----------------|
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
+
+ ### 6.2 Affix Inventory (Productive Units)
+
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
+
+ #### Productive Prefixes
+ | Prefix | Examples |
+ |--------|----------|
+ | `-ma` | matematika, matai, mansen |
+
+ #### Productive Suffixes
+ | Suffix | Examples |
+ |--------|----------|
+ | `-a` | matematika, kana, bånda |
+ | `-n` | sanhayan, monhayan, fan |
+ | `-on` | organisasion, aplikasion, adelanton |
+ | `-an` | sanhayan, monhayan, fan |
+ | `-ia` | bibliografia, termania, indonesia |
+ | `-ion` | organisasion, aplikasion, atministrasion |
+
+ ### 6.3 Bound Stems (Lexical Roots)
+
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
+
+ *No significant bound stems detected.*
+
+ ### 6.4 Affix Compatibility (Co-occurrence)
+
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
+
+ | Prefix | Suffix | Frequency | Examples |
+ |--------|--------|-----------|----------|
+ | `-ma` | `-a` | 17 words | matematika, manfa |
+ | `-ma` | `-n` | 13 words | mansen, mandarin |
+ | `-ma` | `-an` | 6 words | manguayan, man |
+ | `-ma` | `-on` | 4 words | matutuhon, madison |
+ | `-ma` | `-ia` | 1 words | malaysia, maria |
+
+ ### 6.5 Recursive Morpheme Segmentation
+
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
+
+ | Word | Suggested Split | Confidence | Stem |
+ |------|-----------------|------------|------|
+ | makonsidera | **`ma-konsidera`** | 4.5 | `konsidera` |
+ | matutuhon | **`ma-tutuh-on`** | 3.0 | `tutuh` |
+ | manguayan | **`ma-nguay-an`** | 3.0 | `nguay` |
+ | pennsylvania | **`pennsylv-an-ia`** | 3.0 | `pennsylv` |
+ | manmatutuhon | **`ma-nmatutuh-on`** | 3.0 | `nmatutuh` |
+ | machulijan | **`ma-chulij-an`** | 3.0 | `chulij` |
+ | manofisinan | **`ma-nofisin-an`** | 3.0 | `nofisin` |
+ | masasangan | **`ma-sasang-an`** | 3.0 | `sasang` |
+ | matematika | **`ma-tematika`** | 1.5 | `tematika` |
+ | organisasion | **`organisas-ion`** | 1.5 | `organisas` |
+ | aplikasion | **`aplikas-ion`** | 1.5 | `aplikas` |
+ | adelanton | **`adelant-on`** | 1.5 | `adelant` |
+ | bibliografia | **`bibliograf-ia`** | 1.5 | `bibliograf` |
+ | manamerikanu | **`ma-namerikanu`** | 1.5 | `namerikanu` |
+ | indonesia | **`indones-ia`** | 1.5 | `indones` |
+
+ ### 6.6 Linguistic Interpretation
+
+ > **Automated Insight:**
+ The language CH appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
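The substitutability test described in 6.2 can be approximated in a few lines: treat a character sequence as a productive suffix when stripping it leaves a stem that is itself attested in the vocabulary. Suffix lengths, the minimum-stem guard, and the `word` column name below are assumptions, not the pipeline's actual parameters:

```python
# Toy version of the suffix-substitutability heuristic from section 6.2.
# Probe lengths, the stem-length guard, and the column name are assumptions.
from collections import defaultdict
import pandas as pd

words = set(pd.read_parquet("models/vocabulary/ch_vocabulary.parquet")["word"])

suffixes = defaultdict(list)
for w in words:
    for k in (1, 2, 3):                    # probe 1-3 character suffixes
        if len(w) > k + 2 and w[:-k] in words:
            suffixes[w[-k:]].append(w)     # stripping leaves an attested stem

for sfx, examples in sorted(suffixes.items(), key=lambda kv: -len(kv[1]))[:6]:
    print(f"-{sfx}: {len(examples)} words, e.g. {examples[:3]}")
```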

  ---
- ## 6. Summary & Recommendations
+ ## 7. Summary & Recommendations

  ![Performance Dashboard](visualizations/performance_dashboard.png)

@@ -331,11 +485,12 @@ Below are text samples generated from each Markov chain model:

  | Component | Recommended | Rationale |
  |-----------|-------------|-----------|
- | Tokenizer | **32k BPE** | Best compression (4.11x) with low UNK rate |
- | N-gram | **5-gram** | Lowest perplexity (262) |
- | Markov | **Context-4** | Highest predictability (96.2%) |
+ | Tokenizer | **16k BPE** | Best compression (4.24x) |
+ | N-gram | **3-gram** | Lowest perplexity (134) |
+ | Markov | **Context-4** | Highest predictability (97.9%) |
  | Embeddings | **100d** | Balanced semantic capture and isotropy |

+
  ---
  ## Appendix: Metrics Glossary & Interpretation Guide

@@ -525,7 +680,8 @@ If you use these models in your research, please cite:
  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
- publisher = {HuggingFace},
+ doi = {10.5281/zenodo.18073153},
+ publisher = {Zenodo},
  url = {https://huggingface.co/wikilangs}
  institution = {Omneity Labs}
  }
@@ -541,7 +697,8 @@ MIT License - Free for academic and commercial use.
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
  ---
  *Generated by Wikilangs Models Pipeline*

- *Report Date: 2025-12-28 22:40:52*
+ *Report Date: 2026-01-03 10:06:23*
models/embeddings/monolingual/ch_128d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c70139eede0356c2aeb600acc0358b3d5553ac4276060a02b299bcffe2777ed7
- size 1024568094
+ oid sha256:0df9cc84b978b1e115c6cee71e84a836b069e938ed89ba0990d33fb6e76d469f
+ size 1024535659
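Every binary diff in this commit is a Git LFS pointer file: three `key value` lines recording the spec version, content hash, and byte size. A tiny parser, handy for checking which oid a checkout actually references (the helper name is ours):

```python
# Parse a Git LFS pointer file into its oid and size fields.
from pathlib import Path

def read_lfs_pointer(path):
    lines = Path(path).read_text().splitlines()
    fields = dict(line.split(" ", 1) for line in lines if line)
    return fields["oid"], int(fields["size"])

oid, size = read_lfs_pointer("models/embeddings/monolingual/ch_128d.bin")
print(oid, f"{size / 1e6:.0f} MB")
```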
models/embeddings/monolingual/ch_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 128,
  "version": "monolingual",
  "training_params": {
- "dim": 128,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 128
  },
- "vocab_size": 546
+ "vocab_size": 515
  }
models/embeddings/monolingual/ch_32d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f057607a3b0833c3df0ee72e3fe1c4413a523a5dd0469280f13475660f999d12
- size 256148766
+ oid sha256:5ddf21da0a6cf2c7c20309473f6133a07aa8e95d376e57c53cba79dae0c66d67
+ size 256140139
models/embeddings/monolingual/ch_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 32,
  "version": "monolingual",
  "training_params": {
- "dim": 32,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 32
  },
- "vocab_size": 546
+ "vocab_size": 515
  }
models/embeddings/monolingual/ch_64d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:69ee5065092313be64cc01d2936ed7d6e432a293090c7622e103b54dd6831728
- size 512288542
+ oid sha256:765adc5bcc8f79ac3c37ad1e8f88947431aa2cfc778fa7b216fa71374b23bb88
+ size 512271979
models/embeddings/monolingual/ch_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 64,
  "version": "monolingual",
  "training_params": {
- "dim": 64,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 64
  },
- "vocab_size": 546
+ "vocab_size": 515
  }
models/subword_markov/ch_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2d8bcbadf13f8c04b3bf774db4566c21081c309daade48a05ab547c4d55019eb
- size 18449
+ oid sha256:f6da9b7f5e8c16da9d614a2d3c0d921e4219da4002fa71393d3c02e1a9f995e1
+ size 17474
models/subword_markov/ch_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 1,
  "variant": "subword",
  "language": "ch",
- "unique_contexts": 235,
- "total_transitions": 176029
+ "unique_contexts": 226,
+ "total_transitions": 151624
  }
models/subword_markov/ch_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:392ca8b7bc544928287f25cc942dc9006c35872f09a7d9b575d39b713087ca72
- size 77639
+ oid sha256:264f038954ff6478301dcba734b78924744419ec3051afb803ea4a1d438a05f0
+ size 67586
models/subword_markov/ch_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 2,
  "variant": "subword",
  "language": "ch",
- "unique_contexts": 2042,
- "total_transitions": 175418
+ "unique_contexts": 1769,
+ "total_transitions": 151064
  }
models/subword_markov/ch_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:25f072530585a2aaaf12cfc35aca7b5b9527eaaf67a243d06f265d95cf78682d
- size 217872
+ oid sha256:84fdaafcb5e52af0cd9e249ff651ea983749cc925557218121f0d7a85653c797
+ size 194392
models/subword_markov/ch_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 3,
  "variant": "subword",
  "language": "ch",
- "unique_contexts": 11105,
- "total_transitions": 174807
+ "unique_contexts": 9336,
+ "total_transitions": 150504
  }
models/subword_markov/ch_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e8e96aca1a4a4e61ba52c2373da77f16710ba861c64b8653015fe9cb4c3d6b24
- size 455667
+ oid sha256:6d71078f4f1468532db5ba657998c8efdbdd211b304e67e2ca77d88635a671f7
+ size 397638
models/subword_markov/ch_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 4,
  "variant": "subword",
  "language": "ch",
- "unique_contexts": 30233,
- "total_transitions": 174196
+ "unique_contexts": 26134,
+ "total_transitions": 149944
  }
models/subword_ngram/ch_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:7ecc6ad2f5caba9c4acc2722951d35812ddfd0748e35945b8eb1019aee27202a
- size 13892
+ oid sha256:17c2d47a218e898e3bac6e60fc2de2f9adcc53c160b68806900ec23c9eefabc1
+ size 12108
models/subword_ngram/ch_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 2,
  "variant": "subword",
  "language": "ch",
- "unique_ngrams": 1028,
- "total_ngrams": 176029
+ "unique_ngrams": 869,
+ "total_ngrams": 151624
  }
models/subword_ngram/ch_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1354a9304538e4be2c8b97bc1d85d3fdb590e5129f6777ec6756e2fc90f0cbb5
- size 57312
+ oid sha256:baced867ea20a6cb5b359ec97882d2ae2f822b1ffec9ac4b650656fab50d1c45
+ size 49528
models/subword_ngram/ch_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 3,
  "variant": "subword",
  "language": "ch",
- "unique_ngrams": 5255,
- "total_ngrams": 175418
+ "unique_ngrams": 4543,
+ "total_ngrams": 151064
  }
models/subword_ngram/ch_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:0a808ff2dd136df8018eda9014fabe65264cfd207f4de524397af0d418d2138b
- size 161401
+ oid sha256:73fe928791c8492066a094c970ee0eadde56b39c4b8aba9c2aa2e34db5083a0a
+ size 140036
models/subword_ngram/ch_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 4,
  "variant": "subword",
  "language": "ch",
- "unique_ngrams": 13909,
- "total_ngrams": 174807
+ "unique_ngrams": 12416,
+ "total_ngrams": 150504
  }
models/tokenizer/ch_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:634be9f783e2a54aba945f22daf1ece4b4429563bdaa5c3384c14fe647a9d98e
- size 497034
+ oid sha256:37f30794a09f73e4b923805fe5be00975376a1e1b79fcb80c07ed2173f7a5575
+ size 494430
models/tokenizer/ch_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See the raw diff.
models/tokenizer/ch_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:bfd71404f2712cc4642d3377318b688e8290037be26719b11d368397ae231246
- size 372129
+ oid sha256:0def86c055e5d9fbbd3798691f18b2b80932bfcd504e36c5288479a36306643b
+ size 372290
models/tokenizer/ch_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See the raw diff.
models/vocabulary/ch_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c46420d7f4bd7e37454cbfb5ab90ff0e34bce723dfd3188face6f244d8bee5dd
- size 35121
+ oid sha256:a8bd347e6037147c0f3dda9c25ece68c2d3cc7280b3681049fe2758a1b89d662
+ size 32241
models/vocabulary/ch_vocabulary_metadata.json CHANGED
@@ -1,15 +1,16 @@
  {
  "language": "ch",
- "vocabulary_size": 2085,
+ "vocabulary_size": 1918,
+ "variant": "full",
  "statistics": {
- "type_token_ratio": 0.20342884628598915,
+ "type_token_ratio": 0.20998403163257548,
  "coverage": {
- "top_100": 0.5450766165051879,
- "top_1000": 0.7851302137016423,
- "top_5000": 0.9683570397856112
+ "top_100": 0.5441031100296555,
+ "top_1000": 0.7878488327883811,
+ "top_5000": 0.9801155805642157
  },
- "hapax_count": 3836,
- "hapax_ratio": 0.6478635365647695,
- "total_documents": 611
+ "hapax_count": 3605,
+ "hapax_ratio": 0.6527249683143219,
+ "total_documents": 560
  }
  }
models/word_markov/ch_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4e948b8936dfad1c80d82463e9f827ccb45e00e1f39c0278a5f8dcb52f513d0e
- size 160242
+ oid sha256:ecb23d1eded033dd66aa506eb37c0ec91059dbd54574203440cf551687087a4c
+ size 145169
models/word_markov/ch_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 1,
  "variant": "word",
  "language": "ch",
- "unique_contexts": 5950,
- "total_transitions": 37722
+ "unique_contexts": 5466,
+ "total_transitions": 25742
  }
models/word_markov/ch_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:52139330dac4acad3c457e2ab79378f0526509cd5e471084d6880cab6e60a458
- size 294029
+ oid sha256:dbdad6f3fbcd31c9b068772bab399ee602476f2fe71c046b10324a8b7427536b
+ size 250356
models/word_markov/ch_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 2,
  "variant": "word",
  "language": "ch",
- "unique_contexts": 16706,
- "total_transitions": 37111
+ "unique_contexts": 14200,
+ "total_transitions": 25182
  }
models/word_markov/ch_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:0c9e6d5b859f81c57a9f99099ccded171647e39eae754557075bdfd4a337ed26
- size 398195
+ oid sha256:572c89e60a48ebaada6351a255615f4f80710d2cd65ea6e1704a8dd38f22a20b
+ size 326642
models/word_markov/ch_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 3,
  "variant": "word",
  "language": "ch",
- "unique_contexts": 23561,
- "total_transitions": 36500
+ "unique_contexts": 18551,
+ "total_transitions": 24622
  }
models/word_markov/ch_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:66d2fbe2c638e09991efa987884e37c9623e4b6a995ba9f88712a09302a8c56b
- size 464553
+ oid sha256:606b33292fc4ebbcf6b2a243c7db6ada185bf0a2de7d093a32aa9618b83732a1
+ size 368167
models/word_markov/ch_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 4,
  "variant": "word",
  "language": "ch",
- "unique_contexts": 26653,
- "total_transitions": 35890
+ "unique_contexts": 19968,
+ "total_transitions": 24062
  }
models/word_ngram/ch_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b1a487161d98fe0389e02719d211ce48eea94d42d9c8de5fcd3f2c29a261a1e9
- size 12757
+ oid sha256:161dc1b33e0eb3fb7d303774e9bbe1b5f610197221a7e47a3b279e42dd928ae1
+ size 8850
models/word_ngram/ch_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 2,
  "variant": "word",
  "language": "ch",
- "unique_ngrams": 836,
- "total_ngrams": 37722
+ "unique_ngrams": 496,
+ "total_ngrams": 25742
  }
models/word_ngram/ch_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e4c5275f134f6244b6e79c0d9cc634daa8b21a9995652d2d33b6ab9e77c23e6c
- size 19693
+ oid sha256:4b81b49fa5eaf988fdfd9f300b5852b982669afeeaca4bff1df92df7c695214b
+ size 11291
models/word_ngram/ch_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 3,
  "variant": "word",
  "language": "ch",
- "unique_ngrams": 1276,
- "total_ngrams": 37111
+ "unique_ngrams": 582,
+ "total_ngrams": 25182
  }
models/word_ngram/ch_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:0c4f43fda886e653a04960a529dd8aa9a67720db0872676fb2081c04991bb158
- size 30712
+ oid sha256:42bd91c300ea148b93c9cebb2e6c418c2e211238f62f60aa93ceda683fc22ac9
+ size 16076
models/word_ngram/ch_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 4,
  "variant": "word",
  "language": "ch",
- "unique_ngrams": 1937,
- "total_ngrams": 36500
+ "unique_ngrams": 842,
+ "total_ngrams": 24622
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details (before)

  • SHA256: c749af711cc9118d7d7fc336c958cf4fc1c464e455e8da73c1a50032c78cba58
  • Pointer size: 131 Bytes
  • Size of remote file: 156 kB

Git LFS Details (after)

  • SHA256: d0e67eecdc5f11ee88daab5c9e67aa55749de8d20bd9f889defb2b47b53ef997
  • Pointer size: 131 Bytes
  • Size of remote file: 156 kB

visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED
visualizations/model_sizes.png CHANGED
visualizations/ngram_coverage.png CHANGED
visualizations/ngram_entropy.png CHANGED