Nexal Downloads Center

Access all Nexal language resources, datasets, and tools to accelerate your AI language projects


Nexal Lexicon & Corpora

Complete vocabulary and language datasets for training and development

Nexal Lexicon (JSON)

Complete vocabulary with 500 roots, parts of speech, glosses, and notes in JSON format.

Format: JSON
Size: 500 entries
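
As a rough illustration of working with the JSON lexicon, the sketch below loads the file and counts entries per part of speech. It assumes the file is a top-level JSON array of objects and that the part-of-speech field is called `pos`; the filename `nexal_lexicon.json` and the field name are assumptions, not something this page confirms.

```python
import json
from collections import Counter
from pathlib import Path


def load_lexicon(path):
    """Load a lexicon assumed to be a top-level JSON array of entry objects."""
    with Path(path).open(encoding="utf-8") as handle:
        entries = json.load(handle)
    if not isinstance(entries, list):
        raise ValueError("expected a top-level JSON array of entries")
    return entries


def pos_counts(entries, pos_key="pos"):
    """Count entries per part of speech; `pos_key` is a hypothetical field name."""
    return Counter(entry.get(pos_key, "unknown") for entry in entries)


# Example usage (hypothetical filename):
#   lexicon = load_lexicon("nexal_lexicon.json")
#   print(len(lexicon), pos_counts(lexicon).most_common())
```

If the downloaded file uses a different layout (for example an object keyed by root), only `load_lexicon` would need adjusting.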

Nexal Lexicon (CSV)

Starter lexicon with 500 roots in CSV format for easy integration with data tools.

Format: CSV
Size: 500 entries
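
For the CSV variant, a minimal sketch using only the standard library might index rows by root. It assumes the file has a header row and that the root column is named `root`; that column name is hypothetical and should be adjusted to match the actual header.

```python
import csv


def load_lexicon_csv(path, root_column="root"):
    """Index CSV rows by root; assumes a header row naming the columns.

    `root_column` is a hypothetical column name, not confirmed by the source.
    """
    index = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            root = (row.get(root_column) or "").strip()
            if root:  # skip rows with a blank or missing root
                index[root] = row
    return index
```

Each value in the returned dictionary is the full row, so other columns remain accessible by their header names.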

Tokenized Corpus

Space-tokenized corpus combining root tokens, gloss tokens, and POS tags, intended for training subword models.

Format: TXT
Size: 500 lines
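
Because the corpus is space-tokenized, a quick sanity check before training is to count token frequencies. The sketch below relies only on whitespace splitting and makes no assumption about how roots, glosses, and POS tags are ordered within a line; the function names are illustrative.

```python
from collections import Counter


def read_corpus(path):
    """Yield non-empty, stripped lines from a UTF-8 text corpus."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def token_frequencies(lines):
    """Count whitespace-separated tokens across an iterable of corpus lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts
```

Calling `token_frequencies(read_corpus(path)).most_common(20)` on the downloaded file would list the most frequent tokens, which is a cheap way to spot tag tokens dominating the distribution.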

Roots Corpus

Roots-only corpus with one root per line for vocabulary-focused training.

Format: TXT
Size: 500 roots

Tokenizer Tools & Scripts

Ready-to-run scripts for building tokenizers with popular NLP frameworks

Hugging Face Tokenizer

Python script to train a Hugging Face-compatible BPE tokenizer from your corpus.

Format: Python
Dependencies: tokenizers, transformers
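
The downloadable script itself is not reproduced here. As a hedged sketch of what training a BPE tokenizer with the `tokenizers` library can look like, the function below trains from an iterable of corpus lines. The special-token names, the default vocabulary size, and the function name are assumptions chosen for illustration, not the script's actual settings.

```python
def train_bpe_tokenizer(lines, vocab_size=1000):
    """Train a BPE tokenizer from an iterable of text lines (illustrative sketch)."""
    # Imported inside the function so this module loads even when the
    # optional `tokenizers` dependency is not installed.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    # Split on whitespace and punctuation before learning merges.
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        # Assumed special tokens; adjust to whatever the target model expects.
        special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
        show_progress=False,
    )
    tokenizer.train_from_iterator(lines, trainer=trainer)
    return tokenizer
```

The result can be written out with `tokenizer.save("tokenizer.json")` and, if needed, wrapped for `transformers` via `PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="[UNK]", pad_token="[PAD]")`. With a corpus of only 500 lines, the learned vocabulary may end up smaller than the requested `vocab_size`.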

SentencePiece Trainer

Python script to train SentencePiece models (BPE/unigram/word/char) for Nexal.

Format: Python
Dependencies: sentencepiece
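
As an illustration of the four model types mentioned above, the following sketch wraps `sentencepiece` training in one function. The function name, the default vocabulary size, and the `character_coverage=1.0` setting are assumptions for this example rather than the downloadable script's actual configuration.

```python
VALID_MODEL_TYPES = {"bpe", "unigram", "word", "char"}


def train_sentencepiece(corpus_path, model_prefix, vocab_size=1000, model_type="bpe"):
    """Train a SentencePiece model and return a loaded processor (illustrative sketch)."""
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(f"model_type must be one of {sorted(VALID_MODEL_TYPES)}")

    # Imported here so argument validation works without the dependency installed.
    import sentencepiece as spm

    # Writes <model_prefix>.model and <model_prefix>.vocab to disk.
    spm.SentencePieceTrainer.train(
        input=str(corpus_path),
        model_prefix=str(model_prefix),
        vocab_size=vocab_size,
        model_type=model_type,
        character_coverage=1.0,  # assumed: keep every character seen in the corpus
    )
    return spm.SentencePieceProcessor(model_file=f"{model_prefix}.model")
```

A call such as `sp = train_sentencepiece("corpus.txt", "nexal_bpe", vocab_size=400)` followed by `sp.encode("some text", out_type=str)` would return the subword pieces; both filenames are placeholders. On a small corpus, SentencePiece may reject a `vocab_size` that is larger than the data can support, so a modest value is a safer starting point.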