A compact, efficient tokenizer for the Nexal language with morphological segmentation for NLP pipelines
Get started with the Nexal tokenizer in minutes
Open a terminal and navigate to the folder containing nexal_tokenizer.py
python3 -i
Run these commands in the Python REPL:
from nexal_tokenizer import NexalTokenizer, demo
# Run the bundled demo
demo()
# Or use directly
t = NexalTokenizer()
print(t.tokenize("i-sel gol dom-o"))
print(t.tokenize("in-ru-pe-ri"))
print(t.tokenize("u-tar-n!"))
print(t.tokenize("a-zi fu book-o ti"))
Run the demo directly from your shell:
python3 -c "from nexal_tokenizer import demo; demo()"
Different ways to integrate the tokenizer into your workflow
Create a test script to tokenize multiple examples
from nexal_tokenizer import NexalTokenizer
t = NexalTokenizer()
examples = [
"i-sel gol dom-o",
"in-ru-pe-ri",
"u-tar-n!",
"a-zi fu book-o ti",
"mi-ver la i-lum-ta"
]
for s in examples:
print("S:", s)
print("T:", t.tokenize(s))
Create automated tests to ensure tokenizer correctness
import unittest
from nexal_tokenizer import NexalTokenizer
class TestNexalTokenize(unittest.TestCase):
    def setUp(self):
        self.t = NexalTokenizer()

    def test_basic(self):
        self.assertEqual(self.t.tokenize("i-sel gol dom-o"),
                         ['i-', 'sel', 'gol', 'dom', '-o'])
        self.assertEqual(self.t.tokenize("in-ru-pe-ri"),
                         ['in-', 'ru', '-pe', '-ri'])

if __name__ == '__main__':
    unittest.main()
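Save the tests in a file such as test_nexal_tokenizer.py (the filename is just a suggestion) and run them with python3 test_nexal_tokenizer.py.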
Tokenize entire files for NLP pipeline integration
from nexal_tokenizer import NexalTokenizer
t = NexalTokenizer()
with open('nexal_lines.txt', 'r') as fin, \
     open('nexal_tokens.txt', 'w') as fout:
    for line in fin:
        # Strip the trailing newline before tokenizing
        tokens = t.tokenize(line.strip())
        fout.write(' '.join(tokens) + '\n')
How to use the tokenizer in your machine learning workflows
The tokenizer separates person prefixes, root chunks, and suffix tokens, making morphological segmentation explicit for downstream models.
Use the output as training data for SentencePiece or Hugging Face tokenizers: the explicit segmentation gives subword algorithms such as BPE a clean pre-tokenization to build on.
Either pass the pre-tokenized output to a Hugging Face tokenizer with is_split_into_words=True, or let a subword model learn directly from the space-separated tokens our tokenizer writes out.
# Example: Training a subword tokenizer with Nexal output
from tokenizers import BertWordPieceTokenizer
# Train a WordPiece vocabulary on the tokenized Nexal data
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=["nexal_tokens.txt"], vocab_size=5000)

# Or feed pre-tokenized input to an existing model tokenizer
from transformers import AutoTokenizer
from nexal_tokenizer import NexalTokenizer

nexal_tokenizer = NexalTokenizer()
model_tokenizer = AutoTokenizer.from_pretrained("your-model")
tokens = nexal_tokenizer.tokenize("i-sel gol dom-o")
encoded = model_tokenizer(tokens, is_split_into_words=True)
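For the SentencePiece route, here is a minimal sketch, assuming the sentencepiece package is installed and nexal_tokens.txt is the file produced by the file-tokenization example above (the nexal_sp prefix, vocab size, and sample string are placeholders):
import sentencepiece as spm
# Train a small BPE model on the space-separated Nexal tokens
spm.SentencePieceTrainer.train(
    input="nexal_tokens.txt",
    model_prefix="nexal_sp",   # writes nexal_sp.model and nexal_sp.vocab
    vocab_size=2000,
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="nexal_sp.model")
print(sp.encode("i- sel gol dom -o", out_type=str))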
Common issues and how to resolve them
Issue: ModuleNotFoundError: No module named 'nexal_tokenizer'
Solution: Ensure nexal_tokenizer.py is in your current directory or on PYTHONPATH.
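If the file lives elsewhere, one option is to add its folder to sys.path before importing (the path below is a placeholder):
import sys
sys.path.append("/path/to/folder/containing/nexal_tokenizer")  # placeholder path
from nexal_tokenizer import NexalTokenizer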
Issue: Tokens not separating as expected
Solution: Check hyphen usage. The tokenizer treats pe and -pe differently by design.
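A quick way to see the difference in the REPL (the space-separated case is illustrative; only the hyphenated output appears in the unit tests above):
from nexal_tokenizer import NexalTokenizer
t = NexalTokenizer()
print(t.tokenize("in-ru-pe-ri"))  # hyphenated: yields '-pe' as an attached morpheme
print(t.tokenize("in ru pe ri"))  # space-separated: 'pe' should come out as a standalone token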
Issue: Uppercase characters causing issues
Solution: Use tokenize(text, normalize=True) to lowercase and normalize input.
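For example, assuming the normalize flag described above:
from nexal_tokenizer import NexalTokenizer
t = NexalTokenizer()
print(t.tokenize("I-Sel GOL Dom-o", normalize=True))  # input is lowercased before segmentation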