Nexal Tokenizer API & Documentation

A compact, efficient tokenizer for the Nexal language with morphological segmentation for NLP pipelines

Quick Start Guide

Get started with the Nexal tokenizer in minutes

1. Sanity Check

Open a terminal, navigate to the folder containing nexal_tokenizer.py, and start an interactive Python session:

Bash
python3 -i

2. Interactive Testing

Run these commands in the Python REPL:

Python
from nexal_tokenizer import NexalTokenizer, demo

# Run the bundled demo
demo()

# Or use the tokenizer directly
t = NexalTokenizer()
print(t.tokenize("i-sel gol dom-o"))
print(t.tokenize("in-ru-pe-ri"))
print(t.tokenize("u-tar-n!"))
print(t.tokenize("a-zi fu book-o ti"))

3. Command Line Demo

Run the demo directly from your shell:

Bash
python3 -c "from nexal_tokenizer import demo; demo()"

Usage Examples

Different ways to integrate the tokenizer into your workflow

Script Usage

Create a test script to tokenize multiple examples

Python
from nexal_tokenizer import NexalTokenizer

t = NexalTokenizer()
examples = [
    "i-sel gol dom-o",
    "in-ru-pe-ri",
    "u-tar-n!",
    "a-zi fu book-o ti",
    "mi-ver la i-lum-ta",
]
for s in examples:
    print("S:", s)
    print("T:", t.tokenize(s))

Unit Testing

Create automated tests to ensure tokenizer correctness

Python
import unittest

from nexal_tokenizer import NexalTokenizer


class TestNexalTokenize(unittest.TestCase):
    def setUp(self):
        self.t = NexalTokenizer()

    def test_basic(self):
        self.assertEqual(self.t.tokenize("i-sel gol dom-o"),
                         ['i-', 'sel', 'gol', 'dom', '-o'])
        self.assertEqual(self.t.tokenize("in-ru-pe-ri"),
                         ['in-', 'ru', '-pe', '-ri'])


if __name__ == '__main__':
    unittest.main()

Batch Processing

Tokenize entire files for NLP pipeline integration

Python
from nexal_tokenizer import NexalTokenizer

t = NexalTokenizer()
with open('nexal_lines.txt', 'r') as fin, \
     open('nexal_tokens.txt', 'w') as fout:
    for line in fin:
        # Strip the trailing newline so it is not passed to the tokenizer
        tokens = t.tokenize(line.strip())
        fout.write(' '.join(tokens) + '\n')

NLP Pipeline Integration

How to use the tokenizer in your machine learning workflows

Morphological Segmentation

The tokenizer separates person prefixes, root chunks, and suffix tokens, making morphological segmentation explicit for downstream models.
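
Because each morpheme type has a distinct surface shape in the output (in the unit test above, prefixes end in a hyphen, suffixes begin with one, and roots are bare), downstream code can recover the morpheme type with a simple check. A minimal sketch, assuming that convention holds for all tokens:

Python
from nexal_tokenizer import NexalTokenizer

def morpheme_type(token):
    # Assumed convention: trailing hyphen = prefix, leading hyphen = suffix
    if token.endswith('-'):
        return 'prefix'
    if token.startswith('-'):
        return 'suffix'
    return 'root'

t = NexalTokenizer()
for tok in t.tokenize("i-sel gol dom-o"):
    print(tok, '->', morpheme_type(tok))
# i-  -> prefix
# sel -> root
# gol -> root
# dom -> root
# -o  -> suffix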

Subword Tokenization

Use the output as training data for SentencePiece or Hugging Face tokenizers. The segmented output provides ideal pre-tokenization for BPE algorithms.
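
For example, a SentencePiece model can be trained directly on the nexal_tokens.txt file produced by the batch-processing script above; the model_prefix and vocab_size values here are illustrative:

Python
import sentencepiece as spm

# Train a BPE model on the space-separated tokenizer output
spm.SentencePieceTrainer.train(
    input='nexal_tokens.txt',
    model_prefix='nexal_sp',  # illustrative output name
    vocab_size=5000,          # illustrative size
    model_type='bpe',
)

# Load the trained model and encode a segmented line
sp = spm.SentencePieceProcessor(model_file='nexal_sp.model')
print(sp.encode('i- sel gol dom -o', out_type=str))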

Hugging Face Integration

Use the pre-tokenized output with is_split_into_words=True or let HF's BPE learn on space-separated tokens from our tokenizer.

Python
# Example: training a subword tokenizer on Nexal output
from tokenizers import BertWordPieceTokenizer

# Train on tokenized Nexal data
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=["nexal_tokens.txt"], vocab_size=5000)

# Or use with pre-tokenized input
from transformers import AutoTokenizer
from nexal_tokenizer import NexalTokenizer

nexal_tokenizer = NexalTokenizer()
model_tokenizer = AutoTokenizer.from_pretrained("your-model")
tokens = nexal_tokenizer.tokenize("i-sel gol dom-o")
encoded = model_tokenizer(tokens, is_split_into_words=True)

Troubleshooting Guide

Common issues and how to resolve them

Module Not Found

Issue: ModuleNotFoundError: No module named 'nexal_tokenizer'

Solution: Ensure nexal_tokenizer.py is in your current directory or on PYTHONPATH.
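
If the file lives in a different folder, you can also add that folder to the search path at runtime (the path below is a placeholder):

Python
import sys

# Placeholder path: the folder that contains nexal_tokenizer.py
sys.path.insert(0, '/path/to/folder')

from nexal_tokenizer import NexalTokenizer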

Unexpected Tokenization

Issue: Tokens not separating as expected

Solution: Check hyphen usage. The tokenizer treats pe and -pe differently by design.
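
The unit test above shows the distinction: in "in-ru-pe-ri", the hyphen before pe makes it come out as the suffix token -pe rather than a bare root:

Python
from nexal_tokenizer import NexalTokenizer

t = NexalTokenizer()
# From the unit test above: hyphen placement determines token type
print(t.tokenize("in-ru-pe-ri"))  # ['in-', 'ru', '-pe', '-ri']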

Case Sensitivity

Issue: Uppercase characters causing issues

Solution: Use tokenize(text, normalize=True) to lowercase and normalize input.
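
A minimal example, assuming normalize=True lowercases the input as described:

Python
from nexal_tokenizer import NexalTokenizer

t = NexalTokenizer()
# Lowercase/normalize before tokenizing
tokens = t.tokenize("I-Sel GOL Dom-O", normalize=True)
print(tokens)  # expected to match t.tokenize("i-sel gol dom-o")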