A compact, efficient tokenizer for the Nexal language with morphological segmentation for NLP pipelines
Get started with the Nexal tokenizer in minutes
Open a terminal and navigate to the folder containing nexal_tokenizer.py
python3 -i
Run these commands in the Python REPL:
from nexal_tokenizer import NexalTokenizer, demo
# Run the bundled demo
demo()
# Or use directly
t = NexalTokenizer()
print(t.tokenize("i-sel gol dom-o"))
print(t.tokenize("in-ru-pe-ri"))
print(t.tokenize("u-tar-n!"))
print(t.tokenize("a-zi fu book-o ti"))
Run the demo directly from your shell:
python3 -c "from nexal_tokenizer import demo; demo()"
Different ways to integrate the tokenizer into your workflow
Create a test script to tokenize multiple examples
from nexal_tokenizer import NexalTokenizer
t = NexalTokenizer()
examples = [
"i-sel gol dom-o",
"in-ru-pe-ri",
"u-tar-n!",
"a-zi fu book-o ti",
"mi-ver la i-lum-ta"
]
for s in examples:
print("S:", s)
print("T:", t.tokenize(s))
Create automated tests to ensure tokenizer correctness
import unittest
from nexal_tokenizer import NexalTokenizer
class TestNexalTokenize(unittest.TestCase):
    def setUp(self):
        self.t = NexalTokenizer()

    def test_basic(self):
        self.assertEqual(self.t.tokenize("i-sel gol dom-o"),
                         ['i-', 'sel', 'gol', 'dom', '-o'])
        self.assertEqual(self.t.tokenize("in-ru-pe-ri"),
                         ['in-', 'ru', '-pe', '-ri'])

if __name__ == '__main__':
    unittest.main()
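Save the tests in a file such as test_nexal_tokenizer.py (the filename is just a suggestion) and run them with python3 test_nexal_tokenizer.py.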
Tokenize entire files for NLP pipeline integration
from nexal_tokenizer import NexalTokenizer
t = NexalTokenizer()
with open('nexal_lines.txt', 'r') as fin, \
     open('nexal_tokens.txt', 'w') as fout:
    for line in fin:
        # Strip the trailing newline before tokenizing
        tokens = t.tokenize(line.strip())
        fout.write(' '.join(tokens) + '\n')
How to use the tokenizer in your machine learning workflows
The tokenizer separates person prefixes, root chunks, and suffix tokens, making morphological segmentation explicit for downstream models.
Use the output as training data for SentencePiece or Hugging Face tokenizers: the explicit segmentation gives subword algorithms such as BPE a clean pre-tokenization to build on.
Either pass the pre-tokenized output to a Hugging Face tokenizer with is_split_into_words=True, or let a subword model learn directly from the space-separated tokens our tokenizer writes out.
# Example: Training a subword tokenizer with Nexal output
from tokenizers import BertWordPieceTokenizer
# Train a WordPiece vocabulary on the tokenized Nexal data
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=["nexal_tokens.txt"], vocab_size=5000)

# Or feed pre-tokenized input to an existing model tokenizer
from transformers import AutoTokenizer
from nexal_tokenizer import NexalTokenizer

nexal_tokenizer = NexalTokenizer()
model_tokenizer = AutoTokenizer.from_pretrained("your-model")
tokens = nexal_tokenizer.tokenize("i-sel gol dom-o")
encoded = model_tokenizer(tokens, is_split_into_words=True)
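For the SentencePiece route, here is a minimal sketch, assuming the sentencepiece package is installed and nexal_tokens.txt is the file produced by the file-tokenization example above (the nexal_sp prefix, vocab size, and sample string are placeholders):
import sentencepiece as spm
# Train a small BPE model on the space-separated Nexal tokens
spm.SentencePieceTrainer.train(
    input="nexal_tokens.txt",
    model_prefix="nexal_sp",   # writes nexal_sp.model and nexal_sp.vocab
    vocab_size=2000,
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="nexal_sp.model")
print(sp.encode("i- sel gol dom -o", out_type=str))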
Common issues and how to resolve them
Issue: ModuleNotFoundError: No module named 'nexal_tokenizer'
Solution: Ensure nexal_tokenizer.py is in your current directory or on PYTHONPATH.
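If the file lives elsewhere, one option is to add its folder to sys.path before importing (the path below is a placeholder):
import sys
sys.path.append("/path/to/folder/containing/nexal_tokenizer")  # placeholder path
from nexal_tokenizer import NexalTokenizer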
Issue: Tokens not separating as expected
Solution: Check hyphen usage. The tokenizer treats pe and -pe differently by design.
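A quick way to see the difference in the REPL (the space-separated case is illustrative; only the hyphenated output appears in the unit tests above):
from nexal_tokenizer import NexalTokenizer
t = NexalTokenizer()
print(t.tokenize("in-ru-pe-ri"))  # hyphenated: yields '-pe' as an attached morpheme
print(t.tokenize("in ru pe ri"))  # space-separated: 'pe' should come out as a standalone token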
Issue: Uppercase characters causing issues
Solution: Use tokenize(text, normalize=True) to lowercase and normalize input.
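For example, assuming the normalize flag described above:
from nexal_tokenizer import NexalTokenizer
t = NexalTokenizer()
print(t.tokenize("I-Sel GOL Dom-o", normalize=True))  # input is lowercased before segmentation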