An Arabic tokenizer built from scratch that needs fewer tokens per Arabic word than the tokenizers behind Gemini, GPT-4o, Claude, and Llama 3. The foundation for the coming Arabic-first language model.
1.32 tokens/word on Arabic
Beats Gemini by 12%
64,000 vocab
Open source
How it works
Four pillars. One method.
01
Encoding weighted for Arabic
The architecture treats Arabic characters as first-class Unicode units instead of breaking them into multi-byte UTF-8 sequences. Result: 1.32 tokens/word, versus 5.93 with traditional ByteLevel BPE.
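To see why byte-level handling is expensive for Arabic, here is a minimal sketch in plain Python. It does not use AuraQnizer itself; it only counts code points and UTF-8 bytes, and the helper name `utf8_cost` is made up for illustration.

```python
def utf8_cost(word: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for a word."""
    return len(word), len(word.encode("utf-8"))


# "marhaba" (hello) written with escapes: meem, reh, hah, beh, alef
arabic = "\u0645\u0631\u062d\u0628\u0627"
english = "hello"

arabic_cost = utf8_cost(arabic)
english_cost = utf8_cost(english)

print(arabic_cost)   # (5, 10): every Arabic letter takes 2 bytes in UTF-8
print(english_cost)  # (5, 5): ASCII letters take 1 byte each
```

A tokenizer that starts from bytes has twice as many units to merge for the Arabic word as for the English one, which is one plausible reason byte-level vocabularies trained mostly on English end up with high Arabic token counts.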
02
Battle numbers, with receipts
A real benchmark on Arabic and English sentences, measured in tokens per word: AuraQnizer 1.32, Gemini 1.50, GPT-4o 1.70, Claude 1.80, Llama 3 2.10. Lower is better.
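For readers who want to reproduce this kind of number, the sketch below shows one way to compute tokens per word and the relative saving between two tokenizers. It assumes words are split on whitespace, which may differ from how the benchmark counted them; `tokens_per_word` and `relative_saving` are illustrative names, and the figures in `reported` are simply the ones quoted above.

```python
from typing import Callable, Iterable


def tokens_per_word(sentences: Iterable[str], tokenize: Callable[[str], list]) -> float:
    """Average tokens per whitespace-separated word over a corpus."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(tokenize(sentence))
        total_words += len(sentence.split())
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def relative_saving(ours: float, theirs: float) -> float:
    """Fraction of tokens saved relative to the other tokenizer."""
    return (theirs - ours) / theirs


# Toy check: a whitespace "tokenizer" yields exactly one token per word.
toy = tokens_per_word(["a b c", "d e"], str.split)

# Figures quoted in the benchmark above.
reported = {"AuraQnizer": 1.32, "Gemini": 1.50, "GPT-4o": 1.70, "Claude": 1.80, "Llama 3": 2.10}
saving_vs_gemini = relative_saving(reported["AuraQnizer"], reported["Gemini"])
print(f"{saving_vs_gemini:.0%}")  # 12%
```

Any real tokenizer can be plugged in by passing a function that maps a string to a list of tokens.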
03
Zero <unk>
Built-in byte fallback: anything outside the vocabulary is encoded as raw bytes, so it never fails on a character, glyph, or code symbol. Any Unicode: covered.
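The idea behind byte fallback can be sketched with a toy encoder. This is not AuraQnizer's implementation: the greedy longest-match strategy, the two-entry `pieces` vocabulary, and the choice to reserve ids 0 to 255 for raw bytes are all assumptions made for illustration.

```python
BYTE_TOKENS = 256  # assumption: ids 0..255 are reserved for raw byte values


def encode(text: str, pieces: dict[str, int]) -> list[int]:
    """Greedy longest-match encoding; unknown characters fall back to UTF-8 bytes."""
    ids: list[int] = []
    max_len = max((len(p) for p in pieces), default=0)
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in pieces:
                ids.append(pieces[piece])
                i += size
                break
        else:
            # No vocabulary piece matched: emit one token per UTF-8 byte.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids


def decode(ids: list[int], pieces: dict[str, int]) -> str:
    """Invert encode() by concatenating bytes and decoding once at the end."""
    inverse = {token_id: piece for piece, token_id in pieces.items()}
    out = bytearray()
    for token_id in ids:
        if token_id < BYTE_TOKENS:
            out.append(token_id)
        else:
            out.extend(inverse[token_id].encode("utf-8"))
    return out.decode("utf-8")


# Toy vocabulary: the word "salam" and a space. The emoji is not in it.
pieces = {"\u0633\u0644\u0627\u0645": 256, " ": 257}
text = "\u0633\u0644\u0627\u0645 \U0001F642"

ids = encode(text, pieces)
print(ids)  # [256, 257, 240, 159, 153, 130]
print(decode(ids, pieces) == text)  # True
```

The emoji costs four byte tokens instead of one, but it round-trips exactly, which is the property that makes an `<unk>` token unnecessary.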
04
Foundation for the next model
AuraQnizer powers AuraBitNet, a 4B Arabic model with ternary weights, currently in training on a BitNet architecture optimized for low-memory deployment.
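A back-of-envelope calculation shows why ternary weights matter for low-memory deployment. The sketch assumes "4B" means 4 billion parameters and that every parameter is ternary. Real models typically keep some layers in higher precision, and activations and caches are ignored, so treat the results as rough lower bounds rather than AuraBitNet's actual footprint.

```python
import math


def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 4e9  # assumption: "4B" read as 4 billion parameters

fp16_gb = weight_memory_gb(N_PARAMS, 16)               # 16-bit baseline
packed_gb = weight_memory_gb(N_PARAMS, 2)              # ternary stored in 2 bits
ideal_gb = weight_memory_gb(N_PARAMS, math.log2(3))    # information-theoretic limit

print(fp16_gb)             # 8.0
print(packed_gb)           # 1.0
print(round(ideal_gb, 2))  # 0.79
```

A ternary weight takes one of three values, so it carries about 1.58 bits of information; simple 2-bit packing already cuts weight storage to one eighth of 16-bit floats.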