Foundation Models · AuraQnizer
Production · v3

Arabic,
represented as it's spoken.

An Arabic tokenizer built from scratch. On Arabic text it needs fewer tokens per word than the tokenizers behind Gemini, GPT-4o, Claude, and Llama 3. The foundation for the coming Arabic-first language model.

1.32× tokens/word Arabic · Beats Gemini by 12% · 64,000 vocab · Open source

How it works

Four pillars. One method.

01
Encoding weighted for Arabic
The architecture treats Arabic characters as first-class Unicode symbols instead of splitting them into multi-byte UTF-8 sequences. Result: 1.32 tokens/word instead of 5.93 with traditional ByteLevel BPE.
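
To make the byte-versus-character point concrete, here is a minimal Python sketch. It only measures how many UTF-8 bytes each character occupies and defines the tokens/word metric used on this page. The helper names and sample sentences are illustrative assumptions, not part of AuraQnizer.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per non-space character."""
    chars = [c for c in text if not c.isspace()]
    return sum(len(c.encode("utf-8")) for c in chars) / len(chars)


def tokens_per_word(tokens: list[str], text: str) -> float:
    """Tokens emitted divided by whitespace-separated words (lower is better)."""
    return len(tokens) / len(text.split())


arabic = "اللغة العربية جميلة"   # illustrative sample sentence
english = "Arabic is beautiful"  # illustrative sample sentence

print(utf8_bytes_per_char(arabic))   # 2.0: each Arabic letter takes two bytes
print(utf8_bytes_per_char(english))  # 1.0: each ASCII letter takes one byte
```

A tokenizer that starts from raw bytes has to merge twice as many units to cover the same Arabic word, which is one plausible reason byte-level vocabularies trained mostly on English spend more tokens on Arabic.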
02
Battle numbers, with receipts
A real benchmark on Arabic and English sentences. Tokens per word on Arabic: AuraQnizer 1.32, Gemini 1.50, GPT-4o 1.70, Claude 1.80, Llama 3 2.10. Lower is better.
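
The "beats Gemini by 12%" figure follows from these scores. A short Python check, using only the numbers quoted on this page (the function name is an illustrative choice):

```python
# Tokens per word on Arabic, as quoted on this page.
scores = {
    "Gemini": 1.50,
    "GPT-4o": 1.70,
    "Claude": 1.80,
    "Llama 3": 2.10,
}
AURAQNIZER = 1.32


def relative_reduction(ours: float, theirs: float) -> float:
    """Fraction of tokens saved relative to the other tokenizer."""
    return (theirs - ours) / theirs


for name, score in scores.items():
    print(f"{name}: {relative_reduction(AURAQNIZER, score):.1%} fewer tokens")
# Gemini: 12.0% fewer tokens
# GPT-4o: 22.4% fewer tokens
# Claude: 26.7% fewer tokens
# Llama 3: 37.1% fewer tokens
```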
03
Zero <unk>
Built-in byte fallback: any character, glyph, or code symbol that is missing from the vocabulary is encoded as its raw bytes, so the tokenizer never emits <unk>. Any Unicode is covered.
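
A minimal Python sketch of the byte-fallback idea, assuming the `<0xNN>` byte-piece convention that SentencePiece uses. The toy vocabulary and function names are hypothetical and say nothing about AuraQnizer's internals.

```python
def encode_with_byte_fallback(pieces: list[str], vocab: set[str]) -> list[str]:
    """Keep known pieces; spell unknown ones out as <0xNN> byte tokens."""
    out = []
    for piece in pieces:
        if piece in vocab:
            out.append(piece)
        else:
            out.extend(f"<0x{b:02X}>" for b in piece.encode("utf-8"))
    return out


def decode(tokens: list[str]) -> str:
    """Reassemble text, turning runs of byte tokens back into characters."""
    raw = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            raw.append(int(tok[3:5], 16))
        else:
            raw.extend(tok.encode("utf-8"))
    return raw.decode("utf-8")


vocab = {"سلام"}  # hypothetical toy vocabulary
tokens = encode_with_byte_fallback(["سلام", "🙂"], vocab)
print(tokens)          # ['سلام', '<0xF0>', '<0x9F>', '<0x99>', '<0x82>']
print(decode(tokens))  # سلام🙂
```

With 256 byte pieces reserved in the vocabulary, every possible input has some encoding, which is what makes an unknown token unnecessary.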
04
Foundation for the next model
AuraQnizer powers AuraBitNet — a 4B Arabic model with ternary weights, in training on a BitNet architecture optimized for low-memory deployment.
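
For a rough sense of what ternary weights mean for memory, here is a back-of-the-envelope Python estimate. It assumes "4B" means 4 billion parameters, counts weights only (no activations, embeddings kept at the same precision, no packing overhead), and treats log2(3) bits as the theoretical minimum per ternary weight. None of these figures come from AuraBitNet itself.

```python
import math


def weight_memory_gb(n_params: int, bits_per_weight: float) -> float:
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 4_000_000_000  # assumption: "4B" = 4 billion parameters

fp16 = weight_memory_gb(N_PARAMS, 16)               # 16-bit floats
ternary = weight_memory_gb(N_PARAMS, math.log2(3))  # ideal ternary packing
two_bit = weight_memory_gb(N_PARAMS, 2)             # simple 2-bit packing

print(f"fp16:    {fp16:.2f} GB")     # fp16:    8.00 GB
print(f"ternary: {ternary:.2f} GB")  # ternary: 0.79 GB
print(f"2-bit:   {two_bit:.2f} GB")  # 2-bit:   1.00 GB
```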
By the numbers

The numbers, without exaggeration.

Arabic
1.32×
tokens/word — lower is better
English
1.05×
Beats GPT-4o (1.10×)
Vocabulary
64,000
SentencePiece Unigram
Training data
1 GB+
Real Arabic + English

A language deserves an encoding of its size.
