Arabic,
represented as it's spoken.

An Arabic tokenizer built from scratch — outperforms Gemini, GPT-4o, Claude, and Llama 3 on Arabic. The foundation for the coming Arabic-first language model.

1.32× tokens/word Arabic · Beats Gemini by 12% · 64,000 vocab · Open source

How it works

Four pillars. One method.

01

Encoding tuned for Arabic

The architecture treats Arabic characters as whole Unicode code points, not as multi-byte UTF-8 sequences. Result: 1.32 tokens/word instead of 5.93 with traditional ByteLevel BPE.
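To see why the starting unit matters, here is a minimal sketch (not AuraQnizer's code) that counts how many base units a tokenizer has to merge per word when it starts from code points versus UTF-8 bytes. The helper name `units_per_word` and the sample sentence are illustrative assumptions.

```python
def units_per_word(text: str) -> tuple[float, float]:
    """Return (code points per word, UTF-8 bytes per word) for whitespace-split text."""
    words = text.split()
    code_points = sum(len(w) for w in words)
    utf8_bytes = sum(len(w.encode("utf-8")) for w in words)
    return code_points / len(words), utf8_bytes / len(words)


# Illustrative sample: three Arabic words without diacritics.
sample = "اللغة العربية جميلة"
chars_per_word, bytes_per_word = units_per_word(sample)

# Letters in the basic Arabic block (U+0600 to U+06FF) take two bytes each in UTF-8,
# so a byte-level tokenizer starts with twice as many units to merge.
print(f"code points/word: {chars_per_word:.2f}")
print(f"UTF-8 bytes/word: {bytes_per_word:.2f}")
```

A byte-level vocabulary has to spend merges just to reassemble single letters before it can learn anything about words, which is one plausible reason the byte-level figure quoted above is so much higher.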

02

Battle numbers, with receipts

A real benchmark on Arabic and English sentences, measured in tokens per word: AuraQnizer 1.32 · Gemini 1.50 · GPT-4o 1.70 · Claude 1.80 · Llama 3 2.10. Lower is better.
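The headline "Beats Gemini by 12%" follows directly from these figures. The short sketch below reproduces the arithmetic; the dictionary simply restates the published numbers, and `relative_saving` is an illustrative helper name, not part of any released tool.

```python
# Tokens per word as reported in the benchmark above (lower is better).
TOKENS_PER_WORD = {
    "AuraQnizer": 1.32,
    "Gemini": 1.50,
    "GPT-4o": 1.70,
    "Claude": 1.80,
    "Llama 3": 2.10,
}


def relative_saving(ours: float, theirs: float) -> float:
    """Fraction of tokens saved relative to the other tokenizer."""
    return (theirs - ours) / theirs


ours = TOKENS_PER_WORD["AuraQnizer"]
savings = {
    name: relative_saving(ours, value)
    for name, value in TOKENS_PER_WORD.items()
    if name != "AuraQnizer"
}

for name, saving in savings.items():
    print(f"vs {name}: {saving:.0%} fewer tokens per word")
```

Against Gemini this gives (1.50 − 1.32) / 1.50 = 12%, matching the headline figure.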

03

Zero <unk>

Built-in byte fallback: it never fails on a character, glyph, or code symbol. Any Unicode — covered.
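As a rough illustration of the idea, the toy sketch below shows how byte fallback avoids `<unk>`: any piece missing from the vocabulary is spelled out as its UTF-8 bytes, written in the `<0xNN>` form SentencePiece uses for byte pieces. The tiny vocabulary and both function names are assumptions made for the example, and this is not AuraQnizer's implementation.

```python
def encode_with_byte_fallback(pieces: list[str], vocab: set[str]) -> list[str]:
    """Keep known pieces; spell unknown ones as <0xNN> byte tokens instead of <unk>."""
    tokens: list[str] = []
    for piece in pieces:
        if piece in vocab:
            tokens.append(piece)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in piece.encode("utf-8"))
    return tokens


def decode(tokens: list[str]) -> str:
    """Reassemble text by collecting byte tokens and ordinary pieces into one buffer."""
    buffer = bytearray()
    for token in tokens:
        if len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
            buffer.append(int(token[3:5], 16))
        else:
            buffer.extend(token.encode("utf-8"))
    return buffer.decode("utf-8")


# Hypothetical toy vocabulary: one Arabic word and the space character.
toy_vocab = {"سلام", " "}
pieces = ["سلام", " ", "🙂"]  # the emoji is deliberately out of vocabulary

tokens = encode_with_byte_fallback(pieces, toy_vocab)
print(tokens)
print(decode(tokens) == "".join(pieces))
```

Because every byte value from 0x00 to 0xFF can be given its own token, any valid Unicode text can be encoded and decoded without loss, at the cost of a few extra tokens for rare symbols.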

04

Foundation for the next model

AuraQnizer powers AuraBitNet — a 4B Arabic model with ternary weights, in training on a BitNet architecture optimized for low-memory deployment.
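For readers new to ternary weights, here is a minimal sketch of what the term means: every weight is stored as -1, 0, or +1 plus one shared scale. It follows the absmean scheme published for BitNet b1.58; whether AuraBitNet uses exactly this recipe is an assumption, and `ternarize` is an illustrative name.

```python
def ternarize(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Quantize weights to {-1, 0, +1} using the mean absolute value as the scale."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


# Illustrative weights, not taken from any real model.
weights = [0.9, -0.05, 0.4, -1.2]
quantized, scale = ternarize(weights)

print(quantized)                       # ternary codes
print([q * scale for q in quantized])  # dequantized approximation
```

Three possible values need about 1.58 bits per weight, which is where the memory savings for low-memory deployment come from.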

By the numbers

The numbers, without exaggeration.

Arabic

1.32×

tokens/word — lower is better

English

1.05×

Beats GPT-4o (1.10×)

Vocabulary

64,000

SentencePiece Unigram

Training data

1GB+

Real Arabic + English

AURAQUBIT · 2026

An intentional quiet.
We're shaping something worthy.

Polishing the details, quietly. Back soon — with something worth this pause.

In preparation

AURAQUBIT · Made in Oman
