The Expanding Universe of Unicode – billpg industries™

Before Unicode, digital text lived in a fragmented world of 8-bit encodings. ASCII had settled in as the good-enough-for-English core, taking up the first half of codes, but the other half was a mish-mash of regional code pages that mapped characters differently depending on locale. One set for accented Latin letters, another set for Cyrillic.

Each system carried its own assumptions, collisions, and blind spots. Unicode emerged as a unifying vision. a single character set for all human languages, built on a 16-bit foundation. All developers had to do was swap their 8-bit loops for 16-bit loops. Some bristled that half the bytes were all zeros, but this was for the greater good.

16-bits made 65,536 code points. It was a bold expansion from the cramped quarters of ASCII, a ceremonial leap into linguistic universality. This was enough, it was thought, to encode the entirety of written expression. After all, how many characters could the world possibly need?

“Remember this girls. None of you can be first, but all of you can be next.”

🐹 I absolutely UTF-8 those zero bytes.

It was in this world of 16-bit Unicode that UTF-8 emerged. This had the notable benefit of being compatible with 7-bit ASCII, using the second half of ASCII to encode the non-ASCII side of Unicode as multiple byte sequences.

If your code knew how to work with ASCII it would probably work with UTF-8 without any changes needed. So long as it passed over those multi-byte sequences without attempting to interpret them, you’d be fine. The trade-off was that while ASCII characters only took up one byte, most of Unicode took three bytes, with the letters-with-accents occupying the two-bytes-per-character range.

This wasn’t the hard limit of UTF-8. The initial design allowed for up to 31-bit character codes. Plenty of room for expansion!

🔨 Knocking on UTF-16’s door.

As linguistic diversity, historical scripts, emoji, and symbolic notations clamoured for representation, the Unicode Consortium realised their neat two-byte packages would not be enough and needed to be extended. The world could have moved over to the UTF-8 where there was plenty of room, but too many systems had 16-bit Unicode baked in.

The community that doggedly stuck with ASCII and its 8-bits per character design must have felt a bit smug seeing the rest of the world move to 16-bit Unicode. They stuck with their good-enough-for-English encoding and were rewarded with UTF-8 with its ASCII compatibility and plenty of room for expansion. Meanwhile, those early adopters who made the effort to move to the purity of their fixed size 16-bit encoding were told that their characters weren’t going to be fixed size any more.

This would be the plan to move beyond the 65,536 limit. Two unused blocks of 1024 codes were set aside. If you wanted a character in the original range of 16-bit values, you’d use the 16-bit code as normal, but if you wanted a character from the new extended space, you had to put two 16-bit codes from these blocks together. The first 16-bit code gave you 10 bits (1024=2¹⁰) and the second 16-bit code you 10 more bits, making 20 bits in total.

(Incidentally, we need two separate blocks to allow for self-synchronization. If we only had one block of 1024 codes, we could not drop into the middle of a stream of 16-bit codes and simply start reading. It is only by having two blocks you know that if the first 16-bit code you read is from the second block, you know to discard that one and continue afresh from the next one.)

The original Unicode was rechristened the “Basic Multilingual Plane” or plane zero, while the 20-bit codes allowed by this new encoding were split into 16 separate “planes” of 65,536 codes each, numbered from 1 to (hexadecimal) 10. UTF-16 with its one million possible codes was born.

UTF-8 was standardized to match UTF-16 limits. Plane zero characters were represented by one, two or three byte sequences as before, but the new extended planes required four byte sequences. The longer byte sequences were still there but cordoned off with a “Here be dragons” sign, their byte patterns declared meaningless.

“Don’t need quarters, don’t need dimes, to call a friend of mine. Don’t need computer or TV to have a real good time.”

🧩 What If We Run Out Again?

Unicode’s architects once believed 64K code points would suffice. Then they expanded to a little over a million. But what if we run out again?

It’s not as far-fetched as it sounds. Scripts evolve. Emoji proliferate. Symbolic domains—mathematical, musical, magical—keep expanding. And if humanity ever starts encoding dreams, gestures, or interspecies diplomacy, we might need more.

Fortunately, UTF-8 is quietly prepared. Recall that its original design allowed for up to 31-bit code points, using up to 7 bytes per character. The technical definition of UTF-8 restricts itself to 21 bits, but the scaffolding for expansion is still there.

On the other hand, UTF-16 was never designed to handle more than a million codes. There’s no large unused range of unused code in plane zero to add more bits. But what if we need more?

For now, we can relax a little because we’re way short. Of the 17 planes, only the first four and last three have any codes allocated to them. Ten planes are unused. Could we pull the same trick with that unused space again?

🧮 An Encoding Scheme for UTF-16X

Let’s say we do decide to extend UTF-16 to 31 bits in order to match UTF-8’s original ceiling. Here’s a proposal:

Planes C and D (0xC0000 to 0xDFFFF) are mostly unused, aside from two reserved codes at the end of each.
We designate 49152 codes (2¹⁴+2¹⁵) from each plane as encoding units. This number is close to √2³¹, making it a natural fit.
A Plane C code followed by a Plane D code form a composite: (C×49152+D)
This yields over 2.4 billion combinations, which is more than enough to cover the 31-bit space.

This leaves us with these encoding patterns:

Basic Unicode is represented by a single 16-bit code.
The 16 extended planes by two 16-bit codes.
The remaining 31-bit space as two codes from the C and D planes, or four 16-bit codes.

This scheme would require a new decoder logic, but it mirrors the original surrogate pair trick with mathematical grace. It’s a ritual echo, scaled to the future. Code that only knows about the 17 planes will continue to work with this encoding as long as it simply passes the codes along rather than trying to apply any meaning to them, just like UTF-8 does.

🔥 An Encoding and Decoding Example

Let’s say we want to encode a Unicode code point 123456789 using the UTF-16X proposal above.

To encode into a plane C and plane D pair, divide and mod by 49152:

Plane C index: C = floor(123456789 / 49152) = 2512
Plane D index: D = 123456789 % 49152 = 21381

To get the actual UTF-16 values, add accordingly:

Plane C code: 0xC0000 + 2512 = 0xC09C0
Plane D code: 0xD0000 + 21381 = 0xD537D

To decode these two UTF-16 codes back, mask off the C and D plane bits to multiply and add the two values:

2512 × 49152 + 21381 = 123456789

🧠 Reader’s Exercise

Try rewriting the encoding and decoding steps above using only bitwise operations. Remember that 49,152 was chosen for its bit pattern and that you can replace multiplication and division with combinations of shifts and additions.

🌌 The Threshold of Plane B

Unicode’s expansion has been deliberate, almost ceremonial. Planes 4 through A remain largely untouched, a leisurely frontier for future scripts, symbols, and ceremonial glyphs. We allocate codes as needed, with time to reflect, revise, and ritualize.

But once Plane B begins to fill—once we cross into 0xB0000—we’ll be standing at a threshold. That’s the moment to decide how, or if, we go beyond?

As I write this, around a third of all possible code-points have been allocated. What will we be thinking that day in the future? Will those last few blocks be enough for what we need? Whatever we choose, it should be deliberate. Not just a technical fix, but a narrative decision. A moment of protocol poetry.

Because encoding isn’t just compression—it’s commitment. And Plane B is where the future begins.

“I could say Bella Bella, even Sehr Wunderbar. Each language only helps me tell you how grand you are.”

Credits
📸 “Dasha in a bun” by Keri Ivy. (Creative Commons)
📸 “Toco Toucan” by Bernard Dupont. (Creative Commons)
📸 “No Public Access” by me.
🤖 Editorial assistance and ceremonial decoding provided by Echoquill, my AI collaborator.