Unicode – billpg industries™

It started, as these things often do, with a tiny spark of inspiration. I was looking at the way Unicode builds those racially diverse family emoji. These use ZWJ sequences to glue together adults, children, and skin‑tone modifiers into a single little glyph. It’s clever, constrained, and surprisingly elegant. It gave me an idea.

What if you could do the same thing for flags? Unicode already has national flag emojis along with 🏳️‍🌈, 🏳️‍⚧️, 🏴‍☠️, 🏁,🏳️,🏴,🚩 and the countries-in-our-hearts, 🏴󠁧󠁢󠁥󠁮󠁧󠁿, 🏴󠁧󠁢󠁳󠁣󠁴󠁿, 🏴󠁧󠁢󠁷󠁬󠁳󠁿. But no more. The people in charge of assigning codes have decided this is too much of a geopolitical and culture-war nightmare so these are all the flags we’re going to get.

It was this, together with the build-a-family codes, that gave me the idea. A minimal system. A harmless system that would allow Unicode to avoid the minefield.

Famous last words.

“Not sure I told you, but I really like your teeth. That hairy coat of yours with nothing underneath. Not sure you have a name, so I will call you Keith.”

Just Horizontal Stripes. How Hard Could It Be?

My starting point was beautifully simple. We already have colourful square emojis, so put several in a row with a code that says “take those colours and turn them into a striped flag.”

Want to identify with the British suffragettes? No problem. Type “🟩⬜🟪” and an end marker and you’ll get the classic “Give Women Votes” banner.

That’s it. No more than six stripes. No fancy geometry. No overlays. No heraldry. Just a neat little way to express the British suffragette flag without needing a bespoke emoji or any of the political minefield that would come with it.

It felt clean. It felt doable. It felt like something Unicode might actually consider.

But I wasn’t done.

“We’re just like you, only differently inclined.”

Most pride flags are horizontal stripes, but many flags prefer vertical stripes, and this seemed like a very simple extension to the idea. Two new code points. One for horizontal stripes and another for vertical stripes.

Still simple. Still manageable. Still not terrifying.
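As a rough sketch of the idea at this stage (not part of any real proposal), the stripe sequences could be read like this in Python. The two end markers are private-use code points picked purely for illustration, and the function name and colour table are likewise invented for the example.

```python
# Hypothetical end markers: private-use code points chosen only for this sketch.
HORIZONTAL_END = "\uE000"
VERTICAL_END = "\uE001"

# Colour square emoji (written as escapes) mapped to plain colour names.
SQUARES = {
    "\U0001F7E5": "red",
    "\U0001F7E7": "orange",
    "\U0001F7E8": "yellow",
    "\U0001F7E9": "green",
    "\U0001F7E6": "blue",
    "\U0001F7EA": "purple",
    "\U0001F7EB": "brown",
    "\u2B1B": "black",
    "\u2B1C": "white",
}
MAX_STRIPES = 6


def parse_stripe_flag(text):
    """Return (orientation, colours) for a stripe sequence, or None if invalid."""
    if not text or text[-1] not in (HORIZONTAL_END, VERTICAL_END):
        return None
    orientation = "horizontal" if text[-1] == HORIZONTAL_END else "vertical"
    stripes = text[:-1]
    if not 1 <= len(stripes) <= MAX_STRIPES:
        return None
    if any(ch not in SQUARES for ch in stripes):
        return None
    return orientation, [SQUARES[ch] for ch in stripes]


# Green, white, purple, then the horizontal end marker.
print(parse_stripe_flag("\U0001F7E9\u2B1C\U0001F7EA" + HORIZONTAL_END))
```

Anything that is not one to six colour squares followed by one of the two markers is simply rejected, which is about as much grammar as the idea needed at this point.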

“I am making little watercolors and pastels, I think they will come out all right.”

The trans pride flag doesn’t use red and blue, but pink and baby-blue. Pastel shades. While the trans flag was already there, anyone wanting to use the trans colours would need to make do with bright-red and bright-blue.

But I’m already inventing new code points, so why not add one more? A pastel modifier that you could attach to any of the colour emojis, even outside of my primary flag composition plan. “💚” and the pastel modifier would equal a pastel-green heart.

This was the moment I should have stopped. I had taken the first step toward a graphics language, but I didn’t see it yet.

“Wake up where the clouds are far behind. Where troubles melt like lemon drops, high above the chimney tops, that’s where you’ll find me.”

Crosses, Saltires, and Cantons

Once you start thinking about flags, you can’t avoid the classics. The Nordic cross, St Andrew’s saltire, America’s star-spangled canton, the British layered geometry.

So I added codes that allowed you to add various kinds of crosses, each with a colour, and a canton code that meant “this block is a new flag that will be embedded in the top left”.

Some flags have heraldic symbols on them like stars or animals. I was never shooting for pixel perfect representations, so I added codes that would allow you to add an emoji to your flag in a variety of positions.

Flag of 🇦🇺 Australia?

Blue stripe.
Southern Cross emoji on the fly.
White star emoji on the lower hoist.
Start Canton.
Blue Stripe.
White Saltire.
Red Saltire.
White Cross.
Red Cross.
End Canton.

And suddenly my simple stripe system needed rules for layering, masking, and region‑specific drawing.
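To see how quickly that happens, here is a minimal Python sketch that turns the Australia instruction list into nested layers. The instruction strings and the `parse_flag` name are made up for this example. Nothing here draws anything, and already a stack is needed just to keep track of which flag the next layer belongs to.

```python
START_CANTON = "Start Canton"
END_CANTON = "End Canton"


def parse_flag(instructions):
    """Turn a flat instruction list into nested layers; a canton becomes a sub-list."""
    stack = [[]]  # stack[-1] is the flag currently being built
    for step in instructions:
        if step == START_CANTON:
            stack.append([])
        elif step == END_CANTON:
            if len(stack) == 1:
                raise ValueError("End Canton without a matching Start Canton")
            canton = stack.pop()
            stack[-1].append({"canton": canton})
        else:
            stack[-1].append(step)
    if len(stack) != 1:
        raise ValueError("Start Canton without a matching End Canton")
    return stack[0]


australia = parse_flag([
    "Blue Stripe",
    "Southern Cross emoji on the fly",
    "White star emoji on the lower hoist",
    "Start Canton",
    "Blue Stripe",
    "White Saltire",
    "Red Saltire",
    "White Cross",
    "Red Cross",
    "End Canton",
])
print(australia)
```

The order of the layers inside the canton matters too, since each saltire and cross is painted over the one before it.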

The Moment of Realisation

I stepped back and looked at what I had created. Orientation rules, layering rules, colour‑modification rules, region‑placement rules, geometry rules, compositing rules, rule rules…

And it hit me with the force of a thousand W3C specifications. I was reinventing SVG. In codepoints. Not just SVG but a restricted, weirdly encoded, Unicode‑flavoured SVG with all the complexity and none of the tooling.

I had built a graphics language disguised as emoji. Unicode would never accept it. Unicode encodes meaning, not appearance. The moment you introduce a system that generates arbitrary graphics, you’ve left the world of characters and entered the world of graphics engines.

I had crossed the boundary from symbol encoding into procedural graphics, and I hadn’t even noticed until I was halfway through designing a colour‑modifier block.

My original idea — the tiny, innocent, three‑stripe suffragette flag — was lovely, but ideas like this have a way of expanding. You add one rule, then another, then another, and before you know it you’re writing a miniature graphics specification and wondering why your “simple emoji idea” now needs a colour‑space definition and a geometry engine.

Unicode didn’t ask for SVG‑Lite‑But‑Worse. I just accidentally built it.

Honestly, it was a fun ride.

Credits
📸 “Marche des fiertés Toulouse 2011” by Guillaume Paumier. (Creative Commons)
📸 “Statue in the ground of Tenison Woods Catholic College in Mt Gambier” by “denisbin”. (Creative Commons)

Before Unicode, digital text lived in a fragmented world of 8-bit encodings. ASCII had settled in as the good-enough-for-English core, taking up the first half of codes, but the other half was a mish-mash of regional code pages that mapped characters differently depending on locale. One set for accented Latin letters, another set for Cyrillic.

Each system carried its own assumptions, collisions, and blind spots. Unicode emerged as a unifying vision: a single character set for all human languages, built on a 16-bit foundation. All developers had to do was swap their 8-bit loops for 16-bit loops. Some bristled that half the bytes were all zeros, but this was for the greater good.

Sixteen bits gave 65,536 code points. It was a bold expansion from the cramped quarters of ASCII, a ceremonial leap into linguistic universality. This was enough, it was thought, to encode the entirety of written expression. After all, how many characters could the world possibly need?

“Remember this girls. None of you can be first, but all of you can be next.”

🐹 I absolutely UTF-8 those zero bytes.

It was in this world of 16-bit Unicode that UTF-8 emerged. It had the notable benefit of being compatible with 7-bit ASCII, using the byte values above the ASCII range (those with the top bit set) to encode the non-ASCII side of Unicode as multi-byte sequences.

If your code knew how to work with ASCII it would probably work with UTF-8 without any changes needed. So long as it passed over those multi-byte sequences without attempting to interpret them, you’d be fine. The trade-off was that while ASCII characters only took up one byte, most of Unicode took three bytes, with the letters-with-accents occupying the two-bytes-per-character range.
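A quick way to see both points is with Python's built-in UTF-8 codec. The sample characters are arbitrary picks for illustration: one ASCII letter, one accented Latin letter and one symbol from further up plane zero.

```python
# "A", e-acute (U+00E9) and the euro sign (U+20AC), written as escapes.
samples = ["A", "\u00e9", "\u20ac"]
for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# ASCII-only logic can still split on an ASCII delimiter, because every byte
# inside a multi-byte sequence has its top bit set and so never looks like a comma.
raw = "caf\u00e9,na\u00efve,\u20ac5".encode("utf-8")
fields = raw.split(b",")
print([field.decode("utf-8") for field in fields])
```

The split works on raw bytes without knowing anything about UTF-8, which is the "pass over without interpreting" behaviour described above.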

Three bytes wasn’t the hard limit of UTF-8, either. The initial design allowed for up to 31-bit character codes. Plenty of room for expansion!

🔨 Knocking on UTF-16’s door.

As linguistic diversity, historical scripts, emoji, and symbolic notations clamoured for representation, the Unicode Consortium realised their neat two-byte packages would not be enough and needed to be extended. The world could have moved over to UTF-8, where there was plenty of room, but too many systems had 16-bit Unicode baked in.

The community that doggedly stuck with ASCII and its 8-bits per character design must have felt a bit smug seeing the rest of the world move to 16-bit Unicode. They stuck with their good-enough-for-English encoding and were rewarded with UTF-8 with its ASCII compatibility and plenty of room for expansion. Meanwhile, those early adopters who made the effort to move to the purity of their fixed-size 16-bit encoding were told that their characters weren’t going to be fixed-size any more.

This would be the plan to move beyond the 65,536 limit. Two unused blocks of 1024 codes were set aside. If you wanted a character in the original range of 16-bit values, you’d use the 16-bit code as normal, but if you wanted a character from the new extended space, you had to put two 16-bit codes from these blocks together. The first 16-bit code gave you 10 bits (1024=2¹⁰) and the second 16-bit code gave you 10 more bits, making 20 bits in total.
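The arithmetic can be sketched in a few lines of Python. The block positions (0xD800 and 0xDC00) and the subtraction of 0x10000 come from the UTF-16 definition rather than from the description above, and the function names are just labels for this example.

```python
HIGH_BASE = 0xD800  # first block of 1024 codes
LOW_BASE = 0xDC00   # second block of 1024 codes


def to_surrogates(code_point):
    """Split a code point from beyond plane zero into two 16-bit codes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond plane zero use a pair")
    offset = code_point - 0x10000  # a 20-bit value
    return HIGH_BASE + (offset >> 10), LOW_BASE + (offset & 0x3FF)


def from_surrogates(high, low):
    """Rebuild the code point from a pair of 16-bit codes."""
    return 0x10000 + ((high - HIGH_BASE) << 10) + (low - LOW_BASE)


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))
assert from_surrogates(high, low) == 0x1F600
```

The top ten bits of the offset go into the first code and the bottom ten into the second, which is where the 20 bits come from.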

(Incidentally, we need two separate blocks to allow for self-synchronization. If we only had one block of 1024 codes, we could not drop into the middle of a stream of 16-bit codes and simply start reading. With two blocks, if the first 16-bit code you read is from the second block, you know to discard it and continue afresh from the next one.)
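A minimal sketch of that resynchronization rule, assuming the second block sits at 0xDC00 to 0xDFFF as it does in UTF-16. The helper names and the sample stream are invented for the example.

```python
def is_second_block(unit):
    """True for a 16-bit code from the second block, 0xDC00 to 0xDFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def resync(units, start):
    """Return the first index at or after `start` where a character begins."""
    if start < len(units) and is_second_block(units[start]):
        return start + 1  # landed on the back half of a pair, so skip it
    return start


# "a", then U+1F600 as a pair of codes, then "b".
stream = [0x0061, 0xD83D, 0xDE00, 0x0062]
print(resync(stream, 2))
```

Dropping in at index 2 lands on the second half of the pair, so reading resumes at the next code. Dropping in anywhere else needs no adjustment.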

The original Unicode was rechristened the “Basic Multilingual Plane” or plane zero, while the 20-bit codes allowed by this new encoding were split into 16 separate “planes” of 65,536 codes each, numbered from 1 to (hexadecimal) 10. UTF-16 with its one million possible codes was born.

UTF-8 was standardized to match UTF-16 limits. Plane zero characters were represented by one, two or three byte sequences as before, but the new extended planes required four byte sequences. The longer byte sequences were still there but cordoned off with a “Here be dragons” sign, their byte patterns declared meaningless.
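Python's UTF-8 codec follows the standardized limits, so it makes a handy check. The five-byte pattern below is a hand-built example of one of the longer sequences from the original design.

```python
emoji = "\U0001F600"  # a plane 1 character
encoded = emoji.encode("utf-8")
print(len(encoded), encoded.hex(" "))  # four bytes for a character beyond plane zero

# A five-byte pattern from the original design. A modern decoder refuses it.
old_style = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
try:
    old_style.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
print("five-byte sequence accepted:", accepted)
```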

“Don’t need quarters, don’t need dimes, to call a friend of mine. Don’t need computer or TV to have a real good time.”

🧩 What If We Run Out Again?

Unicode’s architects once believed 64K code points would suffice. Then they expanded to a little over a million. But what if we run out again?

It’s not as far-fetched as it sounds. Scripts evolve. Emoji proliferate. Symbolic domains—mathematical, musical, magical—keep expanding. And if humanity ever starts encoding dreams, gestures, or interspecies diplomacy, we might need more.

Fortunately, UTF-8 is quietly prepared. Recall that its original design allowed for up to 31-bit code points, using up to 6 bytes per character. The technical definition of UTF-8 restricts itself to 21 bits, but the scaffolding for expansion is still there.
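For the curious, here is a sketch of that scaffolding in Python. The thresholds are those of the original design as later written up in RFC 2279, where the longest sequence is six bytes. The function names are invented for this example, and no real codec will accept the longer outputs today.

```python
# Upper bound (exclusive) of each sequence length in the original design.
LIMITS = [(0x80, 1), (0x800, 2), (0x10000, 3),
          (0x200000, 4), (0x4000000, 5), (0x80000000, 6)]


def original_utf8_length(code_point):
    """Bytes needed for a code point under the original 31-bit design."""
    if code_point < 0:
        raise ValueError("code points are non-negative")
    for limit, length in LIMITS:
        if code_point < limit:
            return length
    raise ValueError("beyond the 31-bit ceiling")


def original_utf8_encode(code_point):
    """Encode a code point using the original one-to-six byte patterns."""
    length = original_utf8_length(code_point)
    if length == 1:
        return bytes([code_point])
    # Continuation bytes carry six bits each, collected lowest bits first.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # The lead byte starts with `length` one-bits, e.g. 0b11110000 for length 4.
    lead_marker = (0xFF << (8 - length)) & 0xFF
    return bytes([lead_marker | code_point] + tail[::-1])


# Within today's limits the sketch agrees with the real codec.
assert original_utf8_encode(0x1F600) == "\U0001F600".encode("utf-8")
print(original_utf8_encode(0x7FFFFFFF).hex(" "))
```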

On the other hand, UTF-16 was never designed to handle more than a million codes. There’s no large unused range of codes in plane zero to add more bits. But what if we need more?

For now, we can relax a little because we’re way short. Of the 17 planes, only the first four and last three have any codes allocated to them. Ten planes are unused. Could we pull the same trick with that unused space again?

🧮 An Encoding Scheme for UTF-16X

Let’s say we do decide to extend UTF-16 to 31 bits in order to match UTF-8’s original ceiling. Here’s a proposal:

Planes C and D (0xC0000 to 0xDFFFF) are mostly unused, aside from two reserved codes at the end of each.
We designate 49152 codes (2¹⁴+2¹⁵) from each plane as encoding units. This number is close to √2³¹, making it a natural fit.
A Plane C code followed by a Plane D code form a composite: (C×49152+D)
This yields over 2.4 billion combinations, which is more than enough to cover the 31-bit space.

This leaves us with these encoding patterns:

Basic Unicode is represented by a single 16-bit code.
The 16 extended planes by two 16-bit codes.
The remaining 31-bit space as two codes from the C and D planes, or four 16-bit codes.

This scheme would require new decoder logic, but it mirrors the original surrogate pair trick with mathematical grace. It’s a ritual echo, scaled to the future. Code that only knows about the 17 planes will continue to work with this encoding as long as it simply passes the codes along rather than trying to apply any meaning to them, just like UTF-8 does.
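That pass-through behaviour can be checked with a codec that has never heard of this proposal. In the Python sketch below, the two code points are arbitrary picks from planes C and D, chosen only for illustration.

```python
# Two arbitrary code points, one from plane C and one from plane D.
pair = chr(0xC0001) + chr(0xD0002)

encoded = pair.encode("utf-16-be")
units = [int.from_bytes(encoded[i:i + 2], "big") for i in range(0, len(encoded), 2)]
print([f"{u:04X}" for u in units])  # four 16-bit codes: two surrogate pairs

# A codec that knows nothing of the proposal still round-trips the text.
assert encoded.decode("utf-16-be") == pair
```

The stock codec treats the pair as two ordinary, unassigned characters and carries them through untouched.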

🔥 An Encoding and Decoding Example

Let’s say we want to encode a Unicode code point 123456789 using the UTF-16X proposal above.

To encode into a plane C and plane D pair, divide and mod by 49152:

Plane C index: C = floor(123456789 / 49152) = 2511
Plane D index: D = 123456789 % 49152 = 36117

To get the actual UTF-16 values, add the plane bases:

Plane C code: 0xC0000 + 2511 = 0xC09CF
Plane D code: 0xD0000 + 36117 = 0xD8D15

To decode these two UTF-16 codes back, strip off the plane C and plane D bases, then multiply and add the two values:

2511 × 49152 + 36117 = 123456789
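The same steps as a minimal Python sketch. The names `utf16x_encode` and `utf16x_decode` are invented here, since UTF-16X is only a proposal and has no real implementation. The range check assumes values are limited to 31 bits, as the proposal suggests.

```python
UNITS_PER_PLANE = 49152  # 2**14 + 2**15 codes taken from each plane
PLANE_C = 0xC0000
PLANE_D = 0xD0000


def utf16x_encode(value):
    """Split a value into a plane C code and a plane D code."""
    if not 0 <= value < 2**31:
        raise ValueError("value must fit in 31 bits")
    c_index, d_index = divmod(value, UNITS_PER_PLANE)
    return PLANE_C + c_index, PLANE_D + d_index


def utf16x_decode(c_code, d_code):
    """Recombine a plane C code and a plane D code into the original value."""
    return (c_code - PLANE_C) * UNITS_PER_PLANE + (d_code - PLANE_D)


c_code, d_code = utf16x_encode(123456789)
print(hex(c_code), hex(d_code))
assert utf16x_decode(c_code, d_code) == 123456789
```

`divmod` does the divide and the mod in one step, and the final assert confirms the round trip.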

🧠 Reader’s Exercise

Try rewriting the encoding and decoding steps above using only bitwise operations. Remember that 49,152 was chosen for its bit pattern and that you can replace multiplication and division with combinations of shifts and additions.

🌌 The Threshold of Plane B

Unicode’s expansion has been deliberate, almost ceremonial. Planes 4 through A remain largely untouched, a leisurely frontier for future scripts, symbols, and ceremonial glyphs. We allocate codes as needed, with time to reflect, revise, and ritualize.

But once Plane B begins to fill—once we cross into 0xB0000—we’ll be standing at a threshold. That’s the moment to decide how, or if, we go beyond.

As I write this, around a third of all possible code-points have been allocated. What will we be thinking that day in the future? Will those last few blocks be enough for what we need? Whatever we choose, it should be deliberate. Not just a technical fix, but a narrative decision. A moment of protocol poetry.

Because encoding isn’t just compression—it’s commitment. And Plane B is where the future begins.

“I could say Bella Bella, even Sehr Wunderbar. Each language only helps me tell you how grand you are.”

Credits
📸 “Dasha in a bun” by Keri Ivy. (Creative Commons)
📸 “Toco Toucan” by Bernard Dupont. (Creative Commons)
📸 “No Public Access” by me.
🤖 Editorial assistance and ceremonial decoding provided by Echoquill, my AI collaborator.

Category: Unicode

Oh no! I’ve invented SVG badly!