What type of UUID should I use?

UUIDs, Universally Unique IDs, are handy 128-bit IDs. Their values are unique, universally, hence the name.

(If you work with Microsoft, you call them GUIDs. I do primarily think of them as GUIDs, but I’m going to stick with calling them UUIDs for this article, as I think that name is more common.)

These are useful for IDs. Thanks to their universal uniqueness, you could have a distributed set of machines, each producing its own IDs with no co-ordination necessary, even completely disconnected from each other, without worrying about any of those IDs colliding.

When you look at a UUID value, it will usually be expressed in hex and (because reasons) in hyphen-separated groups of 8-4-4-4-12 digits.

017f22e2-79b0-7cc3-98c4-dc0c0c07398f

You can tell which type of UUID it is by looking at the first digit of the middle four-digit block, the 7 in the example above. That digit always tells you which type of UUID you’re looking at. This one is a type 7 because that hex digit is a 7. If it were a 4, it would be a type 4.

As I write this, there are 8 types to choose from. But which type should you use? Type 7. Use type 7. If that’s all you came for, you can stop here. You ain’t going to need the others.

Type 7 – The one you actually want.

This type of UUID was designed for assigning IDs to records on database tables.

The main thing about type 7 is that the first block of bits is a timestamp. Since time always goes forward [citation needed] and the timestamp is right at the front, each UUID you generate will have a bigger value than the last one.

This is important for databases, as they are optimized for “ordered” IDs like this. To oversimplify, each database table has an index tracking each record by its ID, allowing any particular record to be located quickly, like flipping through a book until you get close to the one you want. The simplest place to add a new ID is at the end, and you can only do that if your new ID comes after all the previous ones. Adding a new record anywhere else will require that index to be reorganised to make space for the new one in the middle.

(You often see UUIDs criticised for being random and unordered, but that’s type 4. Don’t use type 4.)

The timestamp is 48 bits long and counts the number of milliseconds since the year 1970. This means we’re good until shortly after the year 10,000. Aside from the 6 bits which are always fixed, the remaining 74 bits are randomness, there so that all the UUIDs created in the same millisecond will be different. (Except it is a little more complicated than that. Read the RFC.)
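
Dot-net 9 added Guid.CreateVersion7() to generate these for you, but the layout is simple enough to sketch by hand. Here’s a minimal sketch; it skips the RFC’s extra monotonicity refinements and assumes dot-net 8 for the big-endian Guid constructor.

static Guid NewUuid7()
{
    var bytes = new byte[16];

    // 48-bit timestamp: milliseconds since 1970, most significant byte first.
    long ms = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
    for (int i = 0; i < 6; i++)
        bytes[i] = (byte)(ms >> (8 * (5 - i)));

    // Fill the remaining ten bytes with cryptographic randomness...
    System.Security.Cryptography.RandomNumberGenerator.Fill(bytes.AsSpan(6));

    // ...then overwrite the six fixed bits. (Version 7, variant "10".)
    bytes[6] = (byte)((bytes[6] & 0x0F) | 0x70);
    bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);

    return new Guid(bytes, bigEndian: true);
}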

So there we are. Type 7 UUIDs rule, all other types drool. We done?

“I was born in a flame. Mama said that everyone would know my name. I’m the best you’ve ever had. If you think I’m burning out, I never am.”

Migrating from auto-incrementing IDs.

Suppose you have an established table with a 32-bit auto-incrementing integer primary key. You want to migrate to type 7 UUIDs but you still need to keep the old IDs working. A user might come along with a legacy integer ID and you still want to allow that request to keep working as it did before.

You could create a batch of new type 7 UUIDs and build a new table that maps the legacy integer IDs to their new UUIDs. If that works for you, that’s great, but we can do without that table with a little bit of cleverness.

Let’s think about our requirements:

  1. We want to deterministically convert a legacy ID into its UUID.
  2. These UUIDs are in the same order as the original legacy IDs.
  3. New records’ UUIDs come after all the UUIDs for legacy records.
  4. We maintain the “universally unique”-ness of the IDs.

This is where we introduce type 8 UUIDs. The only rule of this type is that there are no rules. (Except they still have to be 128 bits and six of those bits must have fixed values. Okay, there are a few rules.) It is up to you how you construct this type of UUID.

Given our requirements, let’s sketch out how we want to lay out the bits of these IDs.

The type 7 UUIDs all start with a 01 byte, until 2039 when they will start with an 02. They won’t ever start with a 00 byte. So to ensure these IDs always come before any new IDs, we’ll make the first four hex digits (the first two bytes) all zeros. The legacy 32-bit integer ID can be the next four bytes.

Because we want the UUIDs we create to be both deterministic and universally-unique, the remaining bits need to look random without actually being random. Running a hash function over the ID and a fixed salt string will produce enough output to fill those remaining bits.

Now, to convert a legacy 32-bit ID into its equivalent UUID, we do the following:

  1. Start an array of bytes with two zero bytes.
  2. Append the four bytes of the legacy ID, most significant byte first.
  3. Find the SHA of (“salt” + legacy ID) and append the first 10 bytes of the hash to the array.
  4. Overwrite the six fixed bits (in the hash area) to their required values.
  5. Put the 16 bytes you’ve collected into a UUID type.

And there we have it. When a user arrives with a legacy ID, we can deterministically turn it into its UUID without needing a mapping table or conversion service. Because of the initial zero bytes, these UUIDs will always come before the new type 7 UUIDs. Because the legacy ID bytes come next, the new UUIDs will maintain the same order as the legacy IDs. Because 74 bits come from a hash function with a salt as part of its input, universal-uniqueness is maintained.
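
Here are those five steps as code. This is a minimal sketch, not gospel: I’m assuming SHA-256 as the hash, dot-net 8 for the big-endian Guid constructor, and “my-salt-string” is a placeholder for whatever fixed salt you choose.

static Guid LegacyIdToUuid(uint legacyId)
{
    var bytes = new byte[16];

    // Bytes 0-1 stay zero, so these always sort before any type 7 UUID.
    // Bytes 2-5 are the legacy ID, most significant byte first.
    bytes[2] = (byte)(legacyId >> 24);
    bytes[3] = (byte)(legacyId >> 16);
    bytes[4] = (byte)(legacyId >> 8);
    bytes[5] = (byte)legacyId;

    // Bytes 6-15 come from the hash of the salt and the legacy ID.
    byte[] hash = System.Security.Cryptography.SHA256.HashData(
        System.Text.Encoding.UTF8.GetBytes("my-salt-string" + legacyId));
    Array.Copy(hash, 0, bytes, 6, 10);

    // Overwrite the six fixed bits. (Version 8, variant "10".)
    bytes[6] = (byte)((bytes[6] & 0x0F) | 0x80);
    bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);

    return new Guid(bytes, bigEndian: true);
}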

What’s that? You need deterministic UUIDs but it isn’t as simple as dropping the bytes into place?

“You once thought of me as a white knight on his steed. Now you know how happy I can be.”

Deterministic UUIDs – Types 3 and 5.

These two types of UUID are the official deterministic types. If you have (say) a URL and you want to produce a UUID that represents that URL, these UUID types will do it. As long as you’re consistent with capital letters and character encoding, the same URL will always produce the same UUID.

The down-side of these types is that the UUID values don’t even try to be ordered, which is why I wrote the discussion of type 8 first. If the ordering of IDs is important, such as using them as primary keys, maybe think about doing it a different way.

Generation of these UUIDs works by hashing together a “namespace” UUID and the string you want to convert into a UUID. The hash algorithm is MD5 for type 3 or SHA1 for type 5. (In the case of SHA1, everything after the first 128 bits of hash is discarded.)

To use these UUIDs, suppose a user makes a request with a string value. You can turn that string into a deterministic UUID by running it through the generator function. That function has two parameters, a namespace UUID (which could be a standard namespace or one you’ve invented) and the string to convert. It runs the hash function over the input and returns the result as a UUID.
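
Dot-net doesn’t ship a generator for these, but the recipe is short. Here’s a sketch; the namespace constant is the standard URL namespace from the RFC, and as before I’m assuming dot-net 8 for the big-endian Guid conversions.

using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class NameBasedUuid
{
    // The standard namespace UUID for URLs, from RFC 4122.
    public static readonly Guid UrlNamespace =
        new("6ba7b811-9dad-11d1-80b4-00c04fd430c8");

    public static Guid Type5(Guid namespaceId, string name)
    {
        // Hash the namespace UUID (as big-endian bytes) followed by the
        // UTF-8 bytes of the name, keeping the first 16 bytes of the hash.
        byte[] input = namespaceId.ToByteArray(bigEndian: true)
            .Concat(Encoding.UTF8.GetBytes(name))
            .ToArray();
        byte[] uuid = SHA1.HashData(input)[..16];

        // Overwrite the six fixed bits. (Version 5, variant "10".)
        uuid[6] = (byte)((uuid[6] & 0x0F) | 0x50);
        uuid[8] = (byte)((uuid[8] & 0x3F) | 0x80);
        return new Guid(uuid, bigEndian: true);
    }
}

Calling NameBasedUuid.Type5(NameBasedUuid.UrlNamespace, "https://example.com/") returns the same UUID every time.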

These UUID types do the job they’re designed to do. Just as long as you’re okay with the values not being ordered.

Type 3 (MD5) or Type 5 (SHA1)?

There are pros and cons to each one.

MD5 is faster than SHA1. If you’re producing them in bulk, that may be a consideration.

MD5 is known to be vulnerable to collisions. If you have (say) a URL that hashes to a particular type 3 UUID, someone could construct a different URL that hashes to the same UUID. Is that a problem? If you’re the only one building these URLs that get hashed, then a hypothetical doer of evil isn’t going to get to have their bad URL injected in.

Remember, the point of a UUID is to be an ID, not something that security should be depending upon. Even the type 5 UUID throws away a big chunk of the bits produced, leaving only 122 bits behind.

If you want to hash something for security, use SHA256 or SHA3 and keep all the bits. Don’t use UUID as a convenient hashing function. That’s not what it’s for!

On balance, I would pick type 5. While type 3 is faster, the difference is trivial unless you’re producing IDs in bulk. You might think that MD5 collisions are impossible with the range of inputs you’re working with, but are you quite sure?

“I’ve seen this thing before, in my best friend and the boy next door. Fool for love and fool on fire.”

Type 4 – The elephant in the room

A type 4 UUID is one generated from 122 bits of cryptographic quality randomness. Almost all UUIDs you see out there will be of this type.

Don’t use these any more. Use type 7. If you’re the developer of a library that generates type 4 UUIDs, please switch it to generating type 7s instead.

Seriously, I looked for practical use cases for type 4 UUIDs. Everything I could come up with was either better served by type 7, or both types came out the same. I could not come up with a use-case where type 4 was actually better. (Please leave a comment if you have one.)

Except I did think of a couple of use-cases, but even then, you still don’t want to use type 4 UUIDs.

Don’t use UUIDs as secure tokens.

You shouldn’t use UUIDs as security tokens. They are designed to be IDs. If you want a security token, you almost certainly have a library that will produce one for you. Indeed, the library that produces your type 4 UUIDs will be using such a secure random source internally.

When you generate a type 4 UUID, six bits of randomness are thrown away in order to make it a valid UUID. It takes up the space of a 128 bit token but only has 122 bits of randomness.

Also, you’re stuck with those 122 bits. If you want more, you’d have to start joining them together. And you should want more – 256 bits is a common standard length for a reason.
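
On dot-net, for example, a proper 256-bit token is a couple of lines against the standard library’s cryptographic random source:

// A 256-bit security token, with no bits sacrificed to UUID formatting.
byte[] tokenBytes = System.Security.Cryptography.RandomNumberGenerator.GetBytes(32);
string token = Convert.ToBase64String(tokenBytes);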

But most of all, there’s a risk that whoever wrote the library that generates your UUIDs will read this article and push out a new version that generates type 7 UUIDs instead. Those do an even worse job of being security tokens.

I’m sure they’d mention it in that library’s release notes but are you going to remember this detail? You just want to update this one library because a dependency needs the new version. You tested the new version and it all works fine but suddenly your service is producing really insecure tokens.

Maybe the developers of UUID libraries wouldn’t do that, precisely because of the possibility of misuse, but that’s even more reason to not use UUIDs as security tokens. We’re holding back progress!

In Conclusion…

Use type 7 UUIDs.

“Only to find the night-watchman, unaware of his presence in the building.”

Picture Credits.
📸 “Night Ranger…” by Doug Bowman. (Creative Commons)
📸 “Cat” by Adrian Scottow. (Creative Commons)
📸 “Cat-36” by Lynn Chan. (Creative Commons)
📸 “A random landscape on a random day” by Ivo Haerma (Creative Commons)
📸 “Elena” by my anonymous wife. (With Permission)

I want a less powerful programming language for Christmas.

I’m writing this because I’m hoping someone will respond, telling me that what I want already exists. I have a specific itch and my suspicion is that developing a whole programming language and runtime is the only way to scratch that itch.

Please tell me I’m wrong.

Dear Father Christmas…

If you’ve ever written a web service, you’ve almost certainly had situations where you’ve taken a bunch of bytes from a completely untrusted stranger and passed those bytes into a JSON parser. What’s more, you’ll have done that without validating the bytes first.

Processing your inputs without sanitizing them first? Has Bobby Tables taught us nothing?

You can do this safely because that JSON parser will have been designed to be used in this manner and will be safe in the face of hostile inputs. If you did try feeding the bytes of an EXE file into a JSON parser, it’ll very quickly reject it complaining that “MZ” isn’t an opening brace and refuse to continue beyond that. The worst a hostile user could do is put rude messages inside the JSON strings.

{ "You": "A complete \uD83D\uDC18 head!" }

Now take that idea and imagine you had a web service where completely unauthenticated users could send any request body they liked and your service would run that request body through a copy of Python as the program source code.

Hopefully, you’ve just now remarked that it would be a very bad idea, up there with Napoleon’s idea to make his brother the King of Spain. But that’s exactly what I want to do. I want to write a web service that accepts Python code from complete strangers and actually run that code.

(And also make my brother the King of Spain. He’d be great!)

“Hang on to your hopes, my friend. That’s an easy thing to say. But if your hopes should pass away, simply pretend that you can build them again.”

At the gates of dawn

Some time in the early 90s, I had a game called “C Robots”.

This is a game where four tanks are in an arena, driving around and firing missiles at each other. But instead of humans controlling those tanks, each tank was controlled by a program written by the human player. The game controller would keep track of each tank and any missiles in flight, passing control back to each tank’s controller program to let it decide what its next move would be.

For 90s me, programming a robot appealed, but the tank battle part not so much. I really wanted to make a robot to play other games that might not involve tanks. At the time, there were two games I enjoyed playing with school friends, Dots-and-Boxes and Rummy. I had an idea of what made good strategies for these specific games, so I thought building those strategies into code might make for a good intellectual exercise.

Decades passed and I built a simple game controller system which I (rather pompously) called “Tourk”. I had made a start on the controllers for a handful of games but I hadn’t gotten around to writing actual competitive players, only simple random ones that were good for testing. I imagined that before long, people would write their own players, send them in to me and I’d compile them all together. After I’d let it run for a million games in a tournament, I’d announce the winner.

If anyone had actually written a player and sent it in, my first step would have been to inspect the submitted code thoroughly. These would have been actual C programs and could have done anything a C program could do, including dropping viruses on my hard disk, so inspecting that code would have been very important. Looking back, I’m glad no-one actually did that.

But this was one thing C Robots got right, even if it wasn’t planned that way. Once it compiled the player’s C code, it would run that code in a restricted runtime. Your player code could never go outside its bounds because there are no instructions in the C Robots runtime to do that. This meant that no-one could use this as an attack vector. (But don’t quote me on that. I’ve not actually audited the code.)

“I never ever ask where do you go. I never ever ask what do you do. I never ever ask what’s in your mind. I never ever ask if you’ll be mine.”

Will the runtime do it?

Could maybe the dot-net runtime or the Python runtime have the answer?

This was one of the first questions I asked on the (then) new Stack Overflow. The answer sent me to Microsoft’s page on “Code Access Security” and if you follow that link now, it says this feature is no longer supported.

Wondering more recently if Python might have an option to do what I wanted, I asked on Hacker News if there was a way to run Python in the way I wanted. There were a few comments but it didn’t get enough up-votes and disappeared fairly quickly. What little discussion we had was more to do with a side issue than the actual question I was asking.

I do feel that the answer might still be here. There’s quite possibly some flag on the runtime that will make any call to an extern function impossible. The Python runtime without the “os” package would seem to get 90% of the way there, but I don’t know enough about it to be certain that this won’t leave any holes open.

“We’re all someone’s daughter. We’re all someone’s son.”

Sanitize your inputs?

Maybe I should listen to Bobby Tables and sanitize my inputs before running them.

Keep the unrestricted runtime, but before we invoke it to run the potentially hostile code, scan it to check it won’t do any bad things.

Simple arithmetic in a loop? That’s fine.
Running a remote access trojan? No.

Once it has passed the test, you should be able to allow the code to run, confident it won’t do anything bad because you’ve already checked. This approach appeals to me because once that initial test has passed the code as non-hostile, we can allow the runtime to go at full speed.

The problem with this approach is all the edge cases and finding that line between simple arithmetic and remote-access-trojans. You need to allow enough for the actually-not-hostile code to do useful things, but not enough that a hostile user could exploit it.

Joining strings together is fine but passing that string into eval is not.
Writing text to stdout is fine but writing into a network socket is not.

Finding that line is going to be difficult. The best approach would be to start with nothing-is-allowed, but when considering what to add, first investigate what would be possible by adding that facility to the allowed list. Because it can be used for bad things, eval would never be on that allowed list.

If there’s a function with a million useful things it can do but one bad thing, that function must never be allowed.
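
To make the shape of this concrete, here’s a sketch of a default-deny scanner. It checks C# rather than Python (C# being the language I use elsewhere), leaning on the Roslyn parser, and a real checker would need to be far more thorough than this.

using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Default-deny: every identifier in the submitted code must appear on
// the allowed list, so nothing like "eval" ever gets a chance to run.
static bool LooksHarmless(string sourceCode, ISet<string> allowedNames)
    => CSharpSyntaxTree.ParseText(sourceCode)
        .GetRoot()
        .DescendantNodes()
        .OfType<IdentifierNameSyntax>()
        .All(id => allowedNames.Contains(id.Identifier.Text));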

“We can go where we want to. A place they’ll never find. We can act like we come from out of this world and leave the real one far behind.”

Ask the Operating System?

I told a colleague about this post while I was still writing it and he mentioned that operating systems can place restrictions on the programs they run. He showed me his Mac and there was a utility that listed all the apps he was running and the permissions each one had. It reminded me that my Android phone does something similar. If an app wants to interact with anything outside its realm, it has to ask first. This is why I’m happy to install apps on my Android phone but not on my Windows laptop.

This would be great, but how do I, a numpty developer, harness this power? What do I do if I want to launch a process (such as the Python runtime) but with all the permissions turned off? It feels like this will be the solution but my searching isn’t coming up with a practical answer.

My hope is that there’s a code library whose job it is to launch processes in this super restricted mode. It’ll work out which OS it is running on, do the necessary magic OS calls and finally launch the process in that super-restricted mode.

“If I was an astronaut I’d be floating in mid air. A broken heart would just belong to someone else down there. I would be the centre of my lonely universe. I’m only human and I’m crashing in the dark.”

Mmmm coffee!

The good people developing web browsers back in the 90s had the same need as me. They wanted to add a little interactivity to web pages, but without having to wait for a round trip back to the server over dialup, so they came up with a language they named JS.

As you read this page, your browser is running some code I supplied to you. That code can’t open up your files on your local device. If anyone did actually find a way to do that, the browser developers would call that a serious bug and push out an emergency update. So could JS be the solution I’m looking for?

As much as it sounds perfect, that JS runtime is inside the browser. If I have some JS code in my server process, how do I get that code into a browser process? Can I even run a web browser on a server without some sort of desktop environment?

The only project I know of where someone has taken JS outside of a browser is node-js. That might be the answer but I have written programs using node-js that load and save files. If this is the answer then I’d need to know how to configure the runtime to run the way I want.

“Play the game, fight the fight, but what’s the point on a beautiful night? Arm in arm, hand in hand. We all stand together.”

Is there an answer?

I began this post expressing my suspicion that the solution is to write my own runtime, designed from first-principles to run in a default-deny mode. I still wonder if that’s the case. I hope someone will read this post and maybe comment with the unknown option on the Python runtime that does exactly what I want.

In the meantime, I have another post in the works with my thoughts on how this runtime and programming language could work. I hope I can skip it.

Gronda-Gronda.

Picture Credits
📸 “Snow Scot” by Peeja. (With permission.)
📸 “Meeting a Robot” by my anonymous wife. (With permission)
📸 “Great Dane floppy ears” by Sheila Sund. (Creative Commons)
📸 “Fun with cling film” by Elizabeth Gomm. (Creative Commons)
📸 “Rutabaga Ball 2” by Terrence McNally. (Creative Commons)
📸 “Nice day for blowing the cobwebs off” by Jurassic Snark. (With permission.)

(And just in case advocating for your brother to be made King of Spain is treason or something, I don’t actually want to do that. It was a joke.)

My adventure into self web-hosting (Part 1)

If you had asked twenty-something me how he thought forty-something me would be hosting his website, he’d have predicted I had a rack of small servers in my attic, as part of a grid-computing business. (That’s what we called “cloud” computing back then.)

He’d have been disappointed to find out I’m using a shared web-hosting service, but that may change.

“The end of the day, remember the way, we stayed so close to the end, we’ll remember it was me and you ’cause we are gonna be…”

Over the Cliff

It all started when my article, Data-Mining Wikipedia for Fun and Profit made it to the top of Hacker News and stayed there for three hours. I was careful to try to not overburden the system by switching on an HTML cache. This way, visitors would only be served up static files without the server having to run the PHP code or talk to the database. Despite that, the server went down and I had to post a sheepish comment with a link to a mirror.

It was clear I was out-growing my current web-host. Despite my precautions, it couldn’t handle being popular for a few hours. Not only that, I’m a software developer and I wanted to develop software. The only practical choice on this service was PHP and I had long decided that life was too short for that.

I started looking at VM services as the natural next step on the ladder, but it was a chance discussion, again on Hacker News, that gave me an idea.

Clifford Stoll: “a heavy load on my raspberry-pi web server told me something was happening…”
Me: “your web server is a Raspberry PI, and it’s holding up while being on the HN front page?”
CS: “Hi Bill, Yep. Cloudflare is out front, so the actual load on the rasp-pi is mitigated by their content-delivery network.”

Suddenly, the idea of hosting a web server in my attic became real again. Reality had long since taught me that residential ISPs were no good for serious web hosting – but if there was a service that could deal with the bulk of GET requests and it could cover the occasional outage on my side from its cache, that’d change everything.

“Can you deal with my GET requests?”

Tunnelling

At the time, that Raspberry-Pi web server was on his residential ISP with a public IP address. That arrangement wouldn’t work for me as my own ISP didn’t allow their customers to run services like that. However, in that same comment thread, the very CTO of Cloudflare (John Graham-Cumming) mentioned to him that they had a new service that allowed their customers to VPN out to Cloudflare, making such port-forwarding shenanigans a thing of the past.

(As a not-quite declaration of bias, Cloudflare are on my list of companies I would like to work for should my current day-job come to an end. I am not (yet) an employee of Cloudflare and they’re not paying me to write this in any case. By the time you come to read this, that might have changed.)

This service is completely free. While I like not having to pay for things, it does make me a little nervous. This particular service isn’t going to be injecting ads into my site and I do understand how the free tier fits into their business model. But still, I’ve been burnt by free services suddenly disappearing before and you get no sympathy if you’ve become dependent on them. I kind of wish I could give them a few pounds each month, just in case.

Leaving such concerns to one side, I had a plan. Acquire a server and install it into one of the slots on my IKEA KALLAX unit the TV is sitting on. Plug it into my ISP’s router and once that’s running, install a web server along with the VPN software. I’ll finally be in charge of my very own web server, just like the twenty-something me thought I’d be.

“If I get to know your name, well I could trace your private number, baby. All I know is that to me, you look like you’re lots of fun. Open up your loving arms, I want some, want some. You spin me right round, baby, right round, like a record, baby, right round…”

Quiet!

I had acquired a second-hand PC for this purpose but once I got it home it was way too noisy. I needed a machine I could leave switched on 24/7 in the lounge where we watch TV. My server would have to be really quiet.

I also considered a Raspberry Pi, the same hardware Clifford Stoll used, but I wasn’t going to only be running a few WordPress instances. I had an idea I wanted to develop and I’d need a database with plenty of space for that to work. An SD card and maybe some USB storage wouldn’t cut it.

I’m not in any particular hurry to buy it as I still want to plan some more before the new machine starts taking up space. It was while I was reading reviews for various machines that I had the craziest of crazy ideas.

“And as we sit here alone, looking for a reason to go on. It’s so clear that all we have now are our thoughts of yesterday. La, la la la…”

It comes with Windows

Any PC I could buy is going to come with Windows pre-installed and fully licensed. I was always going to replace it with a variety of Linux, but I wondered, why not keep the copy of Windows?

Before you all think I’ve gone insane, there are a few benefits to doing it this way. I use Windows a lot for my day job so I’m familiar with its quirks and gotchas. Even though there’s a dot-net for Linux, my development machine runs Windows, so there would be fewer surprises if the production machine ran the same OS as the development machine. For the handful of WordPress sites I wanted to run, there were docker images available. Finally, because it won’t be directly connected to the scary internet, I wouldn’t have to panic when there’s an update.

But even as I’m writing this, I feel I’m going to regret doing it this way. I just know I’ll be writing part six of this series and it’ll be all about installing Linux on that server machine because there’s just one stupid thing I couldn’t get working on Windows. We shall see.

A foreshadowing?

Join me for part 2 of this series, where I’ll be experimenting with getting WordPress running from a Docker container. Wish me luck.

Picture Credits:
📸 “Kee-kaws”, by me.
📸 “Duke”, by my anonymous wife.
📸 “Haven Seafront, Great Yarmouth”, by me.
📸 “Quiet Couple” by Judith Jackson. (CC)
📸 “Blisworth Canal Festival, 2019”, by me.

Data-Mining Wikipedia for Fun and Profit

It all started after watching one too many videos narrating the English monarchy, all starting from King William Ⅰ in 1066 as if he’s the first king of England. This annoys me as it completely disregards the handful of Anglo-Saxon kings of England who reigned before the Normans.

They’re Kings of England. If you’re going to make a list of the Kings of England, then you should include the Kings of England.

It was this that made me want to make a particular edit to both the King Alfred and Queen Elizabeth pages on Wikipedia, acknowledging each as related to the other. But what is their relationship, and through whom?

I went to the page for Queen Elizabeth Ⅱ and started following the Mother/Father links until I found my way to King Alfred, mostly going through the other kings of England. I counted 36 generations, but was there a shorter or even longer route?

Sounds like a job for some software!

Gâteau Brûlé.

Scanning Wikipedia

We have the technology.

  • Visual Studio 2019 and C#.
  • RestSharp, a library for downloading HTML.
  • HtmlAgilityPack, a library for parsing and extracting data from HTML.

With these libraries downloaded from nuget, I was able to write some very quick and dirty code that would download the HTML for the Wikipedia page of Queen Elizabeth II, storing the HTML in a cache folder to save re-downloading it.

Once the HTML is downloaded (or read from the cache), HtmlAgilityPack can be called upon for the task of pulling items of data from the HTML. For example, the person’s full name, which is always the page’s only <H1>…</H1> element, can be extracted using one line of code:

// "html" is the page's HtmlAgilityPack document, loaded from the
// downloaded (or cached) HTML.
string personName =
    html
    .DocumentNode
    .Descendants()
    .Where(h => h.Name == "h1")
    .Single()
    .InnerText;

I used HtmlAgilityPack and LINQ in a similar way to pull out the Mother and Father for each person. The code would look for the info-box <TABLE>, then look inside for a <TH> with the text “Mother” or “Father”. It would then take a few steps backwards to look for the <TR> that the text is a part of and finally pull out all the links it can find inside.
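
My actual code was quick and dirty, but the shape of that query was something like this sketch. (The “infobox” class is how Wikipedia marks those tables; take the details with a pinch of salt.)

// Find the info-box table, find the <TH> reading "Mother", step back
// to its <TR>, then collect the links inside that row.
var motherLinks =
    html
    .DocumentNode
    .Descendants("table")
    .Where(t => t.GetAttributeValue("class", "").Contains("infobox"))
    .SelectMany(t => t.Descendants("th"))
    .Where(th => th.InnerText.Trim() == "Mother")
    .SelectMany(th => th.Ancestors("tr").First().Descendants("a"))
    .Select(a => a.GetAttributeValue("href", ""))
    .ToList();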

With the links to Queen Elizabeth’s mother and father, the code would add those links to a queue and the top-level would pull the next link and continue until the links run out.

Calm down!

This section was added after initial publication.

I would hope that people don’t need to be told to be considerate, but please be considerate.

Before I started on this project, I checked Wikipedia’s robots.txt file. This told me that my project was acceptable, quoth: “Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please.”

The article pages were exactly what I wanted. My code was already fairly low speed as it was all in a single thread. Nonetheless, I added a short delay after each download once I had worked the kinks out. I also set the User-Agent text to include my email address and phone number so Wikipedia server admins could raise an alarm with me personally if necessary.

As I was running my code in Visual Studio’s debug mode, I could leave the code running unattended (once I had observed it over the first hundred or so) with some breakpoints to stop everything until I could return to inspect what happened.

The most important breakpoints were on the examination of the response from Wikipedia. If the response was anything other than a 200/OK response (after redirects) or anything other than HTML, I wanted my code to stop dead until I could inspect what happened. Even if it happened overnight, I still want that response object in memory.

In the end, the bulk of the download took two days in a number of bursts. I’ll be sending a modest donation to the Wikimedia Foundation in thanks for accommodating my bizarre projects.

“She’s just a girl who says that I am the one…”

I made the decision here to only include people with an info-box. Extracting someone’s parents from free English text was a step too far. If you’re not notable enough to have an info-box with your parents listed, you’re not notable enough for this project. (Although I did find a couple of people who didn’t have a suitable info-box surprisingly early in the process. Rather than hack in an exception, I edited Wikipedia to include those people’s parents in their info-box, copying the link from elsewhere in the text.)

While that got me out of a small hole, more annoying was when the info-box listed “Parents” or “Parent(s)” instead of Mother and Father. I wanted to track matrilineal and patrilineal lines, so it was a little annoying to just have an individual’s parents with no clear indication of which one is which. I coded it so that if there’s only one link, assume it is the father. If there are two links, assume the father is the first one.

Because patriarchy.

“Also known as…”

Another issue was that some of the pages changed names. RestSharp would dutifully follow HTTP redirects, but I’d end up storing a page with one name but having a different name internally. This happened right away as the page for Queen Elizabeth links to her mother as “Elizabeth_Bowes-Lyon“, but once you follow the link, you end up at “Queen_Elizabeth_The_Queen_Mother“.

The HTML included a <LINK> tag named the “canonical reference”, so I could pull that out and use it as the primary key in my data structure. To keep the link between child and parent, it collects the aliases as they are detected and a quick reconciliation loop corrects the links after the initial loop completes.
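
Pulling that canonical link out was another short HtmlAgilityPack query, along these lines (a sketch, as before):

// Extract the canonical URL from the page's <LINK rel="canonical"> tag.
string canonicalUrl =
    html
    .DocumentNode
    .Descendants("link")
    .Where(n => n.GetAttributeValue("rel", "") == "canonical")
    .Select(n => n.GetAttributeValue("href", ""))
    .Single();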

King Alfred, also known as The Muffin Man.

From Alfred to Elizabeth.

Once I had a complete set of Wikipedia pages cached, the next step was to build a tree with all of the parental connections that lead from King Alfred to Queen Elizabeth. I knew that some non-people had crept in because someone’s parents would be listed as “(name) of (town)”, but that didn’t bother me as those towns wouldn’t have a mother or father listed and those loose ends would be discarded.

I wrote some code to walk the tree of connections. It started from Queen Elizabeth and recursively walked to each mother and father node. If a chain ended on King Alfred, the complete chain would be added to the list of chains.

With this reduced set in place, I churned through the nodes and generated a GraphViz file. For those who don’t know about it, this is an app for producing graphs of connected bubbles. You tell it what bubbles you want and how they are connected, and it automatically lays them out.

At this point, I was expecting a graph that would be mainly tall and thin and it would appear right here in this article. While family trees do grow exponentially, I wasn’t including every single relationship, only those that connect both of two individuals. If I were graphing the relationships between myself and a distant ancestor, I’d expect a single line, each parent handing over to their child. There would be a few bulges when third-or-so cousins marry. There, an individual’s two children would split off into separate lines, eventually reuniting with one ever-so-slightly inbred individual.

Yeah, that’s not what I got. This is the SVG file GraphViz generated for me. If you follow this link and are faced with a blank screen, scroll right until you find the King Alfred node. Then zoom out.

Aristocrats…

(The bubbles are all clickable, by the way.)

Count the Generations.

The graph was interesting but this wasn’t the primary objective of this exercise. I wanted to write “He is the n-times great-grandfather of his current successor Queen Elizabeth.” on King Alfred’s Wikipedia page.

But what’s the n? I already had a collection of all the chains between the two, so I just had to loop through them to find the longest and shortest. The longest chain has 45 links and the shortest chain has 31 links.

King Alfred is a 42-times great-grandfather of Queen Elizabeth Ⅱ.

(And also 28 times-great-grandfather. And everything in between.)

Here’s the simplified graph showing only those lines with exactly 45 links.

All the parental chains from Alfred to Elizabeth that have exactly 45 links.

“Let’s talk about sex.”

Earlier, I mentioned being annoyed that some info-boxes listed two parents instead of a mother and a father, requiring me to make assumptions that fathers are more likely to be included and put first, because these are aristocrats and society is quite patriarchal.

I still wanted to data-mine into matrilineal lines, so to check on those assumptions, I pulled out all of the people linked only in a “Parents” line of the info-box and checked they were all in order. The fathers all had manly names and the mothers all had womanly names. Seemed fine. But just to be sure, I queried my data structure for any individual that was listed as both a mother and a father, expecting that to happen from two different children’s pages.

There were several. Not only that, the contradicting links came from the same page. Someone apparently had the same individual as both his father and mother. Expecting to see the same person linked twice or a similar variety of quirk, I was surprised to see what should have been a very simple info-box to process.

Screen-shot of info-box for Duke Charles Louis Frederick of Mecklenburg

This person has an info-box with two individuals, each unambiguously listed as Father and Mother. Why was my code somehow interpreting the mother as the same individual as the father?

Investigating, I discovered that not only was Adolphus listed as someone’s mother, his actual mother was skipped over entirely. My data-structure simply didn’t have an entry for her.

To try and work out what was going on, I added a conditional breakpoint and looked as my code dutifully added her name to the queue of work, as well as later on when it was taken off the queue. The code downloaded her page as it disappeared into the parser. Yet the response that came back was that she was already accounted for. I beg to differ!

What I hadn’t done was click on her link. She didn’t have her own page, only a redirect to her husband’s page. Apparently, the only notable thing she had done, according to history, was marry her husband.

I later found a significant number of these links where a woman’s name is just a redirect to her husband. If the patriarchy isn’t going to allow me to rely on Mother/Father links as a sign of an individual’s parental role, investigating matrilineal lines will have to wait.

“We call our show, The Aristocrats!”

Acknowledgements and Notes

If you’d like to do your own analysis, I’ve saved the data I extracted into a JSON file you can download. I make no promises about its accuracy or completeness or indeed anything about the file. I’ve even hidden the word “Rutabaga” in there, just to make it clear how potentially inaccurate it is.

I showed a friend an earlier version of the chart and he wondered if I could do it better in Python. Maybe, but equally maybe not. This isn’t the C# of the early 2000s we’re dealing with. HtmlAgilityPack and LINQ combined can do very clever queries to extract data from web pages, often in single lines of code. Maybe there’s a Python component to do the same, I don’t know.

Rather than install GraphViz myself, I found online GraphViz did the job admirably and I’m grateful to them for hosting it. I’m also grateful to my friend Richard Heathfield for telling me about it several decades ago, back when I was thinking about building my own version control system. (Ah, to be young.)

RestSharp is a very nice component for downloading web content for processing. It flattens all the quirks of using the dot-net standard library directly and wraps it all up in a simple and consistent interface.

Oh, and here’s that Wikipedia edit, in all its glory. It was reverted around three minutes later by another editor but never mind.

Update: Hacker News discussion. Also, I am grateful to Denny Vrandečić for his analysis in response to this piece. I’ll be posting a more extensive response to all these soon.

Picture Credits:
📸 “Another batch of klutz” by “makeshiftlove”.
📸 “King Arthur statue in Winchester ” by “foundin_a_attic”.
📸 “</patriarchy>” by “Gaelx”.
📸 “Banana Muffins” by Richard Lewis.
📸 “River Seine” by Irene Steeves.

POP3 – Goodbye to numeric message IDs!

This is the second post in my series describing a number of extensions to the POP3 protocol. The main one is a mechanism to refresh an already opened connection to allow newly arrived messages to be downloaded, which I’ve described on a separate post. This one is a lot simpler in scope but if we’re doing work in this protocol anyway, this may as well come as a package.

I am grateful to the authors of POP4 for the original idea.

This post is part of series on POP3. See my project page for more.

Enumeration

To recap, when a client wishes to interact with a mailbox, it first needs to send a UIDL command to retrieve a list of messages in the form of pairs of message-id integers and unique-id strings. (I’ve written before how UIDL really should be considered a required command for both client and server.)

C: UIDL
S: +OK Unique-ids follow...
S: 1 AAA
S: 2 AAB
S: 3 AAJ
S: .

The numeric message-ids are only valid for the lifetime of this connection while the string unique-ids are persistent between connections. All of the commands (prior to this extension) that deal with messages use the numeric message-ids, requiring the client to store the UIDL response so it has a map from unique-id to message-id.
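
Building that map from the UIDL response is simple enough. A sketch in C#, assuming the multi-line response body has already been split into lines such as “1 AAA”:

// Build the unique-id to message-id map from a UIDL response body.
var messageIdByUniqueId =
    uidlLines
    .Select(line => line.Split(' ', 2))
    .ToDictionary(parts => parts[1], parts => int.Parse(parts[0]));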

This extension allows the client to disregard the message-ids entirely, modifying all commands that have a message-id parameter (RETR, TOP, DELE, LIST, UIDL) to use a unique-id parameter instead.

“UID:”

If the server lists UID-PARAM in its CAPA response, the client is permitted to use this alternative form of referencing a message. If a message-id parameter to a command is all numeric, the server will interpret that parameter as a numeric message-id as it always has done. If the parameter instead begins with the four characters “UID:”, the parameter is a reference to a message by its unique-id instead.

C: DELE 1
S: +OK Message #1 (UID:AAA) flagged for deletion.
C: DELE UID:AAB
S: +OK Message #2 (UID:AAB) flagged for deletion. 

(The POP4 proposal used a hyphen to indicate the parameter was a unique-id reference. I decided against adopting this as it could be confused for a negative number, as if numeric message-ids extended into the negative number space. A prefix is a clear indication we’re no longer in realm of numeric identifiers and may allow other prefixes in future.)

If a client has multiple connections to a single mailbox, it would normally need to perform a UIDL command and store the response for each connection separately. If the server supports unique-id parameters, the client is permitted to skip the UIDL command unless it needs a fresh directory listing. Additionally, the client is able to use multiple connections without having to store the potentially different unique-id/message-id maps for each connection.

RFC 1939 requires that unique-ids are made of “printable” ASCII characters, 33 to 126. As the space (32) is explicitly excluded, there is no ambiguity where a unique-id parameter ends, either with a space (such as with TOP) or at the end of the line.
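
Server-side, telling the two forms apart is a simple test. A sketch in C#, where MessageRef is a made-up type for illustration:

using System;
using System.Linq;

// A message parameter is either all-numeric (a connection-scoped
// message-id) or carries the "UID:" prefix (a persistent unique-id).
record MessageRef(int? MessageId, string UniqueId);

static MessageRef ParseMessageParameter(string param)
{
    if (param.Length > 0 && param.All(c => c >= '0' && c <= '9'))
        return new MessageRef(int.Parse(param), null);
    if (param.StartsWith("UID:", StringComparison.Ordinal))
        return new MessageRef(null, param.Substring(4));
    throw new FormatException("Unrecognised message reference: " + param);
}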

Not-Found?

If a requested unique-id is not present, the server will need to respond with a “-ERR” response. To allow the client to be sure the error is due to a bad unique-id rather than any other error, the error response should include a [UID] response code. (The CAPA response should also include RESP-CODES.)

C: RETR UID:ABC
S: -ERR [UID] No such message with UID:ABC.
C: RETR UID:ABD
S: -ERR [SYS/PERM] A slightly more fundamental error.

It should be noted that a [UID] error might not necessarily mean the message with this unique-id has been deleted. If a new message has arrived since this particular connection opened, the server may or may not be ready to respond to requests for that message. A client should only make the determination that a message has gone if it can confirm it with either a new or refreshed connection.

Extensions yet to come

I’ve pondered whether this extension, once it’s written up in formal RFC language, should modify any future extensions that use message-id parameters. Suppose next year, someone writes a new RFC without having read mine that adds a new command “RUTA” that rutabagas the message specified on its command line.

(What? To rutabaga isn’t a verb? Get that heckler out of here!)

The wording could be: “Any command available on a service that advertises this capability in its CAPA response, that accepts a message-id parameter that is bounded on both sides by either the space character or CRLF, and normally only allows numeric values in this position, MUST allow a uid-colon-unique-id alternative in place of the message-id parameter.”

(In other words, this capability only changes commands where a unique-id with a prefix can unambiguously be distinguished from a numeric message-id.)

My inclination is for the RFC defining this capability to exhaustively list the commands it modifies: just the ones we know about (RETR, TOP, DELE, LIST and UIDL). I would add a note that strongly encourages authors of future extensions to allow UID: parameters as part of their standards. If someone does add a RUTA command without such a note then, strictly speaking, the client shouldn’t try to use a UID: parameter with the RUTA command, but probably will.

I’m on the fence. What do you think?

Restrictions

RFC 1939, which defines POP3, makes a couple of allowances with the UIDL command that would make UID: parameters problematic. A server is allowed to reuse unique-ids in a single mailbox, but only if the contents of two messages are identical. A server is also allowed to reuse a unique-id once the original message using that unique-id has been deleted.

Since these allowances would introduce a complication to which message is being referenced, any server advertising this capability (in RFC language) MUST NOT exercise these two allowances. If a server advertises UID parameters, it is also promising that its unique-ids really are unique.

Fortunately, all mail servers I’ve looked at can already make this promise; either they use a hash but add their own “Received:” header, or they assign an incrementing ID to each incoming message.

billpg.com

Resources
New extension: Mailbox Refresh Mechanism
New extension: Delete Immediately
Prototype POP3 Service

POP3 – A Commit and Refresh Mechanism

POP3 is a popular protocol for accessing a mailbox full of email messages. While small devices have moved their mail reading apps to IMAP and proprietary protocols, POP3 remains the preferred protocol for moving email messages between big servers where a no-frills, download-and-delete system is preferred.

A problem with this protocol is embodied in this question: “How often should we poll for new messages?” There’s a non-trivial overhead to connecting. Poll too quickly and you overload the system. Poll too far apart and messages take too long to arrive.

Over my head!

To recap, let’s take a look at what needs to happen every time a POP3 client wants to check for new messages.

  • The client and server handshake TCP as the underlying connection.
  • The client and server handshake TLS for security.
  • The client authenticates itself to the server.
  • The client finally gets to ask if there are any new messages.

“Oh, no new messages? Okay, I’ll go through all that again in five seconds.”

POP3 doesn’t have a way to avoid this continual opening and closing of connections, but it does have a mechanism to add extensions to the protocol. All it needs is for someone to write the new extension down and to develop a working prototype. Which I have done.

“I have no time for your prototype!”

billpg industries POP3 service

On my github account, you’ll find a prototype POP3 server that implements this extension. Download it, compile it, run it. Go nuts. The service is written using the “listener” model. You set it up to listen for incoming connections and it talks the protocol until you shut it down. The library deals with the complexities while requests for messages are passed onto your provider code.

You don’t need to write that provider code if you only want to try it out. I’ve written a basic Windows app where you can type in new messages into a form, ready for the client to connect and download them. If Linux is more your thing or you prefer your test apps to work autonomously, I’ve also written a command-line app that sits there randomly populating a mailbox with new messages, waiting for a client to come along and get them.

To Sleep, and Goodnight…

Now for the new extension itself. It comes in two parts: the SLEE command puts a connection to sleep and the WAKE command brings it back.

If a server normally locks a mailbox while a connection is open, then SLEE should release that lock. In fact, SLEE is defined to do everything that QUIT does, except actually close down the connection. Crucially, this includes committing any messages deleted with DELE.

During a sleeping state, you’re no longer attached to the mailbox. None of the normal commands work. You can only NOOP to keep the underlying connection alive, QUIT to shut it down or WAKE to reconnect with your mailbox.

If the server responds to WAKE with a +OK response, a new session has begun. The refreshed connection needs to be viewed as if it is a new connection, as if the client had QUIT and reconnected. The numeric message IDs from before will now be invalid and so the client will need to send a new STAT or UIDL command to update them.

In order to save the client additional effort, the server should include a new response code in the +OK response.

  • [ACTIVITY/NEW] indicates there are new messages in the mailbox that were not accessible in the earlier session on this connection.
  • [ACTIVITY/NONE] indicates there are no new messages this time, but it does serve as an indication that this server has actively checked and that it is not necessary for the client to send a command to check.
  • (No “ACTIVITY” response indicates the server is not performing this test and the client will need to send a STAT or UIDL command to retrieve this information.)

A server might give an error response to a WAKE command, which may include a brief error message. In this situation where a connection can’t be refreshed for whatever reason, a client might choose to close the underlying connection and open a new one.

The error might include the response code [IN-USE] to indicate that someone else is connected to the mailbox, or [AUTH] to indicate that the credentials presented earlier are no longer acceptable.

Note that the SLEE command is required to include a commit of DELE commands made. If the client does not want the server to commit message deletes, it should send an RSET command first to clear those out.

How would a client use this?

To help unpack this protocol extension, here is a description of how the process would work in practice with a server that implements this extension.

The client connects to the server for the first time. It sends a CAPA request and the response includes SLEE-WAKE, which means this connection may be pooled later on.

S: +OK Welcome to my POP3 service!
C: CAPA
S: +OK Capabilities follow...
S: UIDL
S: RESP-CODES
S: SLEE-WAKE
S: .

The client successfully logs in and performs a UIDL which reveals three messages ready to be downloaded. It successfully RETRs and DELEs each message, one by one.

C: USER me@example.com
S: +OK Send password.
C: PASS passw0rd
S: +OK Logged in. You have three messages.
C: UIDL
(Remainder of normal POP3 session redacted for brevity.) 

Having successfully downloaded and flagged those three messages for deletion, the client now sends a SLEE command to commit those three DELE commands sent earlier. The response acknowledges that the deleted messages have finally gone and the connection has entered a sleeping state.

C: SLEE
S: +OK Deleted 3 messages. Sleeping.

The client can now put the opened connection in a pool of opened connections until needed later. It is not a problem if the underlying connection is closed without ceremony in this state, but it may be prudent for the pool manager to periodically send a NOOP command to keep the connection alive and to detect if any connections have since been dropped.

C: NOOP
S: +OK Noop to you too!

Time passes and the client wishes to poll the mailbox for new messages. It looks in the pool for an opened connection to this mailbox and takes the one opened earlier. It sends a WAKE command to refresh the mailbox.

C: WAKE
S: +OK [ACTIVITY/NONE] No new messages.

Because the server supports the ACTIVITY response code and the client recognized it, the client knows immediately that there is nothing left to do. It sends a second SLEE command to put the connection right back into the sleeping state.

C: SLEE
S: +OK Deleted 0 messages. Sleeping.

(Incidentally, the client is free to ignore the “ACTIVITY” response code and instead send a STAT or UIDL command to make its own conclusion. It will need to do this anyway if the server does not include such a response code.)

More time passes and the client is ready to poll the mailbox again. As before, it finds a suitable opened connection in the pool and it sends another WAKE command.

C: WAKE
S: +OK [ACTIVITY/NEW] You've got mail.

Having observed the notification of new mail, the client sends a new sequence of normal POP3 commands.

This was all within one connection. All the additional resources needed to repeatedly open and close TCP and TLS are no longer needed.
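
In client code, the polling loop collapses into something like this sketch. (Pop3Connection and its methods are hypothetical stand-ins for whatever client library you have; only the SLEE/WAKE flow comes from this proposal.)

// Poll a mailbox over a single pooled connection, sleeping between polls.
async Task PollLoopAsync(Pop3Connection conn, TimeSpan interval)
{
    while (true)
    {
        await Task.Delay(interval);
        string wake = await conn.SendCommandAsync("WAKE");
        if (wake.Contains("[ACTIVITY/NEW]"))
            await DownloadAndDeleteAllAsync(conn);  // the normal RETR/DELE cycle
        await conn.SendCommandAsync("SLEE");  // commit deletes, back to sleep
    }
}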

billpg.com

I’ll be doing the job of turning all this informal text into a formal RFC on my github project.

Picture Credits:
📷 Keyboard Painting Prototype by Rodolphe Courtier.

POP3 – UIDL is a required command!

RFC 1957 observes, discussing mail reading software that implements the popular POP3 protocol: “two popular clients require optional parts of the RFC. Netscape requires UIDL, and Eudora requires TOP.”

This reads like a complaint, but this tells me that Netscape’s mail reader (which these days is called Thunderbird) is well designed.

The rot started with RFC 1939, the standard for this protocol. This document specifies that UIDL is optional. This was a mistake. Without UIDL, the protocol is not reliable. I write this in the hope of persuading you that UIDL should not only be considered a requirement for a POP3 server, but that any client software that doesn’t require UIDL should not be trusted. I’m looking at you, Eudora!

What is UIDL and how does it fit into POP3?

UIDL is the “directory listing” command in POP3. When a client issues this request, the server responds with a list of “unique-id” strings that may as well be considered file names.

Opening a POP3 connection, authenticating and performing a “directory listing”.

Each unique-id is paired with a numeric id, starting from 1. The other commands to download and delete messages all use these numeric ids. Each time the client reconnects, it will need to repeat the UIDL command so it knows which numeric ids refer to which messages.

For something as fundamental as a directory listing, it seems odd for that to be optional.

Without UIDL, the client needs to fall back onto those numeric message ids alone. Instead of UIDL, the STAT command returns the number of messages in a mailbox. With that, the client can loop from 1 to n, downloading and deleting each one, leaving the mailbox empty once they have all been downloaded. As POP3 is explicitly designed for download-and-delete operation and not keeping the messages on the server, you might consider that UIDL is not necessary. So let us follow that road where we don’t have UIDL.

Living in a world without UIDL.

Operating POP3 without UIDL only works in an ideal world. If you had 100% reliable connections to the server then you might get away with it. Reality tells us the world is not ideal.

Let’s think about the step of deleting a message once you’ve downloaded it. You might think that DELE is the request to delete messages you’ve downloaded (or don’t want), but the request to actually delete messages is QUIT.

The client flags the messages to delete with DELE, but those deletes aren’t committed until the client later issues a QUIT request. If the connection stops before a QUIT, the server has to forget about those DELE commands and the messages all have to remain in the mailbox for when you reconnect. This is by design as you wouldn’t want your messages deleted if your client is in an unstable environment that can’t keep a connection open.

Consider though, what would happen if the underlying connection was dropped just as the client issued a QUIT request. You sent the request but no response came back.

Download and delete a single message, but the connection fails a critical point.

What happened? We don’t know. We can’t know. There are three reasonable possibilities…

  • The QUIT command never arrived at the server. The server just saw the connection drop.
  • The server couldn’t process the delete and responded with an error, which got lost.
  • The server successfully deleted the messages, but the response got lost.

You asked for some messages to be deleted, but you don’t know if your instruction was processed or not. The only way to find out is to reconnect (when you can) and see if the messages you asked the server to delete has gone or not.

Let’s say that time has passed and the client is finally able to reconnect to the server again. Last time, the client downloaded a single message and may or may not have deleted it. Now we’ve reconnected we find a single message in the mailbox. Is this the one we deleted before or a new one that’s arrived in the interim? A handy directory listing would be real useful right about now!

This is why I would mistrust any mail reading software that didn’t require that a mail server implements UIDL. Messages might get downloaded twice or wrongly deleted if the wrong assumptions are made.

“Come back!”

The alternatives to UIDL are all unreasonable.

If the above doesn’t convince you that UIDL is necessary, this section answers the counter-arguments I anticipate. Nuh huh!

(If you are already convinced and you don’t want to read my responses to anticipated arguments, you can skip this section.)

“That scenario you describe won’t ever happen in reality.”

Stage one: Denial.

Where is this perfect world where connections don’t stop working at the worst possible time? Where database updates happen instantly? I want to live there!

Think about what a server needs to do to process a QUIT command. Many flagged messages will need to be modified in an atomic transaction such that they won’t be included next time. Indexes will need to be updated and the dust needs to settle before the server can send its acknowledgement. During this time, the underlying TCP connection will be sitting there idle, looking just like a timeout error.

“We wouldn’t have a problem if mail servers were better engineered!”

Stage two: Anger.

If your requirements of a mail server include underlying connections over the public internet that never fail, I think your requirements are a little unreasonable.

“So I occasionally see two copies of a message in my mailbox. Big whoop!”

Stage three: Bargaining.

If that started happening in software I was using, I’d file a bug report.

“There are other ways POP3 can resolve this issue.”

Stage four: Depression.

Alas, all of the alternatives that POP3 provides are unreasonable.

You could use the response to LIST as a fall-back? This command requests the size in bytes of each message. Most messages are long enough that their sizes will usually differ, but this isn’t reliable. Messages are often going to have the same size as others just by accident.

You could use TOP to retrieve just the header and extract something from that to track messages? The problem there is that no single header is a reliable identity. Two adjacent messages might have the same date or the same subject. The closest candidate for a suitable identity is Message-ID, but this is generated by the sender, who might not include it or might reuse IDs. If we’re relying on the POP3 server to add missing ones or modify duplicates provided by a sender, we’re back to relying on optional features.

You could use the TOP response and hash the entire header? This could work except message headers can change. I first saw this when experimenting with a mail server and observed that if I connected to a mailbox using IMAP, it would leave IMAP’s version of a unique identity in the header which wasn’t there before. As well as that, anti-spam systems might re-examine a mailbox’s contents and update the anti-spam or anti-virus headers. Any of these changes would look like a new message.
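For illustration, here’s a sketch of that header-hashing idea with poplib’s TOP command and hashlib, with the fragility noted in a comment:

   import hashlib
   import poplib

   mailbox = poplib.POP3_SSL("pop.example.com")  # placeholder hostname
   mailbox.user("alice")
   mailbox.pass_("hunter2")

   message_count, _ = mailbox.stat()
   for numeric_id in range(1, message_count + 1):
       # "TOP n 0" fetches the headers and zero lines of the body.
       response, header_lines, octets = mailbox.top(numeric_id, 0)
       digest = hashlib.sha256(b"\r\n".join(header_lines)).hexdigest()
       # If the server or an anti-spam filter rewrites any header, this
       # digest changes and the message looks brand new.
       print(numeric_id, digest)

   mailbox.quit()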

(As well as all that, TOP is itself an optional command, just like UIDL.)

You could download the entire message again and ignore it if you already have it? This would be the ultimate fall-back. While I’ve seen headers change, the message body seems to be immutable. It’s still an unreasonable situation: we’re downloading the whole message again, just because the server chose not to implement a simple directory-listing command.

Am I certain that the message body is immutable? No, not at all. If someone commented that mail server XYZ updates messages in the form of a MIME attachment, I wouldn’t be at all surprised.

Update – A digression on the Message-ID header

(Added 28/Jan/2021)
I am grateful to commenter “theamk” on Hacker News, who responded when I shared this post. To my dismissal of Message-ID as a means of de-duplication, they noted that the RFC standards require Message-IDs to be generated as unique.

I have experienced senders who have broken the protocol, sending many different messages with the same Message-ID. I do not dispute that those senders were in the wrong, only that the POP3 server is not in a reasonable position to correct the situation.

If the server actively corrected the situation and replaced the reused Message-ID header with its own unique value, the message would no longer be a faithful reproduction of the message as sent, which would further damage any scope for auditing.

If the server discarded or rejected the message with a reused Message-ID, it would open up means for an attacker to predict the Message-ID a legitimate sender is going to use and send a message with that ID first, causing the legitimate sender’s message to be lost. There’s nothing stopping a sender from using someone else’s Message-ID pattern. (Maybe senders should use only unpredictable strings, but wishing it so won’t make it happen.)

This is all to say nothing of the situation where the messages served up don’t have any Message-ID at all, which I’ve seen happen with messages exchanged within the local server only. (i.e. not routed over the public internet’s mail servers.) None of the small number of services inside the box, from the original composer to the POP3 delivery agent, supplied a Message-ID when it was missing, so the message turned up with the basic To/From/Subject/etc headers and a Received header, but no Message-ID.

Acceptance?

Because the alternatives are so unreasonable, I consider UIDL a requirement for handling POP3. Servers that don’t implement UIDL are bad servers. Clients that can work without UIDL are unreliable.

Still not convinced? Please leave a comment where you saw this piece posted.

“I’ve seen the future! I’ve seen the future! I’ve seen the future and it’s now!”

IMAP does it wrong.

The other popular mail-reading protocol is IMAP. In contrast to POP3’s download-and-delete model, IMAP’s model is that messages stay on the server and are only downloaded when the client wishes to read them. This model enables mail readers on low-storage devices such as smartphones.

With IMAP, the IDs are restricted to numeric values that always go upwards, in contrast to the free-for-all “any printable ASCII except spaces” allowed by POP3. While this may be nice for the client, requiring a single source of incrementing ID numbers complicates matters for anyone wishing to implement an IMAP server using a distributed database as a back-end.

But the worst thing about IMAP’s message identity system is that the standard permits the server to discard any IDs it has assigned by updating a mailbox’s UIDVALIDITY property. If this value ever changes, it is a signal to the client that any unique IDs it may have remembered are no longer valid.

A client needs a reliable way to identify messages between connections so it can recover from an unknown state. It does not need servers with a license to be unreliable.

If a mail server that implements IMAP wants any respect from me, it should document that its UIDVALIDITY value is fixed and will never change, and that the unique-ids it generates are reliable.

POP3 does it wrong too.

If I’m going to criticize IMAP for flaws in its unique ID system, I should address flaws in POP3’s system too, having spent most of this article praising it.

Quoth RFC 1939: “The server should never reuse an unique-id in a given maildrop,” (good) “for as long as the entity using the unique-id exists.” (no!)

Consider that worst case scenario. The client flags a single message to be deleted and finally issues a QUIT command to complete the transaction. The server successfully processes the request but the response to the client is lost. As far as the server is concerned, the message is gone and there’s no problem, but as far as the client knows, the continued existence of that message is unknown.

Now consider a new message arrives on the mail server and because the RFC says it can, it assigns the same unique ID to this new message as the one that was just deleted. The client eventually reconnects and requests the list of unique IDs and finds the ID of the message it wanted to delete is still there. It doesn’t know the server used its right to reuse unique IDs and that this is actually a new message!

Now, I’ve never seen a mail server actually reuse a unique ID. The clever people who have developed mail servers in the real world seem to understand that reusing IDs is not something you ever want to do, even if the RFC says you can.

RFC 1939 also says, “this specification is intended to permit unique-ids to be calculated as a hash of the message. Clients should be able to handle a situation where two identical copies of a message in a maildrop have the same unique-id.”

Unique IDs don’t have to be unique? Ugh.

This allowance only applies to identical messages. In reality, messages are never identical. After bouncing around the internet and going through various anti-spam and anti-virus servers, messages do accumulate a frightening number of Received: headers left behind from each intermediate hand-over. Each one with a time-stamp and its own ID number. Any one of these is enough to produce a distinct hash.

Picture Credits. (All Creative-Commons licensed.)
“Listening to Radio Karnali” by “BBC World Service”.
“List 84” by “Weisbaden 2010”.
“The Time of Sunset” by Joy Sarah Nawati.
“Future” by “Legosz”.
“PuTTY screen-shots” by me.

Why I willingly bought a Windows Phone

Without shame or apology, I use a Windows Phone. A bright orange Lumia 630. I purchased it with my own money. No-one pushed me to it or chose it for me. It was entirely my decision.

But why?!

Phones

My story starts in 2012 when I had outgrown my aging Symbian phone. After considering a number of options, I purchased an Android-based Samsung Galaxy S2.

I had considered an iPhone at the time, but the main reason I didn’t was that I’d have to buy into the Apple ecosystem, which just wasn’t for me. My primary computer platforms were Windows based and moving to iPhone would be a big culture shock. My Samsung instead fitted into that world quite neatly and I’d remain happy with my choice for years.

Stage Fright!

In 2015, a security vulnerability (known as Stagefright) was found in many versions of Android, including the one on my phone. All it would take was for someone to send me a malicious text message in the night and my phone would be taken over.

Not to worry, new phones had already been fixed and I was sure it would only be a matter of time before that same fix would be pushed out to older phones like mine. Every day for a few weeks, I’d go into the phone’s check-for-updates system to see if a fix was available. Every day, there wasn’t. I’d call tech support to ask when (not if) a fix would become available. “Soon” was always the infuriatingly non-specific answer, occasionally along with the subtle suggestion that maybe I should buy a new handset instead.

Finally, I just couldn’t take it any more and gave up. My phone, despite being only three years old, was considered too old to be updated. The risk of keeping it switched on, waiting for a drive-by attacker, was giving me too much stress. I switched the phone off and put it away, never to be used again.

Normally, there will come a natural time with each phone I use when I start to feel it is time to upgrade, having simply outgrown the old one. When that happens, I keep using the old one while I take my time to consider my choices. This time was different.

It was clear to me now that the Android ecosystem had a problem. Security vulnerabilities were not being taken seriously by the handset makers who would rather I just purchased a new device instead. If I had bought a new Android phone back then, I’d be supporting that attitude with my cash!

Choices

Having lost trust in Android, I was left choosing between Apple and Microsoft. At first, I wasn’t even considering Windows Phone, having had bad experiences with the platform some ten years earlier. But faced with an iPhone as my only remaining choice, I was willing to give the new Windows Phone a try.

Trying out a Lumia 630, I was pleasantly surprised. The tile concept was a welcome relief from the “Space Invader” style rows-of-icons that dominate the rest of the market. Suitably impressed with the whole package I ended up buying one and I’ve not looked back. (Except to write this.)

The lack of apps for this platform is a little annoying, but I get by. I have instant-messaging, a podcast player, a weather tile on the home screen and a few others. For everything else, I use a number of “M Dot” websites. (m.facebook.com, m.youtube.com, etc.)

The Future

How long, after having purchased a smartphone, is it reasonable to expect support in the form of security updates? Back when “Stagefright” happened, I found the answer for the Android ecosystem was 1½ years. That’s just way too short in my book.

My Lumia 630 is around two years old as I write this and I’ve just installed an update that fixes the WPA2 “KRACK” bug. If I had purchased another Android based phone back in 2015, would I now have an update for this new bug? (Or, would I be back down the shops spending more money to enrich the handset makers who are laughing at the chump that I am…)

While I’m not planning on replacing my phone any time soon, it’s likely I’ll feel I’ve outgrown it a couple of years down the road, especially as Microsoft have announced they will not be actively developing the platform any more, except for security updates. When that day comes, I hope Android will have taken a tip from Microsoft on how to do updates right.

Picture Credits
Microsoft Lumia 630 running Podcast Lounge. By me, ironically enough, using an iPhone.
Tension, 91/365 by Matt Harris.
Future by “Legosz”.
(Pictures are Creative Commons licensed.)

Is your API broken?

“Welcome to the Example Rutabaga Company. We’ve got a simple REST API for all your rutabaga needs!”

Indeed, it is simple…

   POST https://rutabaga.example.com/Order/ HTTP/1.1
   Content-Type: application/json

   {"Quantity": 5800,
    "Quality": "Tasty!",
    "DeliverTo": "123 Fake Street, New Orleans"}

Send this and you’ll either get an error or an “OK” response with a tracking ID inside. Later, you’ll get several thousand tasty rutabagas in the post. What could go wrong?

Everything.

Schrödinger’s Response

From the client’s point of view, there’s a clear action to take depending on the response code.

  • 200, log the tracking ID.
  • 5xx, try again later.

But what if there’s no response? Perhaps your friendly HTTP client library code has thrown an exception because the connection has broken down. These errors are unavoidable, especially when the client is on a mobile device. What should we do in this situation?

You could try again later? But hang on, this violates the thing that makes POST different from GET and PUT. (GET and PUT are designed to be safely repeatable, “idempotent” in the jargon, but POST requests are express calls to take action.)

You might reason that the first POST request failed, so you’re not actually repeating anything, but aren’t you? There are two possibilities when you get an error from any sort of network request.

  A. The request was lost on the way and the remote server did not handle the request.
  B. The request arrived and was handled, but the response to the client was lost.

If A, we’re fine to repeat the POST. No problem.
If B, the remote server is already in the process of shipping a truckload of rutabagas to you and has no idea the response got lost. Repeat that request and you’ll end up with two truckloads of rutabagas.

But this is the point: the client has no way of knowing whether it’s A or B. The only entity that knows is the server, and we can’t talk to it.

For a surprising number of APIs I’ve written client code for, that’s the end of the story. The API simply has no reliable way for the client to find out what happened.

How does your API handle this situation? Is your API broken?

Opening the box

One way an API designer could resolve this issue is to provide a way to look up the order history.

This is probably what you’d do if (say) you were shopping online and your internet connection died just as you hit the Complete Purchase button. Once you got back online, you’d check to see if the order was in the system before repeating the order.

Sounds simple? This would work, but be careful, for alas, this approach has lots of caveats. Fortunately, none of them are really insurmountable.

Beware of false duplicates

Say you’re in this worst case scenario and your link to the server has just been restored. Your code dutifully downloads the list of outstanding orders and finds one for 5800 rutabagas. Job done?

Wait! Was that your order? Maybe the account holder deliberately made another identical order from a different machine. We don’t know – We can’t know.

This can be resolved by ensuring the client has the opportunity to supply its own way to identify the initial request – perhaps with a client-supplied ID – and allowing for a lookup later on.
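Here’s a sketch of the client’s side, in Python with the third-party requests library. The ClientRequestId field and the lookup URL are hypothetical, standing in for whatever your API provides:

   import uuid
   import requests

   # The client picks its own unique ID before sending anything. A
   # correctly-generated GUID needs no co-ordination with anyone else.
   request_id = str(uuid.uuid4())
   order = {
       "ClientRequestId": request_id,  # hypothetical field
       "Quantity": 5800,
       "Quality": "Tasty!",
       "DeliverTo": "123 Fake Street, New Orleans",
   }

   try:
       response = requests.post("https://rutabaga.example.com/Order/",
                                json=order, timeout=30)
       response.raise_for_status()
   except requests.RequestException:
       # No usable response - we don't know if the order was placed.
       # Ask the server, using the ID only we could have generated.
       check = requests.get("https://rutabaga.example.com/Order/ByClientId/"
                            + request_id, timeout=30)  # hypothetical endpoint
       if check.status_code == 404:
           pass  # The order never arrived; it is safe to repeat the POST.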

How long should we keep that ID around?

Expire ID records too quickly and a client that’s been offline for a prolonged amount of time will not be able to resynchronize. Store the IDs forever and that would be a waste of space.

You may have a figure in mind that’s reasonable. If not, add an occasional reconciliation of expired IDs to your API.

Who chooses the ID?

The client should be able to freely choose an ID. You may be looking at your database and thinking there’s a field supplied by the client that already has a no-duplicates constraint. But if those values come from a source external to the client, the client won’t be able to control the uniqueness of those important values. That external entity might very well be feeding identical records into the system through different channels, and the client won’t know if the duplicate it found was its own or someone else’s.

Whose ID is it anyway?

Make sure the client has a clear space from which to select IDs. We can’t have multiple users all counting from 1 because you’ll get collisions very quickly. A GUID would work, as long as it is generated correctly. Maybe if the API requires that the client log in first, the server could track IDs on a per-user basis, but not all APIs require a log-in or pre-registration.

Avoid colliding with prior attempts still being processed.

Consider this: A client attempts to send a request to a server, but the connection fails with a time-out error. Thirty seconds later, the client asks the server if that prior request made it, to which the server answers “No”. Time to repeat that first attempt?

But wait! That first attempt timed out because the server was unexpectedly busy and has only just started dealing with your first request.

You can mitigate this (probably rare) scenario by making sure the server will return an error to the second POST request. Almost all databases allow any field or combination of fields to have a uniqueness constraint, so the error will happen by itself if this scenario ever plays out.
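A minimal sketch of the server’s side of that, using SQLite here, though any database with unique constraints behaves the same way (the table and column names are made up for illustration):

   import sqlite3

   db = sqlite3.connect(":memory:")
   db.execute("""CREATE TABLE orders (
                     client_request_id TEXT UNIQUE,  -- one row per client ID
                     quantity INTEGER)""")

   def insert_order(client_request_id, quantity):
       try:
           with db:  # commit on success, roll back on error
               db.execute("INSERT INTO orders VALUES (?, ?)",
                          (client_request_id, quantity))
           return "created"
       except sqlite3.IntegrityError:
           # A second POST with the same ID lands here, whether it's a client
           # retry or a prior attempt the server is still working through.
           return "duplicate"

   print(insert_order("abc-123", 5800))  # created
   print(insert_order("abc-123", 5800))  # duplicate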

Do you have a ticket?

There’s another protocol that works in a similar way but puts the server in control of the IDs, at the cost of requiring two separate phases. (The actual request could be carried along with either the first or the second phase.)

The first phase has the client asking the server for an ID while the second phase has the client committing to complete the transaction with that ID.

This protocol does require that once the client begins phase two, it has committed to not returning to phase one for this transaction. The client must also store that ID and be ready to use it once the connection has been restored. Similarly, the server needs to agree that it only starts processing a transaction once the second-phase request has arrived.

This two-phase approach covers for failures at any step along the conversation, so long as the client and server stick to the agreement.

  • If the first request is lost, there’s no problem in repeating the first phase.
  • If the first response is lost, the server will have allocated an ID that will never be committed, but will be left indefinitely in an uncommitted state. (A later occasional reconciliation of orphaned IDs would be useful here.)
  • If the second request is lost, the client can later repeat the commitment of the transaction after checking its state using the ID it received in the first phase.
  • If the second response is lost, the client can later check the state of the transaction using the ID and see that it is already committed.
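To make the agreement concrete, here’s a sketch of the client’s half of the dance, again in Python with requests. The /Order/Begin/ and /Order/Commit/ endpoints and the response shape are hypothetical:

   import json
   import requests

   BASE = "https://rutabaga.example.com"

   def save_to_durable_storage(transaction_id, order):
       # Stand-in: a real client writes this somewhere that survives a crash.
       with open("pending-" + transaction_id + ".json", "w") as f:
           json.dump(order, f)

   def place_order(order):
       # Phase one: ask the server for an ID. Safe to repeat until it works.
       while True:
           try:
               begin = requests.post(BASE + "/Order/Begin/", timeout=30)
               begin.raise_for_status()
               transaction_id = begin.json()["Id"]  # hypothetical shape
               break
           except requests.RequestException:
               continue  # a real client would back off between attempts

       # We now have an ID. From here on, we are committed to it and must
       # never go back to phase one for this transaction.
       save_to_durable_storage(transaction_id, order)

       # Phase two: commit. Also safe to repeat, because the server agrees
       # to act only once per ID.
       while True:
           try:
               commit = requests.post(BASE + "/Order/Commit/" + transaction_id,
                                      json=order, timeout=30)
               commit.raise_for_status()
               return transaction_id
           except requests.RequestException:
               continue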

This protocol has a similar caveat to the earlier plan: how long should the server keep track of used ID numbers? The server will be left with IDs that will never be committed, as well as committed IDs that the client might still need to check up on later. Again, you may wish to come up with reasonable time limits or allow for a reconciliation of IDs later on.

While this protocol might be considered more complicated because of the two phases of conversation, there are fewer caveats and fewer opportunities for things to go wrong. This is my personal favorite.

Do I really need to do this?

As I write this, I’m also working on a small web service that uses a REST API with POST requests, yet I’m taking none of the advice I offer on this page. Why not? Simply that the cost of the resources allocated by this API-to-be is so close to zero that making the effort to implement the API robustly is just not worth it in this particular case.

But consider, even if you’re not transmitting invoices worth thousands of dollars, do you really want duplicates turning up?

Picture Credits
“Rutabagas” by Dale Calder
“Barney the cat” by Bill P. Godfrey (me).
“Rutabaga 2” by Dolan Halbrook
“Commit no nuisance” by Pat Joyce

I need a good podcast catcher (and a bit of a rant)

I listen to podcasts on my daily commute. These are radio shows that can be downloaded over the internet and listened to later. However, to keep up with a weekly show, I’d have to – every week – visit the show’s website and manually download the latest episode. That would get real tedious real fast. To resolve the tedium for us all, the podcast catcher app was invented.

Podcast catchers allow me to list all the shows I want to listen to. Every day or so, the app automatically checks each show on the list to see if there are any new episodes for me. If it finds any, it downloads them and plays them for me.

Currently, I use Google’s ‘Listen’ app, but that service is about to be closed down along with the imminent closure of Google Reader. I need to replace it. I’ve downloaded a handful of alternative apps, but they all lacked a feature I find essential. I remain a little flabbergasted that any podcast app out there does it any other way.

“She smoothes her hair with automatic hand and puts a record on the gramophone.”

My daily commute is ~45 minutes of driving each way, so for me, a good player needs an Auto-Play mode. When one show finishes, another should start playing right away. There are very few places I could safely pull over, and having to push buttons while I’m driving is right out.

But not just any Auto-Play mode. Oh no. All the apps I tried had an Auto-Play mode, but they all did it so very badly.

Ask yourself – When a show finishes playing and Auto-Play is switched on, which show from the list of unplayed shows should your app select to play next?
   A. The one that’s been waiting in the queue longest.
   B. The one that appears next in the list when sorted by episode title.

Did you pick A or B? Sorry, they’re both wrong, and yet these were the only options available on an awful lot of podcast apps.

The right answer is to play the one the user has queued up next. The “In the order I want” sort criterion. No really, who is actually asking for the order of play-back to be strictly enforced? Would anything else, perhaps, offend your sense of politeness?

   “You want to listen to the latest Cognitive Dissonance show? But what about this episode of Hanselminutes? It has been waiting patiently in line and this is its turn to be played.”
   “I say! That would be jolly impolite of me. Don’t want to hurt the feelings of those audio files. Pip pip!”

“I sat upon the shore, fishing with the arid plain behind me. Shall I at least set my lands in order?”

With Google Listen, new episodes join the listening queue, but I can arrange them in the order I like. If I’m just not in the mood for the next episode in line, I’ll select another episode that I do want to listen to and bring it to the top using the ‘Move to the top of queue’ button.

Once I’m happy with my selection of the next hour or so’s worth of stuff at the top of the queue, I hit play and drive off. As the first show finishes, it’s taken off the queue and the next episode I had queued up starts playing, all without any interaction.

The few alternative apps I downloaded did not offer this. It seems such a simple thing, and yet I can’t fathom the insanity of not being able to control the playing order.

If one, settling a pillow by her head should say, “That is not what I meant at all.”

Some people reading this, I’m sure, are thinking “He wants a playlist manager”.

To manage a playlist, you’d need to first create a playlist and give it a name. Then you’d need to add shows to the list and save it. Then, once it’s played, you’d need to delete that playlist and start a new one.

No. That’s just another level of insanity. All I want is a button on each episode labelled ‘Move to the top of the queue’. That’s it. If I have to perform some ritual every day to create a new playlist or whatever before I can get that button, I’m not going to be happy. Life is too short for pointless ritual.

Maybe if your UI is so user-friendly that the ritualistic parts of your playlist manager just disappear, that’s fine, but that’s not what I’ve seen out there.

“Oh, I have to choose a name for this new playlist. Why not just pick a random name for me? I’m only going to delete it in an hour’s time anyway.”

So there is my plea. Does anyone know of a podcast app for Android phones that implements its Auto-Play mode… correctly? I will happily pay a reasonable subscription fee for good quality software.

If you’re an app developer and your podcast app does it correctly, please feel free to use this page’s comments for some free publicity. On the other hand if your app doesn’t do it right, please treat this page as a bug report.

Picture credits:
“Day 30.06 Voices on the radio!” by Frerieke on Flickr.
“Listening to Radio Karnali” by the BBC World Service.
The section titles were borrowed from The Waste Land and The Love-Song of J. Alfred Prufrock, both by T.S. Eliot.