The Expanding Universe of Unicode

Before Unicode, digital text lived in a fragmented world of 8-bit encodings. ASCII had settled in as the good-enough-for-English core, taking up the first half of codes, but the other half was a mish-mash of regional code pages that mapped characters differently depending on locale. One set for accented Latin letters, another set for Cyrillic.

Each system carried its own assumptions, collisions, and blind spots. Unicode emerged as a unifying vision: a single character set for all human languages, built on a 16-bit foundation. All developers had to do was swap their 8-bit loops for 16-bit loops. Some bristled that half the bytes were all zeros, but this was for the greater good.

16 bits gave 65,536 code points. It was a bold expansion from the cramped quarters of ASCII, a ceremonial leap into linguistic universality. This was enough, it was thought, to encode the entirety of written expression. After all, how many characters could the world possibly need?

“Remember this girls. None of you can be first, but all of you can be next.”

🐹 I absolutely UTF-8 those zero bytes.

It was in this world of 16-bit Unicode that UTF-8 emerged. This had the notable benefit of being compatible with 7-bit ASCII, using the upper half of the byte range (the values ASCII never uses) to encode the non-ASCII side of Unicode as multi-byte sequences.

If your code knew how to work with ASCII it would probably work with UTF-8 without any changes needed. So long as it passed over those multi-byte sequences without attempting to interpret them, you’d be fine. The trade-off was that while ASCII characters only took up one byte, most of Unicode took three bytes, with the letters-with-accents occupying the two-bytes-per-character range.

This wasn’t the hard limit of UTF-8. The initial design allowed for up to 31-bit character codes. Plenty of room for expansion!
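
To put rough numbers on it, here’s a small Python sketch of how many bytes a code point would have needed under that original design. (It’s a simplification that only counts bytes; the real rules also pin down the exact bit patterns.)

def utf8_length_original(code_point):
    # Byte count under the original 31-bit UTF-8 design (simplified).
    for length, bits in [(1, 7), (2, 11), (3, 16), (4, 21), (5, 26), (6, 31)]:
        if code_point < (1 << bits):
            return length
    raise ValueError("beyond 31 bits")

print(utf8_length_original(ord("A")))    # 1 - ASCII stays one byte
print(utf8_length_original(ord("é")))    # 2 - accented Latin letters
print(utf8_length_original(0x4E2D))      # 3 - most of the rest of 16-bit Unicode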

🔨 Knocking on UTF-16’s door.

As linguistic diversity, historical scripts, emoji, and symbolic notations clamoured for representation, the Unicode Consortium realised their neat two-byte packages would not be enough and needed to be extended. The world could have moved over to UTF-8, where there was plenty of room, but too many systems had 16-bit Unicode baked in.

The community that doggedly stuck with ASCII and its 8-bits-per-character design must have felt a bit smug seeing the rest of the world move to 16-bit Unicode. They stuck with their good-enough-for-English encoding and were rewarded with UTF-8, with its ASCII compatibility and plenty of room for expansion. Meanwhile, those early adopters who made the effort to move to the purity of a fixed-size 16-bit encoding were told that their characters weren’t going to be fixed size any more.

This would be the plan to move beyond the 65,536 limit. Two unused blocks of 1024 codes were set aside. If you wanted a character in the original range of 16-bit values, you’d use the 16-bit code as normal, but if you wanted a character from the new extended space, you had to put two 16-bit codes from these blocks together. The first 16-bit code gave you 10 bits (1024 = 2¹⁰) and the second 16-bit code gave you 10 more bits, making 20 bits in total.

(Incidentally, we need two separate blocks to allow for self-synchronization. If we only had one block of 1024 codes, we could not drop into the middle of a stream of 16-bit codes and simply start reading. It is only by having two blocks that, if the first 16-bit code you read is from the second block, you know to discard it and continue afresh from the next one.)
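
Here’s a minimal Python sketch of that pairing, using the real block values (0xD800 for the first block, 0xDC00 for the second):

def encode_surrogate_pair(code_point):
    # Assumes a code point in the extended range, above 0xFFFF.
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # first block carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # second block carries the bottom 10 bits
    return high, low

def decode_surrogate_pair(high, low):
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(x) for x in encode_surrogate_pair(0x1F439)])   # ['0xd83d', '0xdc39'] - the hamster
print(hex(decode_surrogate_pair(0xD83D, 0xDC39)))         # '0x1f439'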

The original Unicode was rechristened the “Basic Multilingual Plane”, or plane zero, while the 20-bit codes allowed by this new encoding were split into 16 separate “planes” of 65,536 codes each, numbered from 1 to (hexadecimal) 10. UTF-16, with its just over a million possible codes, was born.

UTF-8 was standardized to match UTF-16 limits. Plane zero characters were represented by one, two or three byte sequences as before, but the new extended planes required four byte sequences. The longer byte sequences were still there but cordoned off with a “Here be dragons” sign, their byte patterns declared meaningless.

“Don’t need quarters, don’t need dimes, to call a friend of mine. Don’t need computer or TV to have a real good time.”

🧩 What If We Run Out Again?

Unicode’s architects once believed 64K code points would suffice. Then they expanded to a little over a million. But what if we run out again?

It’s not as far-fetched as it sounds. Scripts evolve. Emoji proliferate. Symbolic domains—mathematical, musical, magical—keep expanding. And if humanity ever starts encoding dreams, gestures, or interspecies diplomacy, we might need more.

Fortunately, UTF-8 is quietly prepared. Recall that its original design allowed for up to 31-bit code points, using up to 6 bytes per character. The technical definition of UTF-8 restricts itself to 21 bits, but the scaffolding for expansion is still there.

On the other hand, UTF-16 was never designed to handle more than a million codes. There’s no large unused block of codes left in plane zero from which to squeeze out more bits. But what if we need more?

For now, we can relax a little because we’re way short. Of the 17 planes, only the first four and last three have any codes allocated to them. Ten planes are unused. Could we pull the same trick with that unused space again?

🧮 An Encoding Scheme for UTF-16X

Let’s say we do decide to extend UTF-16 to 31 bits in order to match UTF-8’s original ceiling. Here’s a proposal:

  • Planes C and D (0xC0000 to 0xDFFFF) are mostly unused, aside from two reserved codes at the end of each.
  • We designate 49152 codes (2¹⁴+2¹⁵) from each plane as encoding units. This number is close to √2³¹, making it a natural fit.
  • A Plane C code followed by a Plane D code form a composite: (C×49152+D)
  • This yields over 2.4 billion combinations, which is more than enough to cover the 31-bit space.

This leaves us with these encoding patterns:

  • Basic Unicode is represented by a single 16-bit code.
  • The 16 extended planes by two 16-bit codes.
  • The remaining 31-bit space by a pair of codes from the C and D planes, which themselves take four 16-bit codes (two surrogate pairs).

This scheme would require new decoder logic, but it mirrors the original surrogate pair trick with mathematical grace. It’s a ritual echo, scaled to the future. Code that only knows about the 17 planes will continue to work with this encoding as long as it simply passes the codes along rather than trying to apply any meaning to them, just like UTF-8 does.

🔥 An Encoding and Decoding Example

Let’s say we want to encode a Unicode code point 123456789 using the UTF-16X proposal above.

To encode into a plane C and plane D pair, divide and mod by 49152:

  • Plane C index: C = floor(123456789 / 49152) = 2511
  • Plane D index: D = 123456789 % 49152 = 36117

To get the actual UTF-16 values, add accordingly:

  • Plane C code: 0xC0000 + 2511 = 0xC09CF
  • Plane D code: 0xD0000 + 36117 = 0xD8D15

To decode these two UTF-16 codes back, mask off the plane bits to recover the two indexes, then multiply and add:

2511 × 49152 + 36117 = 123456789
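
Here’s that worked example as a minimal Python sketch of the hypothetical UTF-16X scheme (the function names are mine):

UNITS = 49152            # 2**14 + 2**15 encoding units per plane
PLANE_C = 0xC0000
PLANE_D = 0xD0000

def encode_utf16x(code_point):
    c, d = divmod(code_point, UNITS)
    return PLANE_C + c, PLANE_D + d

def decode_utf16x(c_code, d_code):
    return (c_code - PLANE_C) * UNITS + (d_code - PLANE_D)

print([hex(x) for x in encode_utf16x(123456789)])   # ['0xc09cf', '0xd8d15']
print(decode_utf16x(0xC09CF, 0xD8D15))              # 123456789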

🧠 Reader’s Exercise

Try rewriting the encoding and decoding steps above using only bitwise operations. Remember that 49,152 was chosen for its bit pattern and that you can replace multiplication and division with combinations of shifts and additions.

🌌 The Threshold of Plane B

Unicode’s expansion has been deliberate, almost ceremonial. Planes 4 through A remain largely untouched, a leisurely frontier for future scripts, symbols, and ceremonial glyphs. We allocate codes as needed, with time to reflect, revise, and ritualize.

But once Plane B begins to fill—once we cross into 0xB0000—we’ll be standing at a threshold. That’s the moment to decide how, or if, we go beyond.

As I write this, around a third of all possible code-points have been allocated. What will we be thinking that day in the future? Will those last few blocks be enough for what we need? Whatever we choose, it should be deliberate. Not just a technical fix, but a narrative decision. A moment of protocol poetry.

Because encoding isn’t just compression—it’s commitment. And Plane B is where the future begins.

“I could say Bella Bella, even Sehr Wunderbar. Each language only helps me tell you how grand you are.”

Credits
📸 “Dasha in a bun” by Keri Ivy. (Creative Commons)
📸 “Toco Toucan” by Bernard Dupont. (Creative Commons)
📸 “No Public Access” by me.
🤖 Editorial assistance and ceremonial decoding provided by Echoquill, my AI collaborator.

Sixgate Part 1½: Why Not Tunnel Services?

I am grateful to “Lenny S” for making a comment on part two of this series, as it has revealed that I really need to make it clearer exactly what the point of Sixgate is. IPv6-over-IPv4 tunnels have existed for decades but I’m trying to solve a different problem.

(If you’ve no idea what any of this is about, maybe start at part one, you dingus.)

The Server’s Problem

Technologies like CGNAT have made IPv4 “good enough” for ISPs. Moving to IPv6 would require a whole bunch of new equipment and no-one other than a few nerds is actually asking for it. This has been the status quo for decades and we realistically expect large chunks of the Internet to not be able to receive incoming connections.

As a server operator, I might want to deploy new services on IPv6‑only infrastructure. There’s no good equivalent of CGNAT on the server side and IPv4 addresses are scarce, expensive and require careful planning. I want to stop burning through them just to keep compatibility alive, but I can’t do that while many of my customers are still behind IPv4‑only ISPs.

From the user’s perspective, their internet connection “just works”. They don’t know what IPv4 or IPv6 is and they shouldn’t have to. If they try to connect to my service and it fails, they won’t start thinking they should sign up for a tunnel, they’ll think, quite reasonably, that my website is broken.

Tunnels put the burden on the least‑equipped party: the end‑user.

  • They require sign‑ups, configuration, and sometimes payment.
  • They assume technical knowledge that most customers simply don’t have.
  • They create friction at exactly the wrong place: the moment a customer is deciding whether my service is trustworthy.

Telling a potential customer to “go fix your internet” is not a viable business model.

“Your smile is like a breath of spring. Your voice is soft like summer rain. I cannot compete with you.”

The Sixgate Approach

This is where Sixgate changes the equation. Instead of asking customers to fix their connectivity, make the gateway discoverable through DNS.

  • The SRV record tells the client where the gateway is.
  • The client software (browser, OS, or library) can then use the gateway invisibly.

From the customer’s perspective, nothing changes. They click a link and it works. The SRV lookup adds a moment’s pause, but that’s the price of invisibility. No sign‑ups, no extra services, no confusion.

The SRV record is the keystone of Sixgate’s design. Without it, the bridge collapses into a pile of disconnected ideas. With that SRV record retrieved, the client doesn’t need to sign up for an account or perform a pre-connection ceremony. The remote network has provided the gateway and they want you to be able to connect to them with as little fuss as possible. Everything else rests on that stone. Place it firmly, and the whole arch of compatibility stands.

The Tipping Point

Over time, as more clients can connect to IPv6 servers either natively or through Sixgate, we reach a tipping point. Enough of the world can reach IPv6 that new data centres can seriously consider not deploying IPv4 at all.

That’s the goal. As long as server networks still need IPv4, we’re still going to have the problems of IPv4. If we can work around the ISPs who won’t update their equipment then IPv6 might finally stand on its own.

In Part Two, we’ll explore how Sixgate works under the hood. The SRV lookup, the encapsulation, the stateless routing, and the embedded IPv4 identity that makes it all possible.

Credits
📸 “DSC08463 – El Galeón” by Dennis Jarvis. (Creative Commons)
🤖 With thanks to Echoquill, my robotic assistant, for helping shape this interlude — from the keystone to the tipping point and liberal use of em-dashes.

Sixgate: Technical Overview of the IPv4-to-IPv6 Gateway Mechanism

Last time, we introduced Sixgate as a simple way for IPv4-only clients to automatically reach IPv6-only servers. The idea is to let server operators drop IPv4 entirely — except for a single gateway — while still remaining accessible to legacy clients. It’s a bridge, not a wall.

This post explains how Sixgate works in practice: how clients discover the gateway, how packets are encapsulated, and how responses are routed — all without requiring session state.

This page as published is version 0.1. If I make a substantive edit I will update the version.

🔧 Core Components

Sixgate relies on four key elements:

  1. SRV Records in Reverse DNS Zones
  2. UDP-Based IPv6 Packet Encapsulation
  3. Gateway Forwarding with Embedded IPv4 Identity
  4. Stateless Response Routing

1. Gateway Discovery via SRV Records

When a client receives an IPv6 address from DNS but cannot connect due to lack of IPv6 support, it performs a fallback SRV query to discover a gateway.

SRV Query Format

The query uses the reverse-nibble format of the IPv6 address, similar to ip6.arpa, but requests a service record:

_sixgate._udp.<addr>.ip6.arpa

For example, for IPv6 address 2001:db8::1, the query becomes:

_sixgate._udp.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.b.d.0.1.0.0.2.ip6.arpa

The SRV record returns:

  • Target: IPv4-accessible hostname of the gateway. (For example, gw.example.com)
  • Port: The UDP port listening for Sixgate packets on that IPv4 host. (For example, 2820)

The client then performs an A record lookup to resolve the IPv4 address of the gateway.
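
For the curious, here’s a small Python sketch of that name construction. (The helper name is mine; only the _sixgate._udp prefix and the reverse-nibble ip6.arpa form come from the proposal above.)

import ipaddress

def sixgate_srv_name(ipv6_text):
    # Build the reverse-nibble SRV query name for the given IPv6 address.
    nibbles = ipaddress.IPv6Address(ipv6_text).exploded.replace(":", "")
    return "_sixgate._udp." + ".".join(reversed(nibbles)) + ".ip6.arpa"

print(sixgate_srv_name("2001:db8::1"))
# _sixgate._udp.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.b.d.0.1.0.0.2.ip6.arpa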

2. UDP Encapsulation of IPv6 Packets

The client constructs a standard IPv6 packet as if it had native connectivity. It then wraps this packet in a UDP payload and sends it to the gateway’s IPv4 address and designated port.

  • Outer Layer: IPv4 + UDP
  • Payload: Raw IPv6 packet (including headers and data)

This is similar to Teredo’s encapsulation model but reversed in direction and purpose.

3. Gateway Forwarding with Embedded IPv4 Identity

Upon receiving the UDP packet, the gateway:

  • Extracts the IPv6 packet from the payload.
  • Rewrites the source IPv6 address to one of its own — embedding the client’s IPv4 address and UDP port into the lower 48 bits.
  • Forwards the packet to the target IPv6-only server.

This rewriting is necessary because the original client has no routable IPv6 address. By embedding the client’s identity into the source address, the gateway enables stateless response routing and preserves visibility for server-side analysis.

Encoding Format

Assuming the gateway controls a /48 prefix (2001:db8:abcd::/48), it constructs 2001:db8:abcd::xxxx:yyyy:zzzz where:

  • xxxx:yyyy encodes the client’s IPv4 address (192.0.2.1 becomes c000:0201)
  • zzzz encodes the client’s UDP port (decimal 2820 becomes 0b04 in hex)

This allows the gateway to reconstruct the client’s IPv4 address and UDP port from the destination address of the server’s response and encapsulate the response to send it back, all without maintaining a session table. The server will see one of the gateway’s unique IPv6 source addresses and may (if it knows which range of IPv6 addresses belongs to the gateway) extract the embedded IPv4 identity for logging, rate limiting, or application-layer logic.
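
Here’s a rough Python sketch of that embedding, using the illustrative /48 prefix from above (the function names are mine):

import ipaddress

PREFIX = int(ipaddress.IPv6Address("2001:db8:abcd::"))    # the gateway's /48, illustrative

def embed_client(ipv4_text, udp_port):
    # Pack the client's IPv4 address into bits 47..16 and its UDP port into bits 15..0.
    v4 = int(ipaddress.IPv4Address(ipv4_text))
    return ipaddress.IPv6Address(PREFIX | (v4 << 16) | udp_port)

def extract_client(ipv6_address):
    # Recover the IPv4 address and port from the lower 48 bits.
    low48 = int(ipv6_address) & ((1 << 48) - 1)
    return str(ipaddress.IPv4Address(low48 >> 16)), low48 & 0xFFFF

address = embed_client("192.0.2.1", 2820)
print(address)                   # 2001:db8:abcd::c000:201:b04
print(extract_client(address))   # ('192.0.2.1', 2820)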

This stateless design is the default behavior for Sixgate. It simplifies implementation, improves scalability, and ensures that the server can still perform meaningful analysis of client identity.

🧱 Deployment Realities

  • IPv4 Address Requirement: Each IPv6-only cluster must maintain a single IPv4 address for the gateway. This is far lighter than full dual-stack hosting.
  • DNS Availability: DNS must remain reachable over IPv4 to resolve both the original AAAA record and the SRV fallback.
  • Firewall Traversal: Clients must be able to send outbound UDP packets to arbitrary destinations and receive UDP responses from the gateway.

🧪 Prototype and Standardization

Sixgate can be prototyped as:

  • A browser extension or OS-level library
  • A reference gateway daemon
  • An experimental IETF draft defining SRV usage and encapsulation format

🌉 Summary

Sixgate offers a practical, decentralized way to let IPv6 stand alone — without leaving anyone behind. By shifting compatibility to the edge and leveraging DNS for discovery, it enables graceful IPv6-only deployments while preserving access for legacy clients.

If you’re interested in implementing or extending Sixgate, I’d love to collaborate. Let’s build the bridge and let IPv6 finally stand on its own.

Coming soon: We’ll explore Sixgate’s security considerations — including abuse prevention, spoofing risks, and how gateway operators can balance openness with protection. If Sixgate is to be deployed in the wild, it must be safe as well as simple.

Credits
🦉 Written and published by me, Bill P. Godfrey.
✍️ Editorial assistance from Echoquill, my robotic assistant.

Sixgate – IPv6 Without Leaving Anyone Behind

The internet is slowly, stubbornly moving toward IPv6. Server operators are increasingly comfortable deploying services on IPv6-only infrastructure, but there’s a catch. Many clients still live in IPv4-only environments, especially those served by legacy ISPs or locked into older hardware. This creates a frustrating asymmetry. Any website going IPv6-only will risk cutting off a portion of their audience.

We’ve been here before. Windows XP did not support SNI and website operators had to dedicate a full IPv4 address to each secure domain. Until XP faded out, many sites avoided HTTPS entirely. IPv6 faces a similar hesitation. Operators won’t go IPv6-only while legacy clients remain stranded.

IPv6 was meant to free us from the confines of IPv4, yet ISPs happy to maintain the status quo are holding everyone back.

Sixgate is a proposal to change that.

Before we dig deeper, I am painfully aware that the idea seems a little obvious. I can’t help the feeling that this must have been thought of already but there’s some problem which is why I can’t find any discussion of it. Maybe I am the first to think of this. I humbly await comments telling me exactly how wrong I am.

🌉 What Is Sixgate?

Sixgate is a lightweight mechanism that allows IPv4-only clients to reach IPv6-only servers without requiring IPv4 infrastructure. It works by letting clients automatically discover and use a gateway operated by the server network to tunnel IPv6 packets over IPv4.

Here’s how Sixgate bridges the gap:

  1. The client attempts to connect to a website and receives only an IPv6 address from DNS. The client knows that it cannot connect, either due to a missing IPv6 configuration or a previously failed attempt.
  2. The client performs a second DNS query, asking for a special SRV record associated with the IPv6 address, published in the same zone as reverse DNS for that IP address.
  3. The SRV record returns an IPv4-accessible gateway, operated by the website’s network.
  4. The client wraps its IPv6 packet in a UDP envelope and sends it to the gateway.
  5. The gateway unwraps the packet, rewrites the source address to its own IPv6 identity and forwards it to the server.
  6. Responses follow the same path in reverse, allowing the IPv4-only client to communicate seamlessly with the IPv6-only service.

🛠 Practical Realities

Even an IPv6-only cluster will need a single IPv4 address to operate the gateway. That’s a small concession. Far less costly than maintaining IPv4 across all services. The gateway becomes a focused point of compatibility, not a sprawling legacy burden. The gateway itself need not be part of the server farm it supports, but should be close to the same network path that normal IPv6 traffic takes.

Additionally, DNS itself must remain reachable over IPv4, at least for the foreseeable future. Clients need to resolve both the original IPv6 address and the SRV record for the gateway. Fortunately, DNS infrastructure is already deeply entrenched in IPv4, and this proposal doesn’t require any changes to that foundation.

“Ulysses, Ulysses, soaring through all the galaxies. In search of Earth, flying into the night.”

🚀 Why Sixgate Matters

The beauty of Sixgate is that it shifts the burden away from ISPs and toward software. Updating an operating system or browser to support this fall-back logic is vastly easier than convincing hundreds of ISPs to overhaul their networks. Software updates can be rolled out in weeks. ISP transitions take years — sometimes decades.

By giving server operators a graceful way to drop IPv4, we accelerate the transition without leaving legacy clients behind. It’s a bridge, not a wall.

🔭 What Comes Next?

This is a sketch — but it’s one that could be prototyped quickly. A browser extension, a client library, or even a reference implementation could demonstrate its viability. From there, it could be standardized, adopted, and quietly become part of the internet’s connective tissue.

If you’re a server operator dreaming of an IPv6-only future, this might be your missing piece. And if you’re a protocol designer or systems thinker, I’d love to hear your thoughts.

Let’s build the bridge — and let IPv6 finally stand on its own.

Part 1½. Answering the question of why we need this when IPv6-over-IPv4 tunnel services already exist. (Added after I had already written part 2.)

Part Two. For those curious about how Sixgate works under the hood — from SRV record discovery to stateless response routing.

Credits:
📸 “Snow Scot” by Peeja. (With permission.)
🤖 Microsoft Copilot for rubber-ducking and help refining my text.

Rufus – An Adventure in Downloading

I needed to make a bootable USB. Simple task, right? My aging Windows 10 machine couldn’t upgrade to 11 and Ubuntu seemed like the obvious next step.

Downloading Rufus, the tiny tool everyone recommends, turned out to be less of a utility and more of a trust exercise. Between misleading ads, ambiguous signatures and the creeping dread of running an EXE as administrator, I found myself wondering how something so basic became so fraught.

Click Here to regret everything…

Here’s what I saw when I browsed to rufus.ie:

“Her weapons were her crystal eyes, making every man mad.”

I’ve redacted the name of the product being advertised. This isn’t really about them and they may very well be legitimate advertisers. Point is, I have no idea if they’re dodgy or not. I’m here to download the Rufus app thanks very much. I’m fortunate enough to have been around long enough to recognise an ad but I wonder how someone else who might be following instructions to “Download Rufus from rufus.ie” would cope.

Wading through the ads, I found the link that probably had the EXE I actually wanted. Hovering my pointer over the link showed a reasonable-looking URL. I clicked…

“She’s got it! Yeah baby, she’s got it!”

At some point during my clicking around, two EXEs were deposited in my “Downloads” folder. It looked like the same EXE but one had “(1)” on the end, so I had probably downloaded it twice. I right-clicked the file and looked for the expected digital signature: Akeo Consulting.

Even now, am I quite certain that this “Akeo Consulting” is the right one? Could one of those dodgy-looking advertisers have formed their own company that’s also called Akeo Consulting but in a different place, in order to get a legitimate digital signature onto their own EXE? And this is an executable I’d need to run as administrator, with no restrictions.

At the end of the day, I am complaining about something someone is doing for free. I can already hear the comments that I’m “free to build my own”. I know how much it costs to run a website, especially one that’s probably experiencing a sudden spike in traffic while people find they need to move from Windows 10.

I’m not blaming this project, I’m blaming society. If the Rufus Project had to choose between accepting advertiser money to keep the lights on or shutting down, I’m not going to tell them they should have chosen the latter option. But if this is where we are as a society, we’ve made a mistake along the way.

Credits:
🔨 The Rufus Project, for their (at the end of the day) very useful app.
🤖 Microsoft Copilot for spelling/grammar checking, reviews and rubber-ducking.

The UKIP Effect – How to Win by Not Winning

UKIP, the British political party, is a failure.

Since their launch in the 90s, their peak of power was two members of parliament, but they now have none. Their most prominent party leader, who you might reasonably expect to be the most successful, ran for election to parliament a total of seven times and won none of them. (He has since been elected as an MP in 2024, long after leaving his leadership post.)

Since his resignation as leader in 2016, they’ve churned through around 11 party leaders. It looks like they keep resigning after getting bored, the shortest lasting 18 days. I’ve even joked about running for party leader myself.

“Bill for UKIP Leader! Why not give a Euro-enthusiast a go?”

And yet, the UK is not a member of the European Union anymore. Far from being a failure, they might be the most successful political party ever!

How did it happen?

“They sailed away for a year and a day, to the land where the bong-tree grows.”

Winning Without Seats

They never governed. They never held power. They barely held seats. And yet, they bent the arc of British history.

UKIP didn’t win elections, they warped them. Like a black hole in the political field, they pulled the discourse toward Euroscepticism and toward a referendum. The mainstream parties, once content to grumble about bendy bananas, suddenly found themselves triangulating around Nigel Farage’s pint-and-flag persona. Not because they admired it, but because it worked.

And that’s the strange success. UKIP didn’t need to win, they needed to make winning impossible without addressing their cause. They became the ghost in every campaign room. The reason David Cameron promised a referendum that he never wanted to hold nor take any responsibility for.

It’s a kind of political parasitism. Infect the host, rewrite the DNA, and vanish. No seats, no legacy, no infrastructure, but plenty of impact. They proved that you don’t need to govern to change everything. You just need to haunt the system long enough that it starts to dream your dreams.

It only makes sense when you understand the machinery it exploited. In the UK, we don’t vote for a prime minister but for our local MP. The party with enough MPs forms the government. That means national sentiment is filtered through hundreds of local contests, each decided by a simple rule: whoever gets the most votes wins.

This is a system that favours blunt choices. Within each constituency, if two candidates share similar views, they risk splitting the vote and handing victory to someone neither of them agrees with. This is called the “spoiler effect”. It means that standing on principle can mean losing on numbers.

The result is that simplicity is rewarded and nuance punished. The more finely you slice a viewpoint, the less likely it is to win. UKIP thrived in this system not by winning seats, but by threatening to spoil them.

The big parties had to steal their clothes. A Conservative candidate in a marginal seat couldn’t afford to ignore UKIP’s talking points. A handful of disgruntled voters could very realistically swing the result.

Then came the Brexit referendum. It didn’t happen because UKIP demanded it, but because the Conservative Party feared what would happen if they didn’t do it. UKIP didn’t force the vote but haunted it into existence.

It’s a strange kind of democratic judo to use the system’s quirks against itself. Exploit the spoiler effect not to win, but to warp. They made their presence felt in every calculation, every campaign leaflet, every doorstep conversation.

Once the goal of leaving the EU was achieved, the party collapsed under the weight of its own irrelevance, but the effect remains. I’ll call it The UKIP Effect. A reminder that in politics, influence isn’t always measured in seats. Sometimes it’s measured in the shadows you cast.

What’s the lesson for similar small parties with large goals?

“Ever singing, marching onwards, victors in the midst of strife. Joyful music leads us sunward, the triumphant song of life.”

Spoil to Win!

The UKIP effect is not for the faint-hearted. It demands conviction so strong that you’re willing to risk empowering your ideological opponents to make your point unavoidable.

It’s a kind of political brinkmanship. You stand on the edge and yell “No Compromises!” If you do it loudly enough, consistently enough, the big parties start to twitch. Not because you’ll win but because you’ll make them lose.

For The Party of Women, The Reclaim Party and The Jeremy Corbyn People’s Front, the lesson is clear but uncomfortable. If you want to shift the narrative, you must be willing to spoil it. That means resisting tactical voting and accepting that your vote might help elect someone you oppose — you’re playing the long game. It’s about changing the menu, not choosing from it.

It only works if your core policy is sharp, singular, and resonant. UKIP had one idea, to leave the European Union. Everything else was window dressing. That clarity gave them gravitational pull. Without it, you’re just another star in the political sky.

The question for small parties is “What are we willing to lose to make our idea unavoidable?”

And maybe — just maybe — the answer is everything.

Credits:
📸 “Cats Eyes” by Ivan Phan. (Creative Commons)
📸 “Haunting Resilience” by Dr Partha Sarath Sahana. (Creative Commons)
👥 Thanks to my friends Andrew Williams and Heather McKee for their feedback.
🤖 Thanks to Microsoft Copilot for reviewing my drafts, random philosophical mischief and taking a break from destroying all humanity.

Dear string-to-integer parsers…

These are very useful functions that any language with distinct string and integer types will include in their standard library. Pass in a string with decimal digits and it’ll return the equivalent in the binary integer form that you can do mathematics with.

I’d like to make a modest proposal that I’d find very useful, and maybe you, dear reader, would too.

“The rich man in his castle, the poor man at his gate. He made them, high or lowly, and ordered their estate.”

Who me?

Specifically, I’m thinking of parser functions that work like this…

ParseInt("123");      // 123.
ParseInt("-456");     // -456.
ParseInt("Rutabaga"); // Rejected.

Note that by “rejected”, it could mean anything in practice as long as the response is distinct from returning a number. Maybe it throws an exception, maybe it returns null, maybe it also returns a Boolean to tell you if the string value was valid or not.

Point is, I’m thinking of parser functions that have two distinct kinds of result. A success result that includes the integer value, or a rejection result. No half-way results.

I will acknowledge that there are standard library functions that will keep going along the string gobbling digits, until it hits a non-digit and the response tells the caller what number it found and where that first non-digit is. Those are very useful for tokenizing loops as part of compilers, but my idea would break that interface too much. If that’s your variety of parser, sorry, but this post isn’t for you.

Also, I’m thinking of functions that parse as decimal. Maybe you have optional flags that allow you to specify what base to use, but it parses as decimal by default. I’m concerned only with the decimal mode of operation.

Round Numbers and “E” Notation

You might be familiar with “E” notation if you work with very large or very small floating point numbers. This is a shorthand for scientific notation where the letter E translates to “times ten to the power of”.

FloatParse("1E3");    // 1000.0
FloatParse("5E-3");   // 0.005
FloatParse("1E+100"); // One Googol.

This notation is handy for decimal round numbers. If you want to type in a billion, instead of having to count as you press the zero key on your keyboard over and over, you could instead type “1E9”. Which one of the following numbers is a billion? Can you tell at a glance?

100000000 10000000000 1000000000

The problem is that E notation is stuck in the floating-point world. I’d really like it if anywhere I could type an integer (such as in an electronic form) and I want to type a large round number, I could use E notation instead.

For that to work, the functions that convert strings to integers need to allow this.

Pinning it down

Okay, we’re all software engineers here. Let’s talk specifics.

If the string supplied to the function is of the form (mantissa)"E"(exponent), with the mantissa in the range 1-9 and the exponent from zero to however high your integer type gets, then instead of rejecting the string, return the integer value this E notation string represents.

Add the usual range checks (for example, 9E18 for a signed 64-bit integer) and do the right thing when there’s a minus sign character at the start and we’re done.
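
To make that concrete, here’s a rough Python sketch of the behaviour I’m proposing. (The function name is mine, and where the strict proposal limits the mantissa to a single digit 1-9, this sketch is a little more permissive.)

def parse_int_e(text):
    # Plain decimal strings parse as before; (mantissa)E(exponent) is also accepted.
    mantissa, sep, exponent = text.strip().upper().partition("E")
    if not sep:
        return int(mantissa)                    # ordinary decimal parse
    exp = int(exponent)
    if exp < 0:
        raise ValueError("negative exponents have no place in an integer")
    value = int(mantissa) * 10 ** exp
    if not -2**63 <= value < 2**63:             # the usual range check for a signed 64-bit integer
        raise ValueError("out of range")
    return value

print(parse_int_e("-456"))   # -456
print(parse_int_e("1E9"))    # 1000000000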

“But there might be code depending on values like that being rejected!”

That’s a fair concern. I am advocating for a change in behaviour in the standard library after all.

I am seeking only to change behaviour in the domain of inputs that would otherwise produce a rejection response.

If IntParse("1E3") used to return a rejection, but now it returns 1000, is that a bad thing? The user can already type "1000" but this time they wrote "1E3" instead. What’s the harm in carrying on as if they typed 1000 all along?

I can think of some pathological cases. Maybe the programmer wanted to limit an input to 1000, but instead of using the less-than operator on the integer like a normal person, they check that the length of the string is less than 4. "1E9" would pass validation but a billion would be returned. It seems unlikely that anyone would do that in practice.

The parser function might be used not to actually use the integer returned, but instead act as a validator. You have a string and you want to know if the string is a valid sequence of decimal digits or not. If that’s what you need, the integer-parser is maybe the wrong tool for that. Parsers will already be a little flexible about the range of allowable inputs, allowing leading plusses or zero digits and commas grouping digits into triples. If you care that a string is actually the one canonical ASCII representation of a number or not, then I would follow the parse with a test converting the integer back into a string and checking it matches the input string.

“E might be a hex digit.”

Your function returns the number 7696 for the input "1E10" and not ten billion? What you’ve got there is a hex parser, not a decimal parser. E notation only makes sense in the world of decimal numbers.

If your decimal parser automatically switches to hex parsing if it sees ‘A’ to ‘F’ characters, then you’ve got a parser that’s unreliable for hex number strings. A lot of hex numbers contain only the ‘0’ to ‘9’ digits. If my code gets a hex number as input, I’m going to call the hex parser. Some supposed general purpose parser isn’t going to know if "1000" should return 1000, 4096 or 8 and will need to be told.

While we’re on the subject of hex numbers, I may be following this up with a proposal that “H” should mean “times 16 to the power of” in a similar style, but that’ll be for another day.

 “Delores, I live in fear. My love for you is so overpowering. I’m afraid that I will disappear.”

“Because counting to nine is really hard”

So there’s my suggestion. In short, I’m fed up of having to count to nine when I want to type a billion and having to check by counting the little row of identical ovals on the screen. I look forward to comments telling me how wrong I am.

Picture Credits
📸 “Swift” by Tristan Ferne. (Creative Commons.)
📸 “Kibo Summit, Mount Kilimanjaro, Tanzania” by Ray in Manila. (Creative Commons.)

(Also, a billion is a one followed by nine zeros. Anyone who says it has twelve zeros is quite wrong.)

What type of UUID should I use?

UUIDs, Universally Unique IDs, are handy 128-bit IDs. Their values are unique, universally, hence the name.

(If you work with Microsoft, you call them GUIDs. I do primarily think of them as GUIDs, but I’m going to stick with calling them UUIDs for this article, as I think that name is more common.)

These are useful for IDs. Thanks to their universal uniqueness, you could have a distributed set of machines, each producing their own IDs, without any co-ordination necessary, even completely disconnected from each other, without worrying about any of those IDs colliding.

When you look at a UUID value, it will usually be expressed in hex and (because reasons) in hyphen-separated groups of 8-4-4-4-12 digits.

xxxxxxxx-xxxx-7xxx-xxxx-xxxxxxxxxxxx

You can tell which type of UUID it is by looking at one digit: the first digit of the middle four-digit block, shown as the 7 above. That digit always tells you which type of UUID you’re looking at. This one is a type 7 because that hex digit is a 7. If it was a 4 it would be a type 4.

As I write this, there are 8 types to choose from. But which type should you use? Type 7. Use type 7. If that’s all you came for, you can stop here. You ain’t going to need the others.

Type 7 – The one you actually want.

This type of UUID was designed for assigning IDs to records on database tables.

The main thing about type 7 is that the first block of bits is a timestamp. Since time always goes forward [citation needed] and the timestamp is right at the front, each UUID you generate will have a bigger value than the last one.

This is important for databases, as they are optimized for “ordered” IDs like this. To oversimplify it, each database table has an index tracking each record by its ID, allowing any particular record to be located quickly by flipping through the book until you get close to the one you wanted. The simplest place to add a new ID is to add it on the end and you can only do that if your new ID comes after all the previous ones. Adding a new record anywhere else will require that index to be reorganised to make space for that new one in the middle.

(You often see UUIDs criticised for being random and unordered, but that’s type 4. Don’t use type 4.)

The timestamp is 48 bits long and counts the number of milliseconds since the year 1970. This means we’re good until shortly after the year 10,000. Other than the 6 bits which are always fixed, the remaining 74 bits are randomness which is there so all the UUIDs created in the same millisecond will be different. (Except it is a little more complicated than that. Read the RFC.)
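
To get a feel for that layout, here’s a simplified Python sketch of a type 7 generator. (It’s a toy illustration of the bit positions, not a replacement for a proper library, and it skips the RFC’s optional same-millisecond counters.)

import os, time, uuid

def uuid7():
    # 48-bit millisecond timestamp at the top, version and variant bits fixed,
    # everything else filled with randomness.
    value = (time.time_ns() // 1_000_000) << 80        # timestamp into the top 48 bits
    value |= int.from_bytes(os.urandom(10), "big")     # 80 bits of randomness below it
    value = (value & ~(0xF << 76)) | (0x7 << 76)       # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)       # variant bits = binary 10
    return uuid.UUID(int=value)

print(uuid7())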

So there we are. Type 7 UUIDs rule, all other types drool. We done?

“I was born in a flame. Mama said that everyone would know my name. I’m the best you’ve ever had. If you think I’m burning out, I never am.”

Migrating from auto-incrementing IDs.

Suppose you have an established table with a 32-bit auto-incrementing integer primary key. You want to migrate to type 7 UUIDs but you still need to keep the old IDs working. A user might come along with a legacy integer ID and you still want to allow that request to keep working as it did before.

You could generate a batch of new type 7 UUIDs and build a new table that maps the legacy integer IDs to their new UUIDs. If that works for you, that’s great, but we can do without that table with a little bit of cleverness.

Let’s think about our requirements:

  1. We want to deterministically convert a legacy ID into its UUID.
  2. These UUIDs are in the same order as the original legacy IDs.
  3. New records’ UUIDs come after all the UUIDs for legacy records.
  4. We maintain the “universally unique”-ness of the IDs.

This is where we introduce type 8 UUIDs. The only rule of this type is that there are no rules. (Except they still have to be 128 bits and six of those bits must have fixed values. Okay, there are a few rules.) It is up to you how you construct this type of UUID.

Given our requirements, let’s sketch out how we want to lay out the bits of these IDs.

The type 7 UUIDs all start with a 01 byte, until 2039 when they will start 02. They won’t ever start with a 00 byte. So to ensure these IDs are always before any new IDs, we’ll make the first four hex digits all zeros. The legacy 32-bit integer ID can be the next four bytes.

Because we want the UUIDs we create to be both deterministic and universally-unique, the remaining bits need to look random but not actually be random. Running a hash function over the ID and a fixed salt string will produce enough bits to fill in the remaining bits.

Now, to convert a legacy 32-bit ID into its equivalent UUID, we do the following:

  1. Start an array of bytes with two zero bytes.
  2. Append the four bytes of legacy ID, most significant byte first.
  3. Find the SHA of (“salt” + legacy ID) and append the first 10 bytes of the hash to the array.
  4. Overwrite the six fixed bits (in the hash area) to their required values.
  5. Put the 16 bytes you’ve collected into a UUID type.

And there we have it. When a user arrives with a legacy ID, we can deterministically turn it into its UUID without needing a mapping table or conversion service. Because of the initial zero bytes, these UUIDs will always come before the new type 7 UUIDs. Because the legacy ID bytes come next, the new UUIDs will maintain the same order as the legacy IDs. Because 74 bits come from a hash function with a salt as part of its input, universal-uniqueness is maintained.
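
Here’s a small Python sketch of those five steps. (I’ve picked SHA-256 for the hash and a placeholder salt string; both are arbitrary choices for illustration.)

import hashlib, uuid

def legacy_id_to_uuid(legacy_id, salt="my-fixed-salt"):
    data = bytearray(2)                                   # 1. two zero bytes
    data += legacy_id.to_bytes(4, "big")                  # 2. legacy ID, most significant byte first
    digest = hashlib.sha256((salt + str(legacy_id)).encode()).digest()
    data += digest[:10]                                   # 3. first 10 bytes of the hash
    data[6] = (data[6] & 0x0F) | 0x80                     # 4. version nibble set to 8...
    data[8] = (data[8] & 0x3F) | 0x80                     #    ...and variant bits set to 10
    return uuid.UUID(bytes=bytes(data))                   # 5. the 16 bytes become a UUID

print(legacy_id_to_uuid(12345))   # always the same UUID for legacy ID 12345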

What’s that? You need deterministic UUIDs but it isn’t as simple as dropping the bytes into place?

“You once thought of me as a white knight on his steed. Now you know how happy I can be.”

Deterministic UUIDs – Types 3 and 5.

These two types of UUID are the official deterministic types. If you have (say) a URL and you want to produce a UUID that represents that URL, these UUID types will do it. As long as you’re consistent with capital letters and character encoding, the same URL will always produce the same UUID.

The down-side of these types is that the UUID values don’t even try to be ordered, which is why I wrote the discussion of type 8 first. If the ordering of IDs is important, such as using them as primary keys, maybe think about doing it a different way.

Generation of these UUIDs works by hashing together a “namespace” UUID and the string you want to convert into a UUID. The hash algorithm is MD5 for type 3 or SHA1 for type 5. (In the case of SHA1, everything after the first 128 bits of hash is discarded.)

To use these UUIDs, suppose a user makes a request with a string value, you can turn that string into a deterministic UUID by running it through the generator function. That function will have two parameters, a namespace UUID (which could be a standard namespace or one you’ve invented) and the string to convert. That function will run the hash function over the input and return the result as a UUID.
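
Python’s standard library has these built in, which makes for a quick illustration (the URL here is just an example):

import uuid

# A type 5 (SHA1) UUID for a URL, using the standard URL namespace.
print(uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/some/page"))
# Running it again with the same inputs always prints the same UUID.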

These UUID types do the job they’re designed to do. Just as long as you’re okay with the values not being ordered.

Type 3 (MD5) or Type 5 (SHA1)?

There are pros and cons to each one.

MD5 is faster than SHA1. If you’re producing them in bulk, that may be a consideration.

MD5 is known to be vulnerable to collisions. If you have (say) a URL that hashes to a particular type 3 UUID, someone could construct a different URL that hashes to the same UUID. Is that a problem? If you’re the only one building these URLs that get hashed, then a hypothetical doer of evil isn’t going to get to have their bad URL injected in.

Remember, the point of a UUID is to be an ID, not something that security should be depending upon. Even the type 5 UUID throws away a big chunk of the bits produced, leaving only 122 bits behind.

If you want to hash something for security, use SHA256 or SHA3 and keep all the bits. Don’t use UUID as a convenient hashing function. That’s not what it’s for!

On balance, I would pick type 5. While type 3 is faster, the difference is trivial unless you’re producing IDs in bulk. You might think that MD5 collisions are impossible with the range of inputs you’re working with, but are you quite sure?

“I’ve seen this thing before, in my best friend and the boy next door. Fool for love and fool on fire.”

Type 4 – The elephant in the room

A type 4 UUID is one generated from 122 bits of cryptographic quality randomness. Almost all UUIDs you see out there will be of this type.

Don’t use these any more. Use type 7. If you’re the developer of a library that generates type 4 UUIDs, please switch it to generating type 7s instead.

Seriously, I looked for practical use cases for type 4 UUIDs. Everything I could come up with was either better served by type 7, or both types came out as the same. I could not come up with a use-case where type 4 was actually better. (Please leave a comment if you have one.)

Except I did think of a couple of use-cases, but even then, you still don’t want to use type 4 UUIDs.

Don’t use UUIDs as secure tokens.

You shouldn’t use UUIDs as security tokens. They are designed to be IDs. If you want a security token, you almost certainly have a library that will produce them for you. The library that produces type 4 UUIDs uses one internally.

When you generate a type 4 UUID, six bits of randomness are thrown away in order to make it a valid UUID. It takes up the space of a 128 bit token but only has 122 bits of randomness.

Also, you’re stuck with those 122 bits. If you want more, you’d have to start joining them together. And you should want more – 256 bits is a common standard length for a reason.

But most of all, there’s a risk that whoever wrote the library that generates your UUIDs will read this article and push out a new version that generates type 7 UUIDs instead. Those do an even worse job as security tokens.

I’m sure they’d mention it in that library’s release notes but are you going to remember this detail? You just want to update this one library because a dependency needs the new version. You tested the new version and it all works fine but suddenly your service is producing really insecure tokens.

Maybe the developers of UUID libraries wouldn’t do that, precisely because of the possibility of misuse, but that’s even more reason to not use UUIDs as security tokens. We’re holding back progress!

In Conclusion…

Use type 7 UUIDs.

“Only to find the night-watchman, unaware of his presence in the building.”

Picture Credits.
📸 “Night Ranger…” by Doug Bowman. (Creative Commons)
📸 “Cat” by Adrian Scottow. (Creative Commons)
📸 “Cat-36” by Lynn Chan. (Creative Commons)
📸 “A random landscape on a random day” by Ivo Haerma (Creative Commons)
📸 “Elena” by my anonymous wife. (With Permission)

I want a less powerful programming language for Christmas.

I’m writing this because I’m hoping someone will respond, telling me that what I want already exists. I have a specific itch and my suspicion is that developing a whole programming language and runtime is the only way to scratch that itch.

Please tell me I’m wrong.

Dear Father Christmas…

If you’ve ever written a web service, you’ve almost certainly had situations where you’ve taken a bunch of bytes from a completely untrusted stranger and passed those bytes into a JSON parser. What’s more you’ll have done that without validating the bytes first.

Processing your inputs without sanitizing them first? Has Bobby Tables taught us nothing?

You can do this safely because that JSON parser will have been designed to be used in this manner and will be safe in the face of hostile inputs. If you did try feeding the bytes of an EXE file into a JSON parser, it’ll very quickly reject it complaining that “MZ” isn’t an opening brace and refuse to continue beyond that. The worst a hostile user could do is put rude messages inside the JSON strings.

{ "You": "A complete \uD83D\uDC18 head!" }

Now take that idea and think about what would happen if you had a web service where completely unauthenticated users could send any request body they liked and your service would run that request body as Python source code.

Hopefully, you’ve just now remarked that it would be a very bad idea, up there with Napoleon’s idea to make his brother the King of Spain. But that’s exactly what I want to do. I want to write a web service that accepts Python code from complete strangers and actually run that code.

(And also make my brother the King of Spain. He’d be great!)

“Hang on to your hopes, my friend. That’s an easy thing to say. But if your hopes should pass away, simply pretend that you can build them again.”

At the gates of dawn

Some time in the early 90s, I had a game called “C Robots”.

This is a game where four tanks are in an arena, driving around and firing missiles at each other. But instead of humans controlling those tanks, each tank was controlled by a program written by the human player. The game controller would keep track of each tank and any missiles in flight, passing back control to each tank’s controller program to let it decide what its next move will be.

For 90s me, programming a robot appealed to me but the tank battle part did not appeal so much. I really wanted to make a robot to play other games that might not involve tanks. At the time, there were two games I enjoyed playing with school friends, Dots-and-Boxes and Rummy. I had an idea of what made good strategies for these specific games, so I thought building those strategies into code might make for a good intellectual exercise.

Decades passed and I built a simple game controller system which I (rather pompously) called “Tourk”. I had made a start on the controllers for a handful of games but I hadn’t gotten around to writing actual competitive players, only simple random ones that were good for testing. I imagined that before long, people would write their own players, send them in to me and I’d compile them all together. After I’d let it run for a million games in a tournament I’d announce the winner.

If anyone had actually written a player and sent it in, my first step would have been to inspect the submitted code thoroughly. These would have been actual C programs and could have done anything a C program could do, including dropping viruses on my hard disk, so inspecting that code would have been very important. Looking back, I’m glad no-one actually did that.

But this was one thing C Robots got right, even if it wasn’t planned that way. Once it compiled the player’s C code, it would run that code in a restricted runtime. Your player code could never go outside its bounds because there are no instructions in the C Robots runtime to do that. This meant that no-one could use this as an attack vector. (But don’t quote me on that. I’ve not actually audited the code.)

“I never ever ask where do you go. I never ever ask what do you do. I never ever ask what’s in your mind. I never ever ask if you’ll be mine.”

Will the runtime do it?

Could maybe the dot-net runtime or the Python runtime have the answer?

This was one of the first questions I asked on the (then) new Stack Overflow. The answer sent me to Microsoft’s page on “Code Access Security” and if you follow that link now, it says this feature is no longer supported.

Wondering more recently if Python might have an option to do what I wanted, I asked on Hacker News if there was a way to run Python in the way I wanted. There were a few comments but it didn’t get enough up-votes and disappeared fairly quickly. What little discussion we had was more to do with a side issue than the actual question I was asking.

I do feel that the answer might still be here. There’s quite possibly some flag on the runtime that will make any call to an extern function impossible. The Python runtime without the “os” package would seem to get 90% of the way there, but I don’t know enough about it to be certain enough that this won’t have left any holes open.

“We’re all someone’s daughter. We’re all someone’s son.”

Sanitize Your inputs?

Maybe I should listen to Bobby Tables and sanitize my inputs before running them.

Keep the unrestricted runtime, but before we invoke it to run the potentially hostile code, scan it to check it won’t do any bad things.

Simple arithmetic in a loop? That’s fine.
Running a remote access trojan? No.

Once the code has passed the test, you should be able to allow it to run, confident it won’t do anything bad because you’ve already checked it won’t. This approach appeals to me because once that initial check for non-hostility has passed, we can allow the runtime to go at full speed.

The problem with this approach is all the edge cases: finding that line between simple arithmetic and remote-access trojans. You need to allow enough for the actually-not-hostile code to do useful things, but not enough that a hostile user could exploit it.

Joining strings together is fine but passing that string into eval is not.
Writing text to stdout is fine but writing into a network socket is not.

Finding that line is going to be difficult. The best approach would be to start with nothing-is-allowed, but when considering what to add, first investigate what would be possible by adding that facility to the allowed list. Because it can be used for bad things, eval would never be on that allowed list.

If there’s a function with a million useful things it can do but one bad thing, that function must never be allowed.

“We can go where we want to. A place they’ll never find. We can act like we come from out of this world and leave the real one far behind.”

Ask the Operating System?

I told a colleague about this post while I was still writing it and he mentioned that operating systems can place restrictions on the programs they run. He showed me his Mac and there was a utility that listed all the apps he was running and all the permissions they had. It reminded me that my Android phone does something similar. If any app wants to interact with anything outside its realm, it has to ask first. This is why I’m happy to install apps on my Android phone but not on my Windows laptop.

This would be great, but how do I, a numpty developer, harness this power? What do I do if I want to launch a process (such as the Python runtime) but with all the permissions turned off? It feels like this will be the solution but my searching isn’t coming up with a practical answer.

My hope is that there’s a code library whose job it is to launch processes in this super restricted mode. It’ll work out which OS it is running on, do the necessary magic OS calls and finally launch the process in that super-restricted mode.

“If I was an astronaut I’d be floating in mid air. A broken heart would just belong to someone else down there. I would be the centre of my lonely universe. I’m only human and I’m crashing in the dark.”

Mmmm coffee!

The good people developing web browsers back in the 90s had the same need as me. They wanted to add a little interactivity to web pages, but without having to wait for a round trip back to the server over dialup, so they came up with a language they named JS.

As you read this page, your browser is running some code I supplied to you. That code can’t open up your files on your local device. If anyone did actually find a way to do that, the browser developers would call that a serious bug and push out an emergency update. So could JS be the solution I’m looking for?

As much as it sounds perfect, that JS runtime is inside the browser. If I have some JS code in my server process, how do I get that code into a browser process? Can I even run a web browser on a server without some sort of desktop environment?

The only project I know of where someone has taken JS outside of a browser is node-js. That might be the answer but I have written programs using node-js that load and save files. If this is the answer then I’d need to know how to configure the runtime to run the way I want.

“Play the game, fight the fight, but what’s the point on a beautiful night? Arm in arm, hand in hand. We all stand together.”

Is there an answer?

I began this post expressing my suspicion that the solution is to write my own runtime, designed from first-principles to run in a default-deny mode. I still wonder if that’s the case. I hope someone will read this post and maybe comment with the unknown option on the Python runtime that does exactly what I want.

In the meantime, I have another post in the works with my thoughts on how this runtime and programming language could work. I hope I can skip it.

Gronda-Gronda.

Picture Credits
📸 “Snow Scot” by Peeja. (With permission.)
📸 “Meeting a Robot” by my anonymous wife. (With permission)
📸 “Great Dane floppy ears” by Sheila Sund. (Creative Commons)
📸 “Fun with cling film” by Elizabeth Gomm. (Creative Commons)
📸 “Rutabaga Ball 2” by Terrence McNally. (Creative Commons)
📸 “Nice day for blowing the cobwebs off” by Jurassic Snark. (With permission.)

(And just in case advocating for your brother to be made King of Spain is treason or something, I don’t actually want to do that. It was a joke.)

Why do we repeatedly hash passwords in a loop?

If you’re building a website that allows the public to log in, you need to store passwords so you can check your users are who they say they are when they log in. This is my introduction to the current state of the art for storing your users’ passwords in your database.

Make It Someone Else’s Problem

I’ll say this right up front: the best way is to get someone else to do it. Use an outsourced service or install a component that deals with the whole thing. You’ll have passed responsibility to someone whose very speciality is knowing everything I’ve written here, as well as all the nuances I’ve skipped over.

But that’s not always acceptable. Sometimes you need to build your own system.

“Knock three times on the ceiling if you want me.
Twice on the pipe, if the answer is no.”

Doing it wrong – Store the password

We’ll start with various wrong ways to do it and build up to the right way.

The first wrong way is to store the password in the clear in your database. You’ll have a table of users, add a string field called “password” and store it there. When a user comes along to log in, you compare the supplied password with the actual password and if they match you let the user in.

One problem is that your user database might leak and all your users’ passwords are right there. Do you trust all your insiders? Are you quite sure that all components of your system are leakproof? There’s little you can do to stop a trusted insider from having a peek at just one user’s record. What are you going to do, have no trusted insiders?

“Enemy lasagne. Robust below wax. Semiautomatic aqua. Accompany slacks. White coffee gymnastic. Motorcycle unibrow.
Existential plastic. Extra nightly cow.”

Better but still wrong – Hash the password first

If the problem is that someone knows what everyone’s password is, the solution is for no-one to know what anyone’s password is. As luck would have it, there’s a branch of cryptography that’s perfect for this – the hash function. Instead of storing the password in the clear, store a hash of it instead.

A hash function takes a string of characters and mixes them all up in a repeatable way. Unlike encryption, there isn’t a key and you can’t get the original text back. For example, the SHA1 hash of “rutabaga” is “C8A52CE9 1ED32187 38D43809 B31856AB 619E0ABE”. This will be the same today, tomorrow and forever.
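
You can check this yourself with Python’s standard hashlib module (a two-line sketch):

    import hashlib

    # The same input always gives the same digest – today, tomorrow and forever.
    print(hashlib.sha1(b"rutabaga").hexdigest().upper())  # the hash quoted above, minus the spaces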

The first time a user registers with your service, they supply you the password they want to use, but before writing it to the database, you run a hash over the supplied password and store the result of the hash instead. Later, the same user comes back and types in their password again. You run the hash over the supplied password and compare it against the hash in your database. If they match, let the user in.

The other useful property of a hash function is that it is irreversible. There’s no secret key to go from “C8A52CE9…” back into “rutabaga”. If all you have is the hash, the original text is lost. Now, if an attacker gets a copy of the user database, they have a problem. All they have is the result of the hash and there’s no way to get the original password back from that – and that’s what you need to log in.
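
As a bare-bones sketch of that flow (still the flawed scheme of this section – no salt, no stretching – with an in-memory dict standing in for the database; the helper names are mine):

    import hashlib

    users = {}  # username -> hex SHA1 of their password

    def register(username, password):
        users[username] = hashlib.sha1(password.encode()).hexdigest()

    def check_login(username, password):
        # Hash whatever the user typed and compare it to the stored hash.
        # The clear-text password is never written anywhere.
        return users.get(username) == hashlib.sha1(password.encode()).hexdigest()

    register("alice", "rutabaga")
    print(check_login("alice", "rutabaga"))  # True – let them in
    print(check_login("alice", "swede"))     # False – no entry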

“Music’s on, I’m waking up, we fight the fire, then we burn it up,
and it’s over now, we got the love, there’s no sleeping now.”

Except you can reverse a hash.

The Bad Guys: “Tell us the original password that produced this hash result!”
Hash Functions: “We’re designed for that to be impossible.”
The Bad Guys: “Really?”
Hash Functions: “Yes. You’d literally have to try every single possible input and store the result in a lookup table.”
The Bad Guys: “Okay, we’ll do that.”
Hash Functions: “Wait, what?”

Hash functions are designed to be one-way. There’s no hint of what the original text could have been because none of that information survives. But there’s a way around that detail.

A problem with humans is that we are predictable in how we think of passwords. We like words from the dictionary, keyboard patterns and so on. From this knowledge, we can make a list of all these likely potential passwords, then find the hash of each likely password, storing the original text against each hash. This might sound like a lot of computation but we only need to do it once.

Finally, the clever bit, sort the list by the hash.

There we have it. The Bumper Book of Password Hashes. Each hash, one per line, with the text that went in to produce that hash next to it.

     The Bumper Book of Password Hashes - SHA1 Edition
C8A52CE9062E654D02D08B9AE56BE5A16A3C7663 =)Ve06Va
C8A52CE90DCA962E41A8E164EB649207206E553B h30/4h50
C8A52CE91C77FB87893CA977353A65F8C406AA69 Ds?F8Jjj
C8A52CE91DBEF9713D61537840CC58F0D8D4B3E9 HPpxLGT/mevs
C8A52CE9295EA07D4AD52A1DF84D442E3E106A37 7-KDA-)0:0aF
C8A52CE92A077F5A2944D4E20A2953FDF56570F0 oG6Ksdc
C8A52CE9351C7D852B09CAE66B1B0D9DB204838A =C0V/5et9s
C8A52CE93B4B6AE01A8985C2FE96371967A40DCB -j0880YA3b
C8A52CE9426DC99277D114CAB37971B65D18F8B9 a^cY=e3%u67
C8A52CE9451C0944A561CC5E76D0D62C61083A56 4UJKQLwhuQ
C8A52CE950C8A276987097569EB248D2E4D68EB9 hTu3sbX3g
C8A52CE958C75B126B6D9772D1C430DF6B5CC785 V7Qej5q8Ly3r
C8A52CE962606E0ED8617AD9A6C8C9C84FF202FE rUEOy6ZW
C8A52CE968E0BEC0CEF5E1D93AF7EFD1987C60CF =hL)F#sDN08r
C8A52CE97214314C4DE54168B6D5F7CCEDF35D3E NXd241ts
C8A52CE9733B9EED59E95F3A0BCA6594B5BB0841 N0KjP2n7j
C8A52CE98E9DA6676C5B0009312A9EF289305236 ue52C^Jc0aA)2N#
C8A52CE992F14E7020DC40896AB929D838A118F3 1s/2J00HT)Xt#t5
C8A52CE9A4BF120810B7D9B24F77031184CCF01C 06PeP)r8cr
C8A52CE9A9D6D36FA9A1BC2D376A91B221DE83B2 c8DL?Tbr)23:t*
C8A52CE9B37385A2CC1894A083E87ACD2EDCE026 z0VoZ/Sw1orL
C8A52CE9CCC4088AFEAD6534B827FDB657706EA9 nnNeYZLxeg
C8A52CE9CD75EB936FA3B0EEED25B1322C913996 k0StwVCnwA
C8A52CE9DE11A6B3739D726FE29B067DC1DD470C KL%du)YF
C8A52CE9E99069CC192876B00788632AE75965E6 mdVg/C2Y
C8A52CE9ED8DE406CD60F95D5B1B64CD3C3BF1AC DnY73:8e
C8A52CE9F8CE32484D73B7B179048E3FB91061EB 4#cN6bYVV)b#*^9
C8A52CE9FE3FDC64F6F088D2DC41EB85CF97D465 Y866(-)5
                                       Page 3,366,268,137

Suppose you’ve got someone’s password hash which starts with C8A52CE9 and you want to know what password produces that particular hash. Grab the book and flick through until you get to the pages with all the hashes that begin C8A52CE9. If it was included in the original set, the original password will be listed right there.
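
A pocket-sized version of the book, sketched in Python. A dict keyed by digest plays the part of the sorted pages, and the candidate list here is obviously far shorter than the billions of pages above:

    import hashlib

    # Hash every likely password once, indexed by digest for instant lookup.
    likely_passwords = ["password", "letmein", "qwerty", "123456", "rutabaga"]
    book = {hashlib.sha1(p.encode()).hexdigest(): p for p in likely_passwords}

    leaked_hash = hashlib.sha1(b"rutabaga").hexdigest()
    print(book.get(leaked_hash, "not in the book"))  # rutabaga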

(This technique is better known as a “Rainbow Table”. My name is better.)

A popular service for looking up password hashes is known as Google. You might have heard of it.

Google search for a hash result, returning "rutabaga" as the obvious source of the hash.
“Full moon in the city and the night was young. I was hungry for love, I was hungry for fun. I was hunting you down and I was the bait. When I saw you there I didn’t mean to hesitate.”

Good but not quite done – Salt

A way to make the Bumper Book of Password Hashes obsolete is to add “salt” to the hash. Instead of hashing only the password, add some random bytes into the mix and hash the combination of the salt and the password together.

The book might list the hash of “rutabaga”, but it isn’t going to list the hash of “(lots of randomness)rutabaga”. That simple act of adding some random bytes means the book is now useless.

If an attacker manages to find a leaked copy of the user database, they will be able to start guessing and checking on their own. If you make sure each user has different salt bytes, then any computational effort the attacker does make is only good for one single user. Even if an attacker found the password of one user, there’s nothing to bring forward to attack the next user. Even if both users use the same password, the attacker has to start again.
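
A sketch of the salted version, still using plain SHA1 for continuity with the examples above (the stretching in the next section is what you’d actually reach for; the helper names are mine):

    import hashlib
    import os

    def hash_password(password, salt=None):
        # Each user gets their own random salt, stored alongside the hash.
        salt = salt if salt is not None else os.urandom(16)
        digest = hashlib.sha1(salt + password.encode()).hexdigest()
        return salt, digest

    def verify(password, salt, expected_digest):
        return hash_password(password, salt)[1] == expected_digest

    salt_a, hash_a = hash_password("rutabaga")
    salt_b, hash_b = hash_password("rutabaga")
    print(hash_a == hash_b)                    # False – same password, different salt
    print(verify("rutabaga", salt_a, hash_a))  # True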

Hopefully that extra effort buys enough time for the service admins to realise the leak has happened and start replacing passwords.

How long? Let’s make their job even harder.

“Matthew and Son, the work’s never done, there’s always something new. The files in your head, you take them to bed, you’re never ever through. And they’ve been working all day, all day, all day!”

The state of the art – Password Stretching

Through the long journey, we’ve arrived at the current state of the art.

These are open standards and your web platform almost certainly has a library that implements most of them. This article isn’t going to recommend one over another. We’ll just say they’re all pretty darn good except for the ones which are not. (Okay, start by searching for “PBKDF2” and see where it leads you.)

The hash functions we’ve encountered so far are fast. They’re designed that way. For passwords, what we really want is something slow. You might think being deliberately slow is a bad thing, but let’s follow this rabbit down the hole.

Instead of a nice fast hash like SHA1, we’re going to use SHA1000. It’s just like SHA1 in terms of being one-way and such. The difference is that it is so badly designed it takes a thousand times more processing time to finish.

So why on earth would we use such a badly designed hash? The answer is that not only do you have to spend the processing time running it, so does your attacker. They were already looking at spending a large amount of processing time going through every word in the dictionary looking for a password. By using SHA1000 instead, you’ve just multiplied their workload by a thousand!

These password stretching algorithms aren’t actually badly designed hashes, but they are configurable for how difficult you want them to be. PBKDF2 can be set to have a number of rounds. One round is the same workload as SHA1. Three hundred thousand rounds is a lot more.

Imagine you’re storing your passwords with PBKDF2 set to 300,000 rounds and each user has a unique salt. When a user logs in, you look up that user’s salt and start running the PBKDF2 code for 300,000 loops with the supplied password. If the end result matches the expected result, you allow the user in.
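
Python’s standard library has this built in as hashlib.pbkdf2_hmac. A minimal sketch of the register-and-check flow just described, using SHA1 as the underlying hash to match the running example (other choices work the same way):

    import hashlib
    import os

    ROUNDS = 300_000  # the work factor

    def stretch(password, salt, rounds=ROUNDS):
        return hashlib.pbkdf2_hmac("sha1", password.encode(), salt, rounds)

    # Registration: pick a fresh salt and store (salt, rounds, stretched hash).
    salt = os.urandom(16)
    stored = stretch("rutabaga", salt)

    # Login: re-run the same 300,000 rounds over whatever the user typed.
    print(stretch("rutabaga", salt) == stored)  # True  – let the user in
    print(stretch("swede", salt) == stored)     # False – no entry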

An attacker with a leaked copy of each user’s salt and expected hash can still start guessing and checking over and over: try each word in the dictionary and see if the result matches the expected result. But now the attacker is faced with a ridiculous amount of computer time to go through all of that.

Now we’ve caught up, let’s head over to part two.

Picture Credits:
📸 “Password” by mk_is_here.
📸 “Equal in stature” by Kevin Dooley.
📸 “IMG_3310” by oFace Killah.
📸 “Entropy” by Robert Nunnally.
📸 “A rainbow in salty air” by torne.