What type of UUID should I use?

UUIDs, Universally Unique Identifiers, are handy 128-bit IDs. Their values are unique, universally, hence the name.

(If you work with Microsoft, you call them GUIDs. I do primarily think of them as GUIDs, but I’m going to stick with calling them UUIDs for this article, as I think that name is more common.)

These are useful for IDs. Thanks to their universal uniqueness, you could have a distributed set of machines, each producing its own IDs with no co-ordination necessary, even completely disconnected from each other, without worrying about any of those IDs colliding.

When you look at a UUID value, it will usually be expressed in hex and (because reasons) in hyphen-separated groups of 8-4-4-4-12 digits.

xxxxxxxx-xxxx-7xxx-xxxx-xxxxxxxxxxxx

You can tell which type of UUID it is by looking at one digit: the first digit of the middle of the three four-digit blocks, shown above as the 7. This one is a type 7 because that hex digit is a 7. If it were a 4, it would be a type 4.
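
If you would rather read that digit in code than by eye, it is always the character at index 14 (counting from zero) of the canonical text form. A quick C# sketch, a peek rather than proper validation:

static int UuidVersion(Guid id) => id.ToString("D")[14] - '0';  // fine for types 1 to 8

// Guid.NewGuid() produces a type 4, so this prints 4.
Console.WriteLine(UuidVersion(Guid.NewGuid()));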

As I write this, there are 8 types to choose from. But which type should you use? Type 7. Use type 7. If that’s all you came for, you can stop here. You ain’t going to need the others.

Type 7 – The one you actually want.

This type of UUID was designed for assigning IDs to records on database tables.

The main thing about type 7 is that the first block of bits is a timestamp. Since time always goes forward [citation needed] and the timestamp is right at the front, each UUID you generate will have a bigger value than the last one.

This is important for databases, as they are optimized for “ordered” IDs like this. To oversimplify, each database table has an index tracking each record by its ID, allowing any particular record to be located quickly, like flipping through a book until you get close to the page you want. The simplest place to add a new ID is at the end, and you can only do that if your new ID comes after all the previous ones. Adding a new record anywhere else requires the index to be reorganised to make space for the new one in the middle.

(You often see UUIDs criticised for being random and unordered, but that’s type 4. Don’t use type 4.)

The timestamp is 48 bits long and counts the number of milliseconds since the year 1970. This means we’re good until shortly after the year 10,000. Other than the six bits which are always fixed, the remaining 74 bits are randomness, there so that all the UUIDs created in the same millisecond will be different. (Except it is a little more complicated than that. Read the RFC.)
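
Newer runtimes generate these for you (.NET 9 added Guid.CreateVersion7(), for example). If yours doesn’t, here is a minimal sketch of the layout just described; the bigEndian constructor assumes .NET 8 or later:

static Guid NewUuid7()
{
    // Fill all 16 bytes with cryptographic randomness first.
    var bytes = new byte[16];
    System.Security.Cryptography.RandomNumberGenerator.Fill(bytes);

    // Overwrite the first 48 bits with the millisecond timestamp,
    // most significant byte first so values sort by time.
    long ms = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
    for (int i = 0; i < 6; i++)
        bytes[i] = (byte)(ms >> (8 * (5 - i)));

    // The six fixed bits: version nibble 7, variant bits 10.
    bytes[6] = (byte)((bytes[6] & 0x0F) | 0x70);
    bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);

    return new Guid(bytes, bigEndian: true);
}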

So there we are. Type 7 UUIDs rule, all other types drool. We done?

“I was born in a flame. Mama said that everyone would know my name. I’m the best you’ve ever had. If you think I’m burning out, I never am.”

Migrating from auto-incrementing IDs.

Suppose you have an established table with a 32-bit auto-incrementing integer primary key. You want to migrate to type 7 UUIDs but you still need to keep the old IDs working. A user might come along with a legacy integer ID and you still want to allow that request to keep working as it did before.

You could create a batch of new type 7 UUIDs and build a new table that maps the legacy integer IDs to their new UUIDs. If that works for you, that’s great, but we can do without that table with a little bit of cleverness.

Let’s think about our requirements:

  1. We want to deterministically convert a legacy ID into its UUID.
  2. These UUIDs are in the same order as the original legacy IDs.
  3. New records’ UUIDs come after all the UUIDs for legacy records.
  4. We maintain the “universally unique”-ness of the IDs.

This is where we introduce type 8 UUIDs. The only rule of this type is that there are no rules. (Except they still have to be 128 bits and six of those bits must have fixed values. Okay, there are a few rules.) It is up to you how you construct this type of UUID.

Given our requirements, let’s sketch out how we want to lay out the bits of these IDs.

The type 7 UUIDs all start with a 01 byte, until 2039 when they will start with 02. They won’t ever start with a 00 byte. So, to ensure these converted IDs always come before any new IDs, we’ll make the first four hex digits (the first two bytes) all zeros. The legacy 32-bit integer ID can be the next four bytes.

Because we want the UUIDs we create to be both deterministic and universally unique, the remaining bits need to look random without actually being random. Running a hash function over the ID and a fixed salt string will produce enough output to fill the remaining bits.

Now, to convert a legacy 32-bit ID into its equivalent UUID, we do the following (sketched in code after the list):

  1. Start an array of bytes with two zero bytes.
  2. Append the four bytes of the legacy ID, most significant byte first.
  3. Find the SHA of (“salt” + legacy ID) and append the first 10 bytes of the hash to the array.
  4. Overwrite the six fixed bits (in the hash area) to their required values.
  5. Put the 16 bytes you’ve collected into a UUID type.
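
Here are those five steps as a C# sketch. The hash is left unspecified above, so SHA-256 stands in, and the bigEndian constructor assumes .NET 8 or later:

static Guid LegacyIdToUuid8(uint legacyId)
{
    var bytes = new byte[16];

    // Steps 1 and 2: two zero bytes, then the ID, most significant byte first.
    for (int i = 0; i < 4; i++)
        bytes[2 + i] = (byte)(legacyId >> (8 * (3 - i)));

    // Step 3: the first ten bytes of the salted hash.
    byte[] hash = System.Security.Cryptography.SHA256.HashData(
        System.Text.Encoding.UTF8.GetBytes("salt" + legacyId));
    Array.Copy(hash, 0, bytes, 6, 10);

    // Step 4: the six fixed bits, version nibble 8 and variant bits 10.
    bytes[6] = (byte)((bytes[6] & 0x0F) | 0x80);
    bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);

    // Step 5: wrap the 16 bytes up as a UUID.
    return new Guid(bytes, bigEndian: true);
}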

And there we have it. When a user arrives with a legacy ID, we can deterministically turn it into its UUID without needing a mapping table or conversion service. Because of the initial zero bytes, these UUIDs will always come before the new type 7 UUIDs. Because the legacy ID bytes come next, the new UUIDs will maintain the same order as the legacy IDs. Because 74 bits come from a hash function with a salt as part of its input, universal-uniqueness is maintained.

What’s that? You need deterministic UUIDs but it isn’t as simple as dropping the bytes into place?

“You once thought of me as a white knight on his steed. Now you know how happy I can be.”

Deterministic UUIDs – Types 3 and 5.

These two types of UUID are the official deterministic types. If you have (say) a URL and you want to produce a UUID that represents that URL, these UUID types will do it. As long as you’re consistent with capital letters and character encoding, the same URL will always produce the same UUID.

The down-side of these types is that the UUID values don’t even try to be ordered, which is why I wrote the discussion of type 8 first. If the ordering of IDs is important, such as using them as primary keys, maybe think about doing it a different way.

Generation of these UUIDs works by hashing together a “namespace” UUID and the string you want to convert into a UUID. The hash algorithm is MD5 for type 3 or SHA1 for type 5. (In the case of SHA1, everything after the first 128 bits of hash is discarded.)

To use these UUIDs, suppose a user makes a request with a string value. You can turn that string into a deterministic UUID by running it through the generator function. That function takes two parameters, a namespace UUID (which could be a standard namespace or one you’ve invented) and the string to convert. It runs the hash function over the input and returns the result as a UUID.
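
Your platform very likely ships a generator for these already. If it doesn’t, the recipe is short enough to sketch; here is type 5 in C# (ToByteArray’s bigEndian flag assumes .NET 8 or later):

static Guid Uuid5(Guid namespaceId, string name)
{
    // Hash the namespace UUID (in network byte order) followed by the name.
    byte[] ns = namespaceId.ToByteArray(bigEndian: true);
    byte[] text = System.Text.Encoding.UTF8.GetBytes(name);
    byte[] input = new byte[ns.Length + text.Length];
    ns.CopyTo(input, 0);
    text.CopyTo(input, ns.Length);

    // SHA1 produces 160 bits; everything after the first 128 is discarded.
    byte[] hash = System.Security.Cryptography.SHA1.HashData(input);
    byte[] bytes = new byte[16];
    Array.Copy(hash, bytes, 16);

    // The usual six fixed bits: version 5, variant 10.
    bytes[6] = (byte)((bytes[6] & 0x0F) | 0x50);
    bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);
    return new Guid(bytes, bigEndian: true);
}

// With the RFC's standard namespace for URLs:
// Uuid5(Guid.Parse("6ba7b811-9dad-11d1-80b4-00c04fd430c8"), "https://example.com/");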

These UUID types do the job they’re designed to do. Just as long as you’re okay with the values not being ordered.

Type 3 (MD5) or Type 5 (SHA1)?

There are pros and cons to each one.

MD5 is faster than SHA1. If you’re producing them in bulk, that may be a consideration.

MD5 is known to be vulnerable to collisions. If you have (say) a URL that hashes to a particular type 3 UUID, someone could construct a different URL that hashes to the same UUID. Is that a problem? If you’re the only one building these URLs that get hashed, then a hypothetical doer of evil isn’t going to get to have their bad URL injected in.

Remember, the point of a UUID is to be an ID, not something that security should be depending upon. Even the type 5 UUID throws away a big chunk of the bits produced, leaving only 122 bits behind.

If you want to hash something for security, use SHA256 or SHA3 and keep all the bits. Don’t use UUID as a convenient hashing function. That’s not what it’s for!

On balance, I would pick type 5. While type 3 is faster, the difference is trivial unless you’re producing IDs in bulk. You might think that MD5 collisions are impossible with the range of inputs you’re working with, but are you quite sure?

“I’ve seen this thing before, in my best friend and the boy next door. Fool for love and fool on fire.”

Type 4 – The elephant in the room

A type 4 UUID is one generated from 122 bits of cryptographic quality randomness. Almost all UUIDs you see out there will be of this type.

Don’t use these any more. Use type 7. If you’re the developer of a library that generates type 4 UUIDs, please switch it to generating type 7s instead.

Seriously, I looked for practical use cases for type 4 UUIDs. Everything I could come up with was either better served by type 7, or both types came out the same. I could not come up with a use-case where type 4 was actually better. (Please leave a comment if you have one.)

Except I did think of a couple of use-cases, but even then, you still don’t want to use type 4 UUIDs.

Don’t use UUIDs as secure tokens.

You shouldn’t use UUIDs as security tokens. They are designed to be IDs. If you want a security token, you almost certainly have a library that will produce them for you. The library that produces type 4 UUIDs uses one internally.

When you generate a type 4 UUID, six bits of randomness are thrown away in order to make it a valid UUID. It takes up the space of a 128 bit token but only has 122 bits of randomness.

Also, you’re stuck with those 122 bits. If you want more, you’d have to start joining them together. And you should want more – 256 bits is a common standard length for a reason.

But most of all, there’s a risk that whoever wrote the library that generates your UUIDs will read this article and push out a new version that generates type 7 UUIDs instead. Those do even worse at being security tokens.

I’m sure they’d mention it in that library’s release notes but are you going to remember this detail? You just want to update this one library because a dependency needs the new version. You tested the new version and it all works fine but suddenly your service is producing really insecure tokens.

Maybe the developers of UUID libraries wouldn’t do that, precisely because of the possibility of misuse, but that’s even more reason to not use UUIDs as security tokens. We’re holding back progress!

In Conclusion…

Use type 7 UUIDs.

“Only to find the night-watchman, unaware of his presence in the building.”

Picture Credits.
📸 “Night Ranger…” by Doug Bowman. (Creative Commons)
📸 “Cat” by Adrian Scottow. (Creative Commons)
📸 “Cat-36” by Lynn Chan. (Creative Commons)
📸 “A random landscape on a random day” by Ivo Haerma (Creative Commons)
📸 “Elena” by my anonymous wife. (With Permission)

I want a less powerful programming language for Christmas.

I’m writing this because I’m hoping someone will respond, telling me that what I want already exists. I have a specific itch and my suspicion is that developing a whole programming language and runtime is the only way to scratch that itch.

Please tell me I’m wrong.

Dear Father Christmas…

If you’ve ever written a web service, you’ve almost certainly had situations where you’ve taken a bunch of bytes from a completely untrusted stranger and passed those bytes into a JSON parser. What’s more, you’ll have done that without validating the bytes first.

Processing your inputs without sanitizing them first? Has Bobby Tables taught us nothing?

You can do this safely because that JSON parser will have been designed to be used in this manner and will be safe in the face of hostile inputs. If you did try feeding the bytes of an EXE file into a JSON parser, it’ll very quickly reject it complaining that “MZ” isn’t an opening brace and refuse to continue beyond that. The worst a hostile user could do is put rude messages inside the JSON strings.

{ "You": "A complete \uD83D\uDC18 head!" }

Now take that idea and imagine a web service where completely unauthenticated users could send any request body they liked, and your service would run that request body in a copy of Python as the program source code.

Hopefully, you’ve just now remarked that it would be a very bad idea, up there with Napoleon’s idea to make his brother the King of Spain. But that’s exactly what I want to do. I want to write a web service that accepts Python code from complete strangers and actually run that code.

(And also make my brother the King of Spain. He’d be great!)

“Hang on to your hopes, my friend. That’s an easy thing to say. But if your hopes should pass away, simply pretend that you can build them again.”

At the gates of dawn

Some time in the early 90s, I had a game called “C Robots”.

This is a game where four tanks are in an arena, driving around and firing missiles at each other. But instead of humans controlling those tanks, each tank was controlled by a program written by the human player. The game controller would keep track of each tank and any missiles in flight, passing back control to each tank’s controller program to let it decide what its next move will be.

For 90s me, programming a robot appealed, but the tank-battle part not so much. I really wanted to make a robot to play other games that might not involve tanks. At the time, there were two games I enjoyed playing with school friends, Dots-and-Boxes and Rummy. I had an idea of what made good strategies for these specific games, so I thought building those strategies into code might make for a good intellectual exercise.

Decades passed and I built a simple game controller system which I (rather pompously) called “Tourk”. I had made a start on the controllers for a handful of games but I hadn’t gotten around to writing actual competitive players, only simple random ones that were good for testing. I imagined that before long, people would write their own players, send them in to me and I’d compile them all together. After I’d let it run for a million games in a tournament, I’d announce the winner.

If anyone had actually written a player and sent it in, my first step would have been to inspect the submitted code thoroughly. These would have been actual C programs and could have done anything a C program could do, including dropping viruses on my hard disk, so inspecting that code would have been very important. Looking back, I’m glad no-one actually did that.

But this was one thing C Robots got right, even if it wasn’t planned that way. Once it compiled the player’s C code, it would run that code in a restricted runtime. Your player code could never go outside its bounds because there are no instructions in the C Robots runtime to do that. This meant that no-one could use this as an attack vector. (But don’t quote me on that. I’ve not actually audited the code.)

“I never ever ask where do you go. I never ever ask what do you do. I never ever ask what’s in your mind. I never ever ask if you’ll be mine.”

Will the runtime do it?

Could maybe the dot-net runtime or the Python runtime have the answer?

This was one of the first questions I asked on the (then) new Stack Overflow. The answer sent me to Microsoft’s page on “Code Access Security” and if you follow that link now, it says this feature is no longer supported.

Wondering more recently if Python might have an option to do what I wanted, I asked on Hacker News if there was a way to run Python like this. There were a few comments but it didn’t get enough up-votes and disappeared fairly quickly. What little discussion we had was more to do with a side issue than the actual question I was asking.

I do feel that the answer might still be here. There’s quite possibly some flag on the runtime that will make any call to an extern function impossible. The Python runtime without the “os” package would seem to get 90% of the way there, but I don’t know enough about it to be certain that this wouldn’t leave any holes open.

“We’re all someone’s daughter. We’re all someone’s son.”

Sanitize Your inputs?

Maybe I should listen to Bobby Tables and sanitize my inputs before running them.

Keep the unrestricted runtime, but before we invoke it to run the potentially hostile code, scan it to check it won’t do any bad things.

Simple arithmetic in a loop? That’s fine.
Running a remote access trojan? No.

Once it has passed the test, you should be able to allow the code to run, confident it won’t do anything bad because you’ve already checked. This approach appeals to me because once that initial test has passed the code as non-hostile, we can allow the runtime to go at full speed.

The problem with this approach is all the edge cases, and finding that line between simple arithmetic and remote-access trojans. You need to allow enough for the actually-not-hostile code to do useful things, but not so much that a hostile user could exploit it.

Joining strings together is fine but passing that string into eval is not.
Writing text to stdout is fine but writing into a network socket is not.

Finding that line is going to be difficult. The best approach would be to start with nothing-is-allowed, and when considering what to add, first investigate what would be possible by adding that facility to the allowed list. Because it can be used for bad things, eval would never be on that allowed list.

If there’s a function with a million useful things it can do but one bad thing, that function must never be allowed.

“We can go where we want to. A place they’ll never find. We can act like we come from out of this world and leave the real one far behind.”

Ask the Operating System?

I told a colleague about this post while I was still writing it and he mentioned that operating systems can place restrictions on the programs they run. He showed me his Mac, which has a utility that lists all the apps he runs and the permissions each one has. It reminded me that my Android phone does something similar. If any app wants to interact with anything outside its realm, it has to ask first. This is why I’m happy to install apps on my Android phone but not on my Windows laptop.

This would be great, but how do I, a numpty developer, harness this power? What do I do if I want to launch a process (such as the Python runtime) but with all the permissions turned off? It feels like this will be the solution but my searching isn’t coming up with a practical answer.

My hope is that there’s a code library whose job it is to launch processes in this super restricted mode. It’ll work out which OS it is running on, do the necessary magic OS calls and finally launch the process in that super-restricted mode.

“If I was an astronaut I’d be floating in mid air. A broken heart would just belong to someone else down there. I would be the centre of my lonely universe. I’m only human and I’m crashing in the dark.”

Mmmm coffee!

The good people developing web browsers back in the 90s had the same need as me. They wanted to add a little interactivity to web pages, but without having to wait for a round trip back to the server over dialup, so they came up with a language they named JS.

As you read this page, your browser is running some code I supplied to you. That code can’t open up your files on your local device. If anyone did actually find a way to do that, the browser developers would call that a serious bug and push out an emergency update. So could JS be the solution I’m looking for?

As much as it sounds perfect, that JS runtime is inside the browser. If I have some JS code in my server process, how do I get that code into a browser process? Can I even run a web browser on a server without some sort of desktop environment?

The only project I know of where someone has taken JS outside of a browser is node-js. That might be the answer but I have written programs using node-js that load and save files. If this is the answer then I’d need to know how to configure the runtime to run the way I want.

“Play the game, fight the fight, but what’s the point on a beautiful night? Arm in arm, hand in hand. We all stand together.”

Is there an answer?

I began this post expressing my suspicion that the solution is to write my own runtime, designed from first-principles to run in a default-deny mode. I still wonder if that’s the case. I hope someone will read this post and maybe comment with the unknown option on the Python runtime that does exactly what I want.

In the meantime, I have another post in the works with my thoughts on how this runtime and programming language could work. I hope I can skip it.

Gronda-Gronda.

Picture Credits
📸 “Snow Scot” by Peeja. (With permission.)
📸 “Meeting a Robot” by my anonymous wife. (With permission)
📸 “Great Dane floppy ears” by Sheila Sund. (Creative Commons)
📸 “Fun with cling film” by Elizabeth Gomm. (Creative Commons)
📸 “Rutabaga Ball 2” by Terrence McNally. (Creative Commons)
📸 “Nice day for blowing the cobwebs off” by Jurassic Snark. (With permission.)

(And just in case advocating for your brother to be made King of Spain is treason or something, I don’t actually want to do that. It was a joke.)

Why do we repeatedly hash passwords in a loop?

If you’re building a website that allows the public to log in, you need to store passwords so you can check your users are who they say they are when they log in. This is my introduction to the current state of the art for storing your users’ passwords in your database.

Make It Someone Else’s Problem

I’ll say this right up front: the best way is to get someone else to do it. Use an outsourced service or install a component that deals with the whole thing. You’ll have passed responsibility to someone whose very speciality is already knowing everything I’ve written here, as well as all the nuances I’ve skipped over.

But that’s not always acceptable. Sometimes you need to build your own system.

“Knock three times on the ceiling if you want me.
Twice on the pipe, if the answer is no.”

Doing it wrong – Store the password

We’ll start with various wrong ways to do it and build up to the right way.

The first wrong way is to store the password in the clear in your database. You’ll have a table of users, add a string field called “password” and store it there. When a user comes along to log in, you compare the supplied password with the actual password and if they match you let the user in.

One problem is that your user database might leak and all your users’ passwords are right there. Do you trust all your insiders? Are you quite sure that all components of your system are leakproof? There’s little you can do to stop a trusted insider from having a peek at just one user’s record. What are you going to do, have no trusted insiders?

“Enemy lasagne. Robust below wax. Semiautomatic aqua. Accompany slacks. White coffee gymnastic. Motorcycle unibrow.
Existential plastic. Extra nightly cow.”

Better but still wrong – Hash the password first

If the problem is that someone knows what everyone’s password is, the solution is for no-one to know what anyone’s password is. As luck would have it, there’s a branch of cryptography that’s perfect for this – the hash function. Instead of storing the password in clear, store a hash instead.

A hash function takes a string of characters and mixes them all up in a repeatable way. Unlike encryption, there isn’t a key and you can’t get the original text back. For example, the SHA1 hash of “rutabaga” is “C8A52CE9 1ED32187 38D43809 B31856AB 619E0ABE”. This will be the same today, tomorrow and forever.
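
You can check that for yourself in a couple of lines of C# (on a modern .NET):

byte[] hash = System.Security.Cryptography.SHA1.HashData(
    System.Text.Encoding.UTF8.GetBytes("rutabaga"));
Console.WriteLine(Convert.ToHexString(hash));
// Prints the value quoted above. Today, tomorrow and forever.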

The first time a user registers with your service, they supply the password they want to use, but before writing it to the database, you run a hash over the supplied password and store the result of the hash instead. Later, the same user comes back and types in their password again. You run the hash over the supplied password and compare it against the hash in your database. If they match, let the user in.

The other useful property of a hash function is that it is irreversible. There’s no secret key to go from “C8A52CE9…” back into “rutabaga”. If all you have is the hash, the original text is lost. Now, if an attacker gets a copy of the user database, they have a problem. All they have is the result of the hash and there’s no way to get the original password back from that – and that’s what you need to log in.

“Music’s on, I’m waking up, we fight the fire, then we burn it up,
and it’s over now, we got the love, there’s no sleeping now.”

Except you can reverse a hash.

The Bad Guys: “Tell us the original password that produced this hash result!”
Hash Functions: “We’re designed for that to be impossible.”
The Bad Guys: “Really?”
Hash Functions: “Yes. You’d literally have to try every single possible input and store the result in a lookup table.”
The Bad Guys: “Okay, we’ll do that.”
Hash Functions: “Wait, what?”

Hash functions are designed to be one-way. There’s no hint of what the original text could have been because none of that information survives. But there’s a way around that detail.

A problem with humans is that we are predictable in how we think up passwords. We like words from the dictionary, patterns, and so on. From this knowledge, we can make a list of all these likely potential passwords, then find the hash of each one, storing the original text against each hash. This might sound like a lot of computation, but we only need to do it once.

Finally, the clever bit, sort the list by the hash.

There we have it. The Bumper Book of Password Hashes. Each hash, one per line, with the text that went in to produce that hash next to it.

     The Bumper Book of Password Hashes - SHA1 Edition
C8A52CE9062E654D02D08B9AE56BE5A16A3C7663 =)Ve06Va
C8A52CE90DCA962E41A8E164EB649207206E553B h30/4h50
C8A52CE91C77FB87893CA977353A65F8C406AA69 Ds?F8Jjj
C8A52CE91DBEF9713D61537840CC58F0D8D4B3E9 HPpxLGT/mevs
C8A52CE9295EA07D4AD52A1DF84D442E3E106A37 7-KDA-)0:0aF
C8A52CE92A077F5A2944D4E20A2953FDF56570F0 oG6Ksdc
C8A52CE9351C7D852B09CAE66B1B0D9DB204838A =C0V/5et9s
C8A52CE93B4B6AE01A8985C2FE96371967A40DCB -j0880YA3b
C8A52CE9426DC99277D114CAB37971B65D18F8B9 a^cY=e3%u67
C8A52CE9451C0944A561CC5E76D0D62C61083A56 4UJKQLwhuQ
C8A52CE950C8A276987097569EB248D2E4D68EB9 hTu3sbX3g
C8A52CE958C75B126B6D9772D1C430DF6B5CC785 V7Qej5q8Ly3r
C8A52CE962606E0ED8617AD9A6C8C9C84FF202FE rUEOy6ZW
C8A52CE968E0BEC0CEF5E1D93AF7EFD1987C60CF =hL)F#sDN08r
C8A52CE97214314C4DE54168B6D5F7CCEDF35D3E NXd241ts
C8A52CE9733B9EED59E95F3A0BCA6594B5BB0841 N0KjP2n7j
C8A52CE98E9DA6676C5B0009312A9EF289305236 ue52C^Jc0aA)2N#
C8A52CE992F14E7020DC40896AB929D838A118F3 1s/2J00HT)Xt#t5
C8A52CE9A4BF120810B7D9B24F77031184CCF01C 06PeP)r8cr
C8A52CE9A9D6D36FA9A1BC2D376A91B221DE83B2 c8DL?Tbr)23:t*
C8A52CE9B37385A2CC1894A083E87ACD2EDCE026 z0VoZ/Sw1orL
C8A52CE9CCC4088AFEAD6534B827FDB657706EA9 nnNeYZLxeg
C8A52CE9CD75EB936FA3B0EEED25B1322C913996 k0StwVCnwA
C8A52CE9DE11A6B3739D726FE29B067DC1DD470C KL%du)YF
C8A52CE9E99069CC192876B00788632AE75965E6 mdVg/C2Y
C8A52CE9ED8DE406CD60F95D5B1B64CD3C3BF1AC DnY73:8e
C8A52CE9F8CE32484D73B7B179048E3FB91061EB 4#cN6bYVV)b#*^9
C8A52CE9FE3FDC64F6F088D2DC41EB85CF97D465 Y866(-)5
                                       Page 3,366,268,137

Suppose you’ve got someone’s password hash which starts with C8A52CE9 and you want to know what password produces that particular hash. Grab the book and flick through until you get to the pages with all the hashes that begin C8A52CE9. If it was included in the original set, the original password will be listed right there.

(This technique is better known as a “Rainbow Table”. My name is better.)

A popular service for looking up password hashes is known as Google. You might have heard of it.

Google search for a hash result, returning "rutabaga" as the obvious source of the hash.
“Full moon in the city and the night was young. I was hungry for love, I was hungry for fun. I was hunting you down and I was the bait. When I saw you there I didn’t mean to hesitate.”

Good but not quite done – Salt

A way to make the Bumper Book of Password Hashes obsolete is to add “salt” to the hash. Instead of hashing only the password, add some random bytes into the mix and hash the combination of the password and the salt.

The book might list the hash of “rutabaga”, but it isn’t going to list the hash of “(lots of randomness)rutabaga”. That simple act of adding some random bytes means the book is now useless.

If an attacker manages to find a leaked copy of the user database, they will be able to start guessing and checking on their own. If you make sure each user has different salt bytes, then any computational effort the attacker does make is only good for one single user. Even if an attacker found the password of one user, there’s nothing to bring forward to attack the next user. Even if both users use the same password, the attacker has to start again.

Hopefully that extra effort buys enough time for the service admins to realise the leak has happened and start replacing passwords.

How long? Let’s make their job even harder.

“Matthew and Son, the work’s never done, there’s always something new. The files in your head, you take them to bed, you’re never ever through. And they’ve been working all day, all day, all day!”

The state of the art – Password Stretching

Through the long journey, we’ve arrived at the current state of the art.

These are open standards and your web platform almost certainly has a library that implements most of them. This article isn’t going to recommend one over another. We’ll just say they’re all pretty darn good except for the ones which are not. (Okay, start by searching for “PBKDF2” and see where it leads you.)

The hash functions we’ve encountered so far are fast. They’re designed that way. For passwords, what we really want is something slow. You might think being deliberately slow is a bad thing, but let’s follow this rabbit down the hole.

Instead of a nice fast hash like SHA1, we’re going to use SHA1000. It’s just like SHA1 in terms of being one-way and such. The difference is that it is so badly designed it takes a thousand times more processing time to finish.

So why on earth would we use such a badly designed hash? The answer is that not only do you have to spend the processing time running it, so does your attacker. They were already looking at spending a large amount of processing time going through every word in the dictionary looking for a password. By using SHA1000 instead, you’ve just multiplied their workload by a thousand!

These password stretching algorithms aren’t actually badly designed hashes, but they are configurable for how difficult you want them to be. PBKDF2 can be set to have a number of rounds. One round is the same workload as SHA1. Three hundred thousand rounds is a lot more.

Imagine you’re storing your passwords with PBKDF2 set to 300,000 rounds and each user has a unique salt. When a user logs in, you look up that user’s salt and start running the PBKDF2 code for 300,000 loops with the supplied password. If the end result matches the expected result, you allow the user in.
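
For a concrete picture, here is a minimal C# sketch of that flow, assuming .NET 6 or later for the static Pbkdf2 helper; the hash choice and sizes are illustrative:

using System.Security.Cryptography;

// At registration: a fresh salt per user, then 300,000 rounds.
byte[] salt = RandomNumberGenerator.GetBytes(16);
byte[] stored = Rfc2898DeriveBytes.Pbkdf2(
    "the chosen password", salt, 300_000, HashAlgorithmName.SHA256, 32);

// At log-in: re-run with the stored salt and compare in constant time.
byte[] attempt = Rfc2898DeriveBytes.Pbkdf2(
    "the supplied password", salt, 300_000, HashAlgorithmName.SHA256, 32);
bool allowedIn = CryptographicOperations.FixedTimeEquals(attempt, stored);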

For an attacker with a leaked copy of each user’s salt and expected hash, they can start guessing and checking over and over. Try each word in the dictionary and see if the result matches the expected result for each one. The attacker is faced with a ridiculous amount of computer time to go through all of that.

Now we’ve caught up, let’s head over to part two.

Picture Credits:
📸 “Password” by mk_is_here.
📸 “Equal in stature” by Kevin Dooley.
📸 “IMG_3310” by oFace Killah.
📸 “Entropy” by Robert Nunnally.
📸 “A rainbow in salty air” by torne.

What’s a kWh? (And other money-saving tips)

When I was at school, they taught us how electricity works only as part of science lessons. It was something future engineers might need, yet we all use electricity at home every day.

The problem with electricity is we’re a little bit separated from its cost. With cars, we fill up the car with fuel and pay for it right there and then. With electricity, we use many different appliances which all add up to an eye-watering bill at the end of the month.

This is my guide to what everyone needs to know about electricity.

Introducing the kWh.

Electricity is sold in units of “kWh”. We’ll come to exactly what those three letters mean later on but for now, imagine your electricity is being delivered to you in barrels, each one a standard size called the “kWh”. Think about your local electricity station and imagine one of these “kWh” barrels of electricity being hooked up to the wires that lead to your home. When a barrel empties, someone comes along and replaces it with a new full barrel.

The “kWh” has a scientific definition that all electricity suppliers agree on. It is so ubiquitous that if any supplier decides to use a different unit, they’re most likely up to something dodgy.

How much is a single kWh barrel of electricity? Check your electric bill. Here’s mine…

The 45¾p per day standing charge is fixed. It doesn’t matter how much or how little I use; I still have to pay that 45¾p every single day and there’s little I can do about that other than maybe switch providers.

More interesting is the 33p per kWh. At the end of each month, they count up all the empty barrels of electricity I’ve gone through and bill me 33p for each one. I’ll use that figure in my examples but do look up your own rate and replace it with however much your kWh costs.

Also note that it doesn’t matter how quickly I go through each barrel of electricity. If I go away for a few days leaving everything except the fridge switched off, it will take a lot longer to finish that barrel than when I’m home and everything is switched on. Either way, they still charge me 33p once that barrel is empty.

We’ll now pull apart those three letters, but always keep in mind that metaphor of barrels of electricity hooked up to the wires leading to your house.

Little barrels on the hillside.
Little barrels full of ‘tricity…

What Watt?

The W is short for the “Watt”, named after James Watt. If you’ve seen a capital W or “Watts” or “Wattage”, they all mean the same thing. The number of Watts an electrical appliance has is a measure of the rate at which it consumes electricity over time. If you like, think of it as the speed that something eats electricity coming out of the outlet on the wall.

"High power fan heater. 3000 Watt. 2 heat settings, 1500W/3000W. Adjustable thermostat with overheat cut out protection."

This heater consumes electricity at a rate of 3000 Watts, or 1500 Watts if you use the low setting. Because one Wattage figure is twice as much as the other, you can safely assume that the high setting consumes electricity exactly twice as fast as the low setting.

Lightbulb in packaging. "15 year warrantee. 13.5W. 100W replacement. 1527 Lumens."

This lightbulb consumes electricity at a rate of 13.5 Watts, yet it shines as brightly as an old-fashioned 100-Watt filament lightbulb. Quite the improvement!

A quick exercise: Find an electrical item in your home and look up its Wattage figure. It might be on a label or written on the original packaging. If you can’t find it written down, try using a search engine.

Ooh kay!

1 kW (or one kiloWatt) means exactly the same thing as 1000 W. Adding “k” to “W” to make “kW” means the amount is multiplied by one thousand. The heater above could have “3 kW” printed on the box instead of “3000 W”. It would mean exactly the same thing.

Devices that draw a small amount of electricity like lightbulbs or phone chargers are usually rated in Watts, while larger devices that eat a lot of electricity like ovens or electric car chargers are typically rated in kW. They mean the same thing underneath.

Whoever makes your electrical appliances might have a personal preference for small numbers in “kW” or big numbers in “W”. The manufacturer of that heater probably wants to emphasise how well it heats, so they prefer to use the bigger number of “3000 W” instead of “3 kW”. More W equals more heat.

Our hours

The last letter is “h”, which is short for an “hour”, named after its inventor Sir Claudius Hour. (At least that’s what a man at the pub told me. He might have been joking.)

You know what an hour is, don’t you? It’s the time it takes to watch a normal episode of Star Trek with ads. It’s how long it takes me to walk all the way around my local country park if I don’t stop. It’s the time it takes to walk my sister’s dog before she (the dog) gets tired.

“And I would walk 500 miles and I would walk 500 miles more.”

All together now!

Now we know what each letter of “kWh” stands for, let’s bring them all together. A “kWh” is the amount of electricity consumed by a 1000 W appliance if it is left on for an hour.

Find an appliance that’s rated at 1 kW. Plug it in and switch it on for an hour and then switch it off. You’ll have used exactly one kWh and your electricity bill will have gone up by 33p. (Or whatever your supplier charges.)

Let’s work out a practical example. Recall that 3000W heater from earlier. How much do you think it costs to run that heater for five hours on the high setting? We’ll ignore practical realities like the built-in thermostat and assume it goes for five hours straight with no gaps.

3000W is the same as 3 kW and we want to run it for 5 hours, paying 33p for each kWh. Multiply those numbers together:

3 kW × 5 h × 33 p/kWh = 495p (or roughly £5.)

Try this calculation yourself. Pick an electrical appliance in your home and find its rated wattage. Think about how long you switch it on for and work out how much it costs to use it for that amount of time.

Applying the knowledge

It can be tempting to look at how much some appliances like heaters or ovens cost and conclude the only way to save money is to be cold and not eat. I hope that’s not the conclusion you draw. The benefit of knowing how much something costs to use is that you can make informed choices.

Will buying an air fryer save you money when your kitchen already has an oven? Work out how much it costs to cook your favourite meal in the oven then do the same for an air fryer. If you know both in actual pennies, you can make an informed decision to make that purchase or not.

While the Wattage figure tells you the rate it consumes electricity, it may be that the higher Wattage appliance gets the job done faster. Say you have a choice of two kettles, one running at 1 kW and the other at 3 kW; it may seem at first blush that the 1 kW kettle will cost less. However, if the 3 kW kettle gets the water boiled in a third of the time the 1 kW kettle takes, they will cost the same to use.
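
To put illustrative numbers on that: say the 1 kW kettle takes 9 minutes and the 3 kW kettle takes 3 minutes. Then 1 kW × 0.15 h × 33 p/kWh is about 5p, and 3 kW × 0.05 h × 33 p/kWh is also about 5p. Same pennies; you just get your tea sooner.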

Does your supplier offer a different service with more expensive electricity during the day and cheaper electricity overnight? Which appliances would you use overnight when the kWh barrels are cheaper? Would that save you money overall?

Many thanks to my wife and my brother Andrew for their helpful feedback. Thanks also to my local B&M store for the pictures of lightbulbs and heaters I took while shopping there.

Creative Commons Picture Credits:
📸 “saturday recycle” by Andrea de Poda.
📸 “sad kilo” by “p med”.

You don’t have to wear a blazer to school on a hot day!

Remember, if you’re going to school on a hot day, you can leave your blazer at home if it’s normally part of your uniform.

“When you say bronze doesn’t need to be chipped, my questions is this, doesn’t it?”
  1. Teachers who insist that you wear a blazer on hot days can be ignored. They have a callous disregard for the discomfort caused by excess thick layers in hot weather and such callousness does not deserve respect.

  2. Teachers who insist that you bring your blazer to school and carry it around can also be ignored. These ones might not be callous but they are ridiculous. Making you pointlessly carry around some heavy item? Why?

  3. If you’ve been given a detention for not having your blazer, you don’t have to turn up. Go home at the normal time and let the ridiculous teacher whine to themselves. You’ve not broken any rules.

  4. If you need to, show your teacher this page.

  5. If you are a teacher who has just been shown this page by one of the children in your care, please stop making them bring their blazers in on hot days. It is people like you who caused the rise in belief in the flat-earth. “If people in authority can be so wrong about blazers, maybe they’re also wrong about the earth being a globe.” If you really must enforce rules, why not good rules like the one about running with scissors? (If I’ve not convinced you that you’re in error, maybe teaching isn’t right for you. Why not consider a career in cooking, where you’re meant to be heating things up?)

  6. I’m not the one undermining teachers’ authority. Teachers who are under the delusion that blazers are required are undermining their own authority by attempting to enforce such ridiculous rules.

  7. Yes, I do know better than those teachers. Thank you for noticing.

Picture Credit:
📷 Close up of blazer pocket emblem for boys school group by “Kaye”.

My adventure into self web-hosting (Part 1)

If you had asked twenty-something me how he thought forty-something me would be hosting his website, he’d have predicted I had a rack of small servers in my attic, as part of a grid-computing business. (That’s what we called “cloud” computing back then.)

He’d have been disappointed to find out I’m using a shared web-hosting service, but that may change.

“The end of the day, remember the way, we stayed so close to the end, we’ll remember it was me and you ’cause we are gonna be…”

Over the Cliff

It all started when my article, Data-Mining Wikipedia for Fun and Profit, made it to the top of Hacker News and stayed there for three hours. I was careful to try not to overburden the system by switching on an HTML cache. This way, visitors would only be served up static files, without the server having to run the PHP code or talk to the database. Despite that, the server went down and I had to post a sheepish comment with a link to a mirror.

It was clear I was out-growing my current web-host. Despite my precautions, it couldn’t handle being popular for a few hours. Not only that, I’m a software developer and I wanted to develop software. The only practical choice on this service was PHP and I had long decided that life was too short for that.

I started looking at VM services as the natural next step on the ladder, but it was a chance discussion, again on Hacker News, that gave me an idea.

Clifford Stoll: “a heavy load on my raspberry-pi web server told me something was happening…”
Me: “your web server is a Raspberry Pi, and it’s holding up while being on the HN front page?”
CS: “Hi Bill, Yep. Cloudflare is out front, so the actual load on the rasp-pi is mitigated by their content-delivery network.”

Suddenly, the idea of hosting a web server in my attic became real again. Reality had long since taught me that residential ISPs were no good for serious web hosting – but if there was a service that could deal with the bulk of GET requests and it could cover the occasional outage on my side from its cache, that’d change everything.

“Can you deal with my GET requests?”

Tunnelling

At the time, that Raspberry-Pi web server was on his residential ISP with a public IP address. That arrangement wouldn’t work for me as my own ISP didn’t allow their customers to run services like that. However, in that same comment thread, the CTO of Cloudflare himself (John Graham-Cumming) mentioned to him that they had a new service that allowed their customers to VPN out to Cloudflare, making such port-forwarding shenanigans a thing of the past.

(As a not-quite declaration of bias, Cloudflare are on my list of companies I would like to work for should my current day-job come to an end. I am not (yet) an employee of Cloudflare and they’re not paying me to write this in any case. By the time you come to read this, that might have changed.)

This service is completely free. While I like not having to pay for things, it does make me a little nervous. This particular service isn’t going to be injecting ads into my site and I do understand how the free tier fits into their business model. But still, I’ve been burnt by free services suddenly disappearing before and you get no sympathy if you’ve become dependent on them. I kind of wish I could give them a few pounds each month, just in case.

Leaving such concerns to one side, I had a plan. Acquire a server and install it into one of the slots on my IKEA KALLAX unit the TV is sitting on. Plug it into my ISP’s router and once that’s running, install a web server along with the VPN software. I’ll finally be in charge of my very own web server, just like the twenty-something me thought I’d be.

“If I get to know your name, well I could trace your private number, baby. All I know is that to me, you look like you’re lots of fun. Open up your loving arms, I want some, want some. You spin me right round, baby, right round, like a record, baby, right round…”

Quiet!

I had acquired a second-hand PC for this purpose but once I got it home it was way too noisy. I needed a machine I could leave switched on 24/7 in the lounge where we watch TV. My server would have to be really quiet.

I also considered a Raspberry Pi, the same hardware Clifford Stoll used, but I wasn’t going to only be running a few WordPress instances. I had an idea I wanted to develop and I’d need a database with plenty of space for that to work. An SD card and maybe some USB storage wouldn’t cut it.

I’m not in any particular hurry to buy it as I still want to plan some more before the new machine starts taking up space. It was while I was reading reviews for various machines that I had the craziest of crazy ideas.

“And as we sit here alone, looking for a reason to go on. It’s so clear that all we have now are our thoughts of yesterday. La, la la la…”

It comes with Windows

Any PC I could buy is going to come with Windows pre-installed and fully licensed. I was always going to replace it with a variety of Linux, but I wondered, why not keep that copy of Windows?

Before you all think I’ve gone insane, there are a few benefits to doing it this way. I use Windows a lot for my day job so I’m familiar with its quirks and gotchas. Even though there’s a dot-net for Linux, my development machine runs Windows so there would be fewer surprises when the development machine runs the same OS as the production machine. For the handful of WordPress sites I wanted to run, there were docker images available. Finally, because it won’t be directly connected to the scary internet I wouldn’t have to panic when there’s an update.

But even as I’m writing this, I feel I’m going to regret doing it this way. I just know I’ll be writing part six of this series and it’ll be all about installing Linux on that server machine because there’s just one stupid thing I couldn’t get working on Windows. We shall see.

A foreshadowing?

Join me for part 2 of this series, where I’ll be experimenting with getting WordPress running from a Docker container. Wish me luck.

Picture Credits:
📸 “Kee-kaws”, by me.
📸 “Duke”, by my anonymous wife.
📸 “Haven Seafront, Great Yarmouth”, by me.
📸 “Quiet Couple” by Judith Jackson. (CC)
📸 “Blisworth Canal Festival, 2019”, by me.

My Incredibly Stupid Diary

🥇First Entry
⚾ Random Entry

Years ago, 2004 to 2007, I had a website. It was mildly popular – I counted the number of readers and found I had eleven regulars. I called it “The Incredibly Stupid Diary of Bill”, although I added a few friends as writers and “of Bill” very soon became “of Bill et al”.

I occasionally posted long form pieces, but mostly it was quick-and-short stuff that these days I would post to Facebook or Twitter. I used Blogger before it was BlogSpot. Back then, it worked by connecting to my web server and uploading HTML files over FTP. I’d leave my password configured with Blogger so that in case anyone commented, they could update the page with the comment without having to wait for me to allow it.

Along the way, I started a weekly feature – Animated Short of the Week. Each Sunday, I’d pick a Flash-based animation and post a link to it. These would usually be my favourite from the back-catalogue on AlbinoBlackSheep but it was something I really enjoyed doing. It would also become an incentive to post *something* as I wouldn’t want to have two animation posts next to each other. I made the decision to stop posting them after 100 posts. It was becoming more and more difficult to find good animations and it felt like the quality was on the decline, so 100 selections seemed a good place to stop.

“You may find yourself behind the wheel of a large automobile.”

Time passed and I eventually stopped writing. I had a new hobby, making old-school YouTube videos. These were the days when videos were limited to ten minutes and there was no such thing as a professional YouTuber. You can see the decline from the last handful of posts – 80% of them are just links to my videos.

When I finally made the decision to moth-ball the site, I wrote one last post and published it. A few more comments were written and the servers at Blogger dutifully updated my website via FTP, but that was it. One day, I changed my password on the web server but didn’t update it on Blogger. That last revision would stay frozen as it was left, with a non-functioning comments form to boot.

For a while, my website became nothing more than a bunch of links to my social media websites, although my old posts were still there if you knew the addresses, ready to respond to searches. By now it was a folder full of static files, just as it was left when Blogger did that last FTP connection.

Now, I’ve been reminded about that old website and I wanted to give it a bit of a tidy-up. There were several files, all with very similar HTML structures. I wrote a program to loop through each file, removing obsolete stuff like the comments form, adding a navigation gadget and making it a nice website again.

A lot of external links have since gone, so I wrote some code to change those links to archive.org links, using the time-stamp of the original post. I made an exception for the AlbinoBlackSheep links as the archive.org copies were all of the original Adobe Flash, which doesn’t work any more, whereas the current AlbinoBlackSheep website uses updated video files.
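
The rewrite itself is nearly a one-liner, since archive.org snapshot addresses are just the original URL behind a timestamp prefix. A C# sketch:

// yyyyMMddHHmmss is the snapshot timestamp format archive.org expects.
static string ToArchiveLink(string url, DateTime posted)
    => $"https://web.archive.org/web/{posted:yyyyMMddHHmmss}/{url}";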

I hope you like it. There is an awful lot of rubbish there but a few gems too. I’ll be making a few new posts reacting to some of the crazy stuff I wrote. Good times.

Start with the first post: Let’s try that again.
Or jump to a random post.

Data-Mining Wikipedia for Fun and Profit

It all started after watching one too many videos narrating the English monarchy, all starting from King William Ⅰ in 1066 as if he’s the first king of England. This annoys me as it completely disregards the handful of Anglo-Saxon kings of England who reigned before the Normans.

They’re Kings of England. If you’re going to make a list of the Kings of England, then you should include the Kings of England.

It was this that made me want to make a particular edit to both the King Alfred and Queen Elizabeth pages on Wikipedia, acknowledging each as related to the other. But what is their relationship, and through whom?

I went to the page for Queen Elizabeth Ⅱ and started following the Mother/Father links until I found my way to King Alfred, mostly going through the other kings of England. I counted 36 generations, but was there a shorter or even longer route?

Sounds like a job for some software!

Gâteau Brûlé.

Scanning Wikipedia

We have the technology.

  • Visual Studio 2019 and C#.
  • RestSharp, a library for downloading HTML.
  • HtmlAgilityPack, a library for parsing and extracting data from HTML.

With these libraries downloaded from nuget, I was able to write some very quick and dirty code that would download the HTML for the Wikipedia page of Queen Elizabeth II, storing the HTML in a cache folder to save re-downloading it.

Once the HTML is downloaded (or read from the cache), HtmlAgilityPack can be called upon for the task of pulling items of data from the HTML. For example, the person’s full name, which is always the page’s only <H1>…</H1> element, can be extracted using one line of code:

string personName = 
    html
    .DocumentNode
    .Descendants()
    .Where(h => h.Name == "h1")
    .Single()
    .InnerText;

I used HtmlAgilityPack and LINQ in a similar way to pull out the Mother and Father for each person. The code would look for the info-box <TABLE>, then look inside for a <TH> with the text “Mother” or “Father”. It would then take a few steps backwards to look for the <TR> that the text is a part of and finally pull out all the links it can find inside.
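
That walk comes out roughly like this; a sketch, with the “infobox” class check and the method name being my guesses at the details:

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static List<string> FindParentLinks(HtmlDocument html, string label)
{
    // Find the info-box, then the <TH> that says "Mother" or "Father".
    var infoBox = html.DocumentNode
        .Descendants("table")
        .First(t => t.GetAttributeValue("class", "").Contains("infobox"));
    var header = infoBox
        .Descendants("th")
        .FirstOrDefault(th => th.InnerText.Trim() == label);
    if (header == null)
        return new List<string>();

    // Step back to the enclosing <TR> and pull out every link inside it.
    return header
        .Ancestors("tr").First()
        .Descendants("a")
        .Select(a => a.GetAttributeValue("href", ""))
        .Where(href => href.StartsWith("/wiki/"))
        .ToList();
}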

With the links to Queen Elizabeth’s mother and father, the code would add those links to a queue, and the top level would pull the next link and continue until the links run out.
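
The top level is then a plain breadth-first loop. DownloadOrReadCache here stands in for the caching downloader described earlier:

var queue = new Queue<string>();
var seen = new HashSet<string>();
queue.Enqueue("/wiki/Elizabeth_II");

while (queue.Count > 0)
{
    string page = queue.Dequeue();
    if (!seen.Add(page))
        continue;   // already visited under this name

    HtmlDocument html = DownloadOrReadCache(page);
    foreach (string parent in FindParentLinks(html, "Mother"))
        queue.Enqueue(parent);
    foreach (string parent in FindParentLinks(html, "Father"))
        queue.Enqueue(parent);
}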

Calm down!

This section was added after initial publication.

I would hope that people don’t need to be told to be considerate, but please be considerate.

Before I started on this project, I checked Wikipedia’s robots.txt file. This told me that my project was acceptable, quoth: “Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please.”

The article pages were exactly what I wanted. My code was already fairly low speed as it was all in a single thread. Nonetheless, I added a short delay after each download once I had worked the kinks out. I also set the User-Agent text to include my email address and phone number so Wikipedia server admins could raise an alarm with me personally if necessary.

As I was running my code in Visual Studio’s debug mode, I could leave the code running unattended (once I had observed it over the first hundred or so) with some breakpoints to stop everything until I could return to inspect what happened.

The most important breakpoints were around the examination of the response from Wikipedia. If the response was anything other than a 200/OK response (after redirects) or anything other than HTML, I wanted my code to stop dead until I could inspect what happened. Even if it happened overnight, I still want that response object in memory.

In the end, the bulk of the download took two days in a number of bursts. I’ll be sending a modest donation to the Wikimedia Foundation in thanks for accommodating my bizarre projects.

“She’s just a girl who says that I am the one…”

I made the decision here to only include people with an info-box. Extracting someone’s parents from free English text was a step too far. If you’re not notable enough to have an info-box with your parents listed, you’re not notable enough for this project. (Although I did find a couple of people who didn’t have a suitable info-box surprisingly early in the process. Rather than hack in an exception, I edited Wikipedia to include those people’s parents in their info-box, copying the link from elsewhere in the text.)

While that got me out of a small hole, more annoying was when the info-box listed “Parents” or “Parent(s)” instead of Mother and Father. I wanted to track matrilineal and patrilineal lines, so it was a little annoying to just have an individual’s parents with no clear indication of which one is which. I coded it so that if there’s only one link, it assumes that link is the father. If there are two links, it assumes the father is the first one.

Because patriarchy.

“Also known as…”

Another issue was that some of the pages changed names. RestSharp would dutifully follow HTTP redirects, but I’d end up storing a page with one name but having a different name internally. This happened right away as the page for Queen Elizabeth links to her mother as “Elizabeth_Bowes-Lyon“, but once you follow the link, you end up at “Queen_Elizabeth_The_Queen_Mother“.

The HTML included a <LINK> tag named the “canonical reference”, so I could pull that out and use it as the primary key in my data structure. To keep the link between child and parent, the code collects the aliases when they are detected, and a quick reconciliation loop corrects the links after the initial loop completes.
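
Pulling that tag out is another small LINQ query, along these lines:

string canonical = html.DocumentNode
    .Descendants("link")
    .Where(l => l.GetAttributeValue("rel", "") == "canonical")
    .Select(l => l.GetAttributeValue("href", ""))
    .Single();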

King Alfred, also known as The Muffin Man.

From Alfred to Elizabeth.

Once I had a complete set of Wikipedia pages cached, the next step was to build a tree with all of the parental connections that lead from King Alfred to Queen Elizabeth. I knew that some non-people had crept in because someone’s parents would be listed as “(name) of (town)”, but that didn’t bother me as those towns wouldn’t have a mother or father listed and those loose ends would be discarded.

I wrote some code to walk the tree of connections. It started from Queen Elizabeth and recursively walked to each mother and father node. If a chain ended on King Alfred, the complete chain would be added to the list of chains.
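
In sketch form, with Person, Mother, Father and Canonical as stand-ins for my actual data structure:

static void Walk(Person person, List<Person> chain, List<List<Person>> found)
{
    chain.Add(person);
    if (person.Canonical == "/wiki/Alfred_the_Great")
        found.Add(new List<Person>(chain));   // a complete chain
    else
    {
        if (person.Mother != null) Walk(person.Mother, chain, found);
        if (person.Father != null) Walk(person.Father, chain, found);
    }
    chain.RemoveAt(chain.Count - 1);   // backtrack
}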

With this reduced set in place, I churned through the nodes and generated a GraphViz file. For those who don’t know about it, this is an app for producing graphs of connected bubbles. You tell it what bubbles you want and how they are connected, and it automatically lays them out.
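
The file itself is just text, so generating it is a simple loop. A minimal sketch, with reducedSet and the Person properties again standing in for my data structure:

// Write a GraphViz "dot" file: one edge per parent-to-child link.
using (var dot = new System.IO.StreamWriter("alfred-to-elizabeth.gv"))
{
    dot.WriteLine("digraph royals {");
    foreach (var person in reducedSet)
    {
        if (person.Mother != null)
            dot.WriteLine($"  \"{person.Mother.Name}\" -> \"{person.Name}\";");
        if (person.Father != null)
            dot.WriteLine($"  \"{person.Father.Name}\" -> \"{person.Name}\";");
    }
    dot.WriteLine("}");
}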

At this point, I was expecting a graph that would be mainly tall and thin and would appear right here in this article. While family trees do grow exponentially, I wasn’t including every single relationship, only those chains that connect the two individuals. If I were graphing the relationships between myself and a distant ancestor, I’d expect a single line, each parent handing over to their child. There would be a few bulges where third-or-so cousins marry. There, an individual’s two children would split off into separate lines, eventually reuniting in one ever-so-slightly inbred individual.

Yeah, that’s not what I got. This is the SVG file GraphViz generated for me. If you follow this link and are faced with a blank screen, scroll right until you find the King Alfred node. Then zoom out.

Aristocrats…

(The bubbles are all clickable, by the way.)

Count the Generations.

The graph was interesting but this wasn’t the primary objective of this exercise. I wanted to write “He is the n-times great-grandfather of his current successor Queen Elizabeth.” on King Alfred’s Wikipedia page.

But what’s the n? I already had a collection of all the chains between the two, so I just had to loop through them to find the longest and shortest. The longest chain has 45 links and the shortest has 31. (Each chain is a list of people: 45 people means 44 parent-to-child steps, and since a grandfather is two steps away with each further step adding a “great”, 44 steps makes a 42-times great-grandfather.)

King Alfred is a 42-times great-grandfather of Queen Elizabeth Ⅱ.

(And also a 28-times great-grandfather. And everything in between.)

Here’s the simplified graph showing only those lines with exactly 45 links.

All the parental chains from Alfred to Elizabeth that have exactly 45 links.

“Let’s talk about sex.”

Earlier, I mentioned being annoyed that some info-boxes listed two parents instead of a mother and a father, requiring me to make assumptions that fathers are more likely to be included and put first, because these are aristocrats and society is quite patriarchal.

I still wanted to data-mine into matrilineal lines, so to check on those assumptions, I pulled out all of the people linked only in a “Parents” line of the info-box and checked they were all in order. The fathers all had manly names and the mothers all had womanly names. Seemed fine. But just to be sure, I queried my data structure for any individual that was listed as both a mother and a father, expecting that to happen from two different children’s pages.

There were several. Not only that, the contradicting links came from the same page. Someone apparently had the same individual as both his father and his mother. Expecting to see the same person linked twice, or a similar variety of quirk, I was surprised to see what should have been a very simple info-box to process.

Screen-shot of info-box for Duke Charles Louis Frederick of Mecklenburg

This person has an info-box with two individuals, each unambiguously listed as Father and Mother. Why was my code somehow interpreting the mother as the same individual as the father?

Investigating, I discovered that not only was Adolphus listed as someone’s mother, but that his actual mother was skipped over entirely. My data-structure simply didn’t have an entry for her.

To try and work out what was going on, I added a conditional breakpoint and watched as my code dutifully added her name to the queue of work, and later on as it was taken off the queue. The code downloaded her page and it disappeared into the parser. Yet the response that came back was that she was already accounted for. I beg to differ!

What I hadn’t done was click on her link. She didn’t have her own page, only a redirect to her husband’s page. Apparently, the only notable thing she had done, according to history, was marry her husband.

I later found a significant number of these links where a woman’s name is just a redirect to her husband’s page. If the patriarchy isn’t going to allow me to rely on Mother/Father links as a sign of an individual’s parental role, investigating matrilineal lines will have to wait.

“We call our show, The Aristocrats!”

Acknowledgements and Notes

If you’d like to do your own analysis, I’ve saved the data I extracted into a JSON file you can download. I make no promises about its accuracy or completeness or indeed anything about the file. I’ve even hidden the word “Rutabaga” in there, just to make it clear how potentially inaccurate it is.

I showed a friend an earlier version of the chart and he wondered if I could do it better in Python. Maybe, but equally maybe not. This isn’t the C# of the early 2000s we’re dealing with. HtmlAgilityPack and LINQ combined can do very clever queries to extract data from web pages, often in single lines of code. Maybe there’s a Python component to do the same, I don’t know.

Rather than install GraphViz myself, I found an online GraphViz service did the job admirably and I’m grateful to them for hosting it. I’m also grateful to my friend Richard Heathfield for telling me about it several decades ago, back when I was thinking about building my own version control system. (Ah, to be young.)

RestSharp is a very nice component for downloading web content for processing. It flattens all the quirks of using the dot-net standard library directly and wraps it all up in a simple and consistent interface.

Oh, and here’s that Wikipedia edit, in all its glory. It was reverted around three minutes later by another editor but never mind.

Update: Hacker News discussion. Also, I am grateful to Denny Vrandečić for his analysis in response to this piece. I’ll be posting a more extensive response to all these soon.

Picture Credits:
📸 “Another batch of klutz” by “makeshiftlove”.
📸 “King Arthur statue in Winchester” by “foundin_a_attic”.
📸 “</patriarchy>” by “Gaelx”.
📸 “Banana Muffins” by Richard Lewis.
📸 “River Seine” by Irene Steeves.

POP3 – The ideas that didn’t make it.

This is part of a series of posts documenting extensions to the POP3 protocol.

I had a few ideas along the way. This post collects some that didn’t quite make it. I present these so the time I spent won’t have been a complete waste. 🐘

Multi-line Response Indication

Good software engineering employs reusable code.

If you’re developing a library to interact with a POP3 service as a client, you’d observe that the protocol operates on an exchange of command and response. This calls for a single function that can be called to send any command to the server and return the response when it arrives. Your function would look like:

var retrResponse = pop3.Command("RETR 4");

Except you can’t do that. There are two distinct classes of response in POP3: one where the response is a single line, and another where the response is multiple lines. If all you have is the first line, you have no clear indication whether that’s the complete response or whether there are more lines coming. You, the developer, need to know what kind of response you’re expecting from the server and have the caller pass that information along.

var retrResponse = pop3.CommandMulti("RETR 4");
var deleResponse = pop3.CommandSingle("DELE 4");

Wouldn’t it be nice if there was a clear and unambiguous way for a server to indicate if there are more lines to follow? That way, client code could have that single function that just knows what to do.

When the client calls CAPA, if the response includes “MULTI-LINE-IND”, the client can tell from the first line what kind of response is coming, because the server is making these promises:

  1. All multi-line responses will have a first line that ends with “ _”.
  2. No single-line response will ever end with “ _”.
C: RETR 1
S: +OK This line ends with an underscore so keep reading. _
(Message goes here.)
S: .
C: DELE 1
S: +OK This line has no underscore so send the next command.
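
With that promise in place, the single function becomes possible. A minimal sketch, assuming hypothetical ReadLine and WriteLine helpers that exchange single lines over the connection:

// One function for every command. The " _" suffix tells us to keep reading.
public List<string> Command(string command)
{
    WriteLine(command);
    var lines = new List<string> { ReadLine() };
    if (lines[0].EndsWith(" _"))
    {
        // Multi-line response promised. Keep reading until the lone dot.
        for (string line = ReadLine(); line != "."; line = ReadLine())
            lines.Add(line.StartsWith(".") ? line.Substring(1) : line);   // Remove dot-padding.
    }
    return lines;
}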

I chose the underscore character as this would technically be encroaching into the human-readable section of a response, so it would need to be ignorable by any humans passing by. I had flirted with using “…” as the indicator, since it could blend into the text naturally, but that might not work for all languages. My inclination was to keep the indicator as small as possible when displayed, printable ASCII, but also unlikely to appear at the end of an English sentence.

The first issue I stumbled upon with this idea was that the underscore character could be included in the set of possible unique-ids. The command UIDL (n) returns a single-line response with the message’s unique-id on the end. Any servers implementing this idea would have to exclude underscores from their unique-ids.

The final nail in the coffin was when I took a step back and thought about the developers of POP3 client libraries. Would they make use of my extension?

No. Servers not implementing my new extension will still exist for a long time and people will still want to connect to those servers. As such, client libraries are still going to be passing a flag down to their command/response layer, indicating if the response is going to be multi-line or not. I won’t have saved the developer any effort.

My example implementation still includes this extension, for now.

Wait, there’s more!

Keepa your CAPA!

One thing that bothered me about reconnecting to a POP3 service was the necessity to call CAPA on every reconnection. Each time, the server would send the same response back. Wouldn’t it be nice if the client could store the response once and have some sort of notification when it needed to be refreshed?

My idea was to use the banner that the service sends immediately on connection, together with a new CAPA response.

S: +OK Welcome to my POP3 service version 1.2.3.
C: CAPA
S: +OK Capabilities follow...
S: CAPA-VERSION 1.2.3.
S: UIDL
S: .

Because the CAPA response included this capability, the next time the client connected, it could look in the connection banner and see the token “1.2.3.” included. With this, the client is assured that the response from last time is still good and need not ask for it again.

Even if the banner changed (which it would if it implemented APOP), so long as this one token was included, the response is still good.
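
On the client side, the check is cheap. A sketch, where cachedToken and cachedCapabilities are hypothetical state saved from the previous connection and SendCapaAndCacheResult is a hypothetical helper:

// On reconnection, reuse the stored CAPA response if the banner
// still contains the version token we saw last time.
string banner = ReadLine();
if (cachedToken != null && banner.Contains(cachedToken))
    capabilities = cachedCapabilities;       // Token present. Last response still good.
else
    capabilities = SendCapaAndCacheResult(); // Token missing. Ask again, update the cache.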

I saw a problem: the response to a CAPA command might change in the course of a connection. The capabilities might be different after going through TLS. Different users might have different capabilities that only reveal themselves once you’ve logged in. Should the response to the PASS command also include a version token? What about other ways of logging in?

I tried to come up with RFC style wording that would have placed different CAPA responses into different domains, but it all got too complicated and too prone to error.

And for what? To save having to reissue a single command and brief response on top of all the TCP and TLS handshakes? Not worth it.

My demonstration implementation includes this capability.

Safe!

Pick up from where we left off.

The RETR command allows the client to download a complete message, while the TOP command allows it to download just the top part of a message. There’s a gap for a command that retrieves the end of a message. If you’ve downloaded part of a message but the connection broke, you could use this to resume from where you left off.

The END command would have worked in the same way as TOP. You select a message and the line you want to start from, and the server would continue from that point.
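
An exchange might have looked something like this (hypothetical syntax, mirroring TOP’s message-number and line-number parameters):

C: END 4 1500
S: +OK Message 4, resuming from line 1500.
(Remainder of message goes here.)
S: .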

Selecting a line rather than a byte index was necessary. POP3 is a line-orientated protocol and any new commands would need to work within that restriction. If you selected a byte index instead, what if you wanted to start from the middle of a CRLF? What if the selected starting byte is a dot that was in the middle of a line; should it now be dot-padded?

I told myself that servers would need to keep track of what byte indexes each line started at, but that was an unsatisfactory answer.

Around the same time, I had another idea. Email files are highly compressible thanks to their large blocks of base-64. I pictured an alternative form of RETR (RETZ?) that would return the +OK line normally, but the multi-line response would be inside a new GZIP stream. Inside that stream, the CRLF line-endings and dot-padding would still be present. A lone dot would complete the message as normal as the GZIP stream concludes, and the normal uncompressed exchange of commands and responses continues.

It was at this point that I remembered there’s already a very well-established protocol for downloading large files with compression, chunking, resumption and all that good stuff: HTTP. I’m still thinking about how that could work in practice, but that road seems a lot more productive than trying to force HTTP into a POP3-shaped hole.

Picture Credits. (CC)
📷 “More Selvatiche” by Cristina Sanvito.
📷 “Fun with Cling Film” by Elizabeth Gomm.

Farewell, Hackensplat Industries!

In 2009, I registered “hackensplat.com”. A friend of mine called me “Wilhelm von Hackensplat” as a joke after my rather loud sneezes. It was around this time I decided to start writing about software development and technology. I liked the idea of having an alter ego. I pictured him as an evil genius, Baron von Hackensplat, and so hackensplat.com was born: an evil genius writing about his evil technologies.

That was the idea, but it never really took off in my mind. I would have an idea to write about something, but it wouldn’t really lend itself to the evil-genius persona. As time passed I got bored with the alter ego and gave up writing for the character, instead just writing as myself. I later changed the name of the website to “Hackensplat Industries”, mainly so I could keep the name.

Even more time passed and I wrote a new piece that I wanted to show a friend. I read out the address as “hackensplat dot com”. My heart sank as the response came back “How do you spell that?”. A question I had been asked too many times before.

I almost registered hackandsplat.com as a redirect, but frankly I was over it. One of the reasons I was writing was to gain a little professional exposure but this other name was just getting in the way. I made the decision and started moving all my published posts to billpg.com, a domain I had previously used as my strictly personal website, distinct from my professional site. There wasn’t anything on my personal domain other than a collection of social media links anyway.

I don’t know how long I’ll keep the old domain, which now only has a set of redirects. It expires in November this year so I suspect I’ll be spending a little bit of October looking at access logs. Equally likely is that I’ll completely forget and it’ll automatically renew anyway.

Welcome to billpg industries.