What type of UUID should I use?

UUIDs, Universally Unique IDs, are handy 128-bit IDs. Their values are unique, universally, hence the name.

(If you work with Microsoft, you call them GUIDs. I do primarily think of them as GUIDs, but I’m going to stick with calling them UUIDs for this article, as I think that name is more common.)

These are useful for IDs. Thanks to their universal uniqueness, you could have a distributed set of machines, each producing its own IDs with no co-ordination necessary, even completely disconnected from each other, without worrying about any of those IDs colliding.

When you look at a UUID value, it will usually be expressed in hex and (because reasons) in hyphen-separated groups of 8-4-4-4-12 digits.

017f22e2-79b0-7cc3-98c4-dc0c0c07398f

You can tell which type of UUID it is by looking at the first digit of the middle four-digit block, the 7 in the example above. That digit always tells you which type of UUID you’re looking at. This one is a type 7 because that hex digit is a 7. If it were a 4, it would be a type 4.

As I write this, there are 8 types to choose from. But which type should you use? Type 7. Use type 7. If that’s all you came for, you can stop here. You ain’t going to need the others.

Type 7 – The one you actually want.

This type of UUID was designed for assigning IDs to records on database tables.

The main thing about type 7 is that the first block of bits is a timestamp. Since time always goes forward [citation needed] and the timestamp is right at the front, each UUID you generate will have a bigger value than the last one.

This is important for databases, as they are optimized for “ordered” IDs like this. To oversimplify, each database table has an index tracking each record by its ID, allowing any particular record to be located quickly, like flipping through a book until you get close to the one you want. The simplest place to add a new ID is at the end, and you can only do that if your new ID comes after all the previous ones. Adding a new record anywhere else will require that index to be reorganised to make space for the new one in the middle.

(You often see UUIDs criticised for being random and unordered, but that’s type 4. Don’t use type 4.)

The timestamp is 48 bits long and counts the number of milliseconds since the year 1970. This means we’re good until shortly after the year 10,000. Aside from the 6 bits which are always fixed, the remaining 74 bits are randomness, there so that all the UUIDs created in the same millisecond will be different. (Except it is a little more complicated than that. Read the RFC.)
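
Dot-net 9 added Guid.CreateVersion7() to generate these for you, but the layout is simple enough to sketch by hand. Here’s a minimal sketch; it skips the RFC’s extra monotonicity refinements and assumes dot-net 8 for the big-endian Guid constructor.

static Guid NewUuid7()
{
    var bytes = new byte[16];

    // 48-bit timestamp: milliseconds since 1970, most significant byte first.
    long ms = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
    for (int i = 0; i < 6; i++)
        bytes[i] = (byte)(ms >> (8 * (5 - i)));

    // Fill the remaining ten bytes with cryptographic randomness...
    System.Security.Cryptography.RandomNumberGenerator.Fill(bytes.AsSpan(6));

    // ...then overwrite the six fixed bits. (Version 7, variant "10".)
    bytes[6] = (byte)((bytes[6] & 0x0F) | 0x70);
    bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);

    return new Guid(bytes, bigEndian: true);
}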

So there we are. Type 7 UUIDs rule, all other types drool. We done?

“I was born in a flame. Mama said that everyone would know my name. I’m the best you’ve ever had. If you think I’m burning out, I never am.”

Migrating from auto-incrementing IDs.

Suppose you have an established table with a 32-bit auto-incrementing integer primary key. You want to migrate to type 7 UUIDs but you still need to keep the old IDs working. A user might come along with a legacy integer ID and you still want to allow that request to keep working as it did before.

You could create a batch of new type 7 UUIDs and build a new table that maps the legacy integer IDs to their new UUIDs. If that works for you, that’s great, but we can do without that table with a little bit of cleverness.

Let’s think about our requirements:

  1. We want to deterministically convert a legacy ID into its UUID.
  2. These UUIDs are in the same order as the original legacy IDs.
  3. New records’ UUIDs come after all the UUIDs for legacy records.
  4. We maintain the “universally unique”-ness of the IDs.

This is where we introduce type 8 UUIDs. The only rule of this type is that there are no rules. (Except they still have to be 128 bits and six of those bits must have fixed values. Okay, there are a few rules.) It is up to you how you construct this type of UUID.

Given our requirements, let’s sketch out how we want to lay out the bits of these IDs.

The type 7 UUIDs all start with a 01 byte, until 2039 when they will start with an 02. They won’t ever start with a 00 byte. So to ensure these IDs always come before any new IDs, we’ll make the first four hex digits (the first two bytes) all zeros. The legacy 32-bit integer ID can be the next four bytes.

Because we want the UUIDs we create to be both deterministic and universally-unique, the remaining bits need to look random without actually being random. Running a hash function over the ID and a fixed salt string will produce enough output to fill those remaining bits.

Now, to convert a legacy 32-bit ID into its equivalent UUID, we do the following:

  1. Start an array of bytes with two zero bytes.
  2. Append the four bytes of the legacy ID, most significant byte first.
  3. Find the SHA of (“salt” + legacy ID) and append the first 10 bytes of the hash to the array.
  4. Overwrite the six fixed bits (in the hash area) to their required values.
  5. Put the 16 bytes you’ve collected into a UUID type.

And there we have it. When a user arrives with a legacy ID, we can deterministically turn it into its UUID without needing a mapping table or conversion service. Because of the initial zero bytes, these UUIDs will always come before the new type 7 UUIDs. Because the legacy ID bytes come next, the new UUIDs will maintain the same order as the legacy IDs. Because 74 bits come from a hash function with a salt as part of its input, universal-uniqueness is maintained.
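
Here are those five steps as code. This is a minimal sketch, not gospel: I’m assuming SHA-256 as the hash, dot-net 8 for the big-endian Guid constructor, and “my-salt-string” is a placeholder for whatever fixed salt you choose.

static Guid LegacyIdToUuid(uint legacyId)
{
    var bytes = new byte[16];

    // Bytes 0-1 stay zero, so these always sort before any type 7 UUID.
    // Bytes 2-5 are the legacy ID, most significant byte first.
    bytes[2] = (byte)(legacyId >> 24);
    bytes[3] = (byte)(legacyId >> 16);
    bytes[4] = (byte)(legacyId >> 8);
    bytes[5] = (byte)legacyId;

    // Bytes 6-15 come from the hash of the salt and the legacy ID.
    byte[] hash = System.Security.Cryptography.SHA256.HashData(
        System.Text.Encoding.UTF8.GetBytes("my-salt-string" + legacyId));
    Array.Copy(hash, 0, bytes, 6, 10);

    // Overwrite the six fixed bits. (Version 8, variant "10".)
    bytes[6] = (byte)((bytes[6] & 0x0F) | 0x80);
    bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);

    return new Guid(bytes, bigEndian: true);
}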

What’s that? You need deterministic UUIDs but it isn’t as simple as dropping the bytes into place?

“You once thought of me as a white knight on his steed. Now you know how happy I can be.”

Deterministic UUIDs – Types 3 and 5.

These two types of UUID are the official deterministic types. If you have (say) a URL and you want to produce a UUID that represents that URL, these UUID types will do it. As long as you’re consistent with capital letters and character encoding, the same URL will always produce the same UUID.

The down-side of these types is that the UUID values don’t even try to be ordered, which is why I wrote the discussion of type 8 first. If the ordering of IDs is important, such as using them as primary keys, maybe think about doing it a different way.

Generation of these UUIDs works by hashing together a “namespace” UUID and the string you want to convert into a UUID. The hash algorithm is MD5 for type 3 or SHA1 for type 5. (In the case of SHA1, everything after the first 128 bits of hash is discarded.)

To use these UUIDs, suppose a user makes a request with a string value. You can turn that string into a deterministic UUID by running it through the generator function. That function has two parameters, a namespace UUID (which could be a standard namespace or one you’ve invented) and the string to convert. It runs the hash function over the input and returns the result as a UUID.
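
Dot-net doesn’t ship a generator for these, but the recipe is short. Here’s a sketch; the namespace constant is the standard URL namespace from the RFC, and as before I’m assuming dot-net 8 for the big-endian Guid conversions.

using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class NameBasedUuid
{
    // The standard namespace UUID for URLs, from RFC 4122.
    public static readonly Guid UrlNamespace =
        new("6ba7b811-9dad-11d1-80b4-00c04fd430c8");

    public static Guid Type5(Guid namespaceId, string name)
    {
        // Hash the namespace UUID (as big-endian bytes) followed by the
        // UTF-8 bytes of the name, keeping the first 16 bytes of the hash.
        byte[] input = namespaceId.ToByteArray(bigEndian: true)
            .Concat(Encoding.UTF8.GetBytes(name))
            .ToArray();
        byte[] uuid = SHA1.HashData(input)[..16];

        // Overwrite the six fixed bits. (Version 5, variant "10".)
        uuid[6] = (byte)((uuid[6] & 0x0F) | 0x50);
        uuid[8] = (byte)((uuid[8] & 0x3F) | 0x80);
        return new Guid(uuid, bigEndian: true);
    }
}

Calling NameBasedUuid.Type5(NameBasedUuid.UrlNamespace, "https://example.com/") returns the same UUID every time.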

These UUID types do the job they’re designed to do. Just as long as you’re okay with the values not being ordered.

Type 3 (MD5) or Type 5 (SHA1)?

There are pros and cons to each one.

MD5 is faster than SHA1. If you’re producing them in bulk, that may be a consideration.

MD5 is known to be vulnerable to collisions. If you have (say) a URL that hashes to a particular type 3 UUID, someone could construct a different URL that hashes to the same UUID. Is that a problem? If you’re the only one building these URLs that get hashed, then a hypothetical doer of evil isn’t going to get to have their bad URL injected in.

Remember, the point of a UUID is to be an ID, not something that security should be depending upon. Even the type 5 UUID throws away a big chunk of the bits produced, leaving only 122 bits behind.

If you want to hash something for security, use SHA256 or SHA3 and keep all the bits. Don’t use UUID as a convenient hashing function. That’s not what it’s for!

On balance, I would pick type 5. While type 3 is faster, the difference is trivial unless you’re producing IDs in bulk. You might think that MD5 collisions are impossible with the range of inputs you’re working with, but are you quite sure?

“I’ve seen this thing before, in my best friend and the boy next door. Fool for love and fool on fire.”

Type 4 – The elephant in the room

A type 4 UUID is one generated from 122 bits of cryptographic quality randomness. Almost all UUIDs you see out there will be of this type.

Don’t use these any more. Use type 7. If you’re the developer of a library that generates type 4 UUIDs, please switch it to generating type 7s instead.

Seriously, I looked for practical use cases for type 4 UUIDs. Everything I could come up with was either better served by type 7, or both types came out the same. I could not come up with a use-case where type 4 was actually better. (Please leave a comment if you have one.)

Except I did think of a couple of use-cases, but even then, you still don’t want to use type 4 UUIDs.

Don’t use UUIDs as secure tokens.

You shouldn’t use UUIDs as security tokens. They are designed to be IDs. If you want a security token, you almost certainly have a library that will produce one for you. Indeed, the library that produces your type 4 UUIDs will be using such a secure random source internally.

When you generate a type 4 UUID, six bits of randomness are thrown away in order to make it a valid UUID. It takes up the space of a 128 bit token but only has 122 bits of randomness.

Also, you’re stuck with those 122 bits. If you want more, you’d have to start joining them together. And you should want more – 256 bits is a common standard length for a reason.
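
On dot-net, for example, a proper 256-bit token is a couple of lines against the standard library’s cryptographic random source:

// A 256-bit security token, with no bits sacrificed to UUID formatting.
byte[] tokenBytes = System.Security.Cryptography.RandomNumberGenerator.GetBytes(32);
string token = Convert.ToBase64String(tokenBytes);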

But most of all, there’s a risk that whoever wrote the library that generates your UUIDs will read this article and push out a new version that generates type 7 UUIDs instead. Those do an even worse job of being security tokens.

I’m sure they’d mention it in that library’s release notes but are you going to remember this detail? You just want to update this one library because a dependency needs the new version. You tested the new version and it all works fine but suddenly your service is producing really insecure tokens.

Maybe the developers of UUID libraries wouldn’t do that, precisely because of the possibility of misuse, but that’s even more reason to not use UUIDs as security tokens. We’re holding back progress!

In Conclusion…

Use type 7 UUIDs.

“Only to find the night-watchman, unaware of his presence in the building.”

Picture Credits.
📸 “Night Ranger…” by Doug Bowman. (Creative Commons)
📸 “Cat” by Adrian Scottow. (Creative Commons)
📸 “Cat-36” by Lynn Chan. (Creative Commons)
📸 “A random landscape on a random day” by Ivo Haerma (Creative Commons)
📸 “Elena” by my anonymous wife. (With Permission)

I want a less powerful programming language for Christmas.

I’m writing this because I’m hoping someone will respond, telling me that what I want already exists. I have a specific itch and my suspicion is that developing a whole programming language and runtime is the only way to scratch that itch.

Please tell me I’m wrong.

Dear Father Christmas…

If you’ve ever written a web service, you’ve almost certainly had situations where you’ve taken a bunch of bytes from a completely untrusted stranger and passed those bytes into a JSON parser. What’s more, you’ll have done that without validating the bytes first.

Processing your inputs without sanitizing them first? Has Bobby Tables taught us nothing?

You can do this safely because that JSON parser will have been designed to be used in this manner and will be safe in the face of hostile inputs. If you did try feeding the bytes of an EXE file into a JSON parser, it’ll very quickly reject it complaining that “MZ” isn’t an opening brace and refuse to continue beyond that. The worst a hostile user could do is put rude messages inside the JSON strings.

{ "You": "A complete \uD83D\uDC18 head!" }

Now take that idea and imagine you had a web service where completely unauthenticated users could send any request body they liked and your service would run that request body through a copy of Python as the program source code.

Hopefully, you’ve just now remarked that it would be a very bad idea, up there with Napoleon’s idea to make his brother the King of Spain. But that’s exactly what I want to do. I want to write a web service that accepts Python code from complete strangers and actually run that code.

(And also make my brother the King of Spain. He’d be great!)

“Hang on to your hopes, my friend. That’s an easy thing to say. But if your hopes should pass away, simply pretend that you can build them again.”

At the gates of dawn

Some time in the early 90s, I had a game called “C Robots”.

This is a game where four tanks are in an arena, driving around and firing missiles at each other. But instead of humans controlling those tanks, each tank was controlled by a program written by the human player. The game controller would keep track of each tank and any missiles in flight, passing control back to each tank’s controller program to let it decide what its next move would be.

For 90s me, programming a robot appealed, but the tank battle part not so much. I really wanted to make a robot to play other games that might not involve tanks. At the time, there were two games I enjoyed playing with school friends, Dots-and-Boxes and Rummy. I had an idea of what made good strategies for these specific games, so I thought building those strategies into code might make for a good intellectual exercise.

Decades passed and I built a simple game controller system which I (rather pompously) called “Tourk”. I had made a start on the controllers for a handful of games but I hadn’t gotten around to writing actual competitive players, only simple random ones that were good for testing. I imagined that before long, people would write their own players, send them in to me and I’d compile them all together. After I’d let it run for a million games in a tournament, I’d announce the winner.

If anyone had actually written a player and sent it in, my first step would have been to inspect the submitted code thoroughly. These would have been actual C programs and could have done anything a C program could do, including dropping viruses on my hard disk, so inspecting that code would have been very important. Looking back, I’m glad no-one actually did that.

But this was one thing C Robots got right, even if it wasn’t planned that way. Once it compiled the player’s C code, it would run that code in a restricted runtime. Your player code could never go outside its bounds because there are no instructions in the C Robots runtime to do that. This meant that no-one could use this as an attack vector. (But don’t quote me on that. I’ve not actually audited the code.)

“I never ever ask where do you go. I never ever ask what do you do. I never ever ask what’s in your mind. I never ever ask if you’ll be mine.”

Will the runtime do it?

Could maybe the dot-net runtime or the Python runtime have the answer?

This was one of the first questions I asked on the (then) new Stack Overflow. The answer sent me to Microsoft’s page on “Code Access Security” and if you follow that link now, it says this feature is no longer supported.

Wondering more recently if Python might have an option to do what I wanted, I asked on Hacker News if there was a way to run Python in the way I wanted. There were a few comments but it didn’t get enough up-votes and disappeared fairly quickly. What little discussion we had was more to do with a side issue than the actual question I was asking.

I do feel that the answer might still be here. There’s quite possibly some flag on the runtime that will make any call to an extern function impossible. The Python runtime without the “os” package would seem to get 90% of the way there, but I don’t know enough about it to be certain that this won’t leave any holes open.

“We’re all someone’s daughter. We’re all someone’s son.”

Sanitize your inputs?

Maybe I should listen to Bobby Tables and sanitize my inputs before running them.

Keep the unrestricted runtime, but before we invoke it to run the potentially hostile code, scan it to check it won’t do any bad things.

Simple arithmetic in a loop? That’s fine.
Running a remote access trojan? No.

Once it has passed the test, you should be able to allow the code to run, confident it won’t do anything bad because you’ve already checked. This approach appeals to me because once that initial test has passed the code as non-hostile, we can allow the runtime to go at full speed.

The problem with this approach is all the edge cases and finding that line between simple arithmetic and remote-access-trojans. You need to allow enough for the actually-not-hostile code to do useful things, but not enough that a hostile user could exploit it.

Joining strings together is fine but passing that string into eval is not.
Writing text to stdout is fine but writing into a network socket is not.

Finding that line is going to be difficult. The best approach would be to start with nothing-is-allowed, but when considering what to add, first investigate what would be possible by adding that facility to the allowed list. Because it can be used for bad things, eval would never be on that allowed list.

If there’s a function with a million useful things it can do but one bad thing, that function must never be allowed.
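
To make the shape of this concrete, here’s a sketch of a default-deny scanner. It checks C# rather than Python (C# being the language I use elsewhere), leaning on the Roslyn parser, and a real checker would need to be far more thorough than this.

using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Default-deny: every identifier in the submitted code must appear on
// the allowed list, so nothing like "eval" ever gets a chance to run.
static bool LooksHarmless(string sourceCode, ISet<string> allowedNames)
    => CSharpSyntaxTree.ParseText(sourceCode)
        .GetRoot()
        .DescendantNodes()
        .OfType<IdentifierNameSyntax>()
        .All(id => allowedNames.Contains(id.Identifier.Text));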

“We can go where we want to. A place they’ll never find. We can act like we come from out of this world and leave the real one far behind.”

Ask the Operating System?

I told a colleague about this post while I was still writing it and he mentioned that operating systems can place restrictions on the programs they run. He showed me his Mac and there was a utility that listed all the apps he was running and the permissions each one had. It reminded me that my Android phone does something similar. If an app wants to interact with anything outside its realm, it has to ask first. This is why I’m happy to install apps on my Android phone but not on my Windows laptop.

This would be great, but how do I, a numpty developer, harness this power? What do I do if I want to launch a process (such as the Python runtime) but with all the permissions turned off? It feels like this will be the solution but my searching isn’t coming up with a practical answer.

My hope is that there’s a code library whose job it is to launch processes in this super restricted mode. It’ll work out which OS it is running on, do the necessary magic OS calls and finally launch the process in that super-restricted mode.

“If I was an astronaut I’d be floating in mid air. A broken heart would just belong to someone else down there. I would be the centre of my lonely universe. I’m only human and I’m crashing in the dark.”

Mmmm coffee!

The good people developing web browsers back in the 90s had the same need as me. They wanted to add a little interactivity to web pages, but without having to wait for a round trip back to the server over dialup, so they came up with a language they named JS.

As you read this page, your browser is running some code I supplied to you. That code can’t open up your files on your local device. If anyone did actually find a way to do that, the browser developers would call that a serious bug and push out an emergency update. So could JS be the solution I’m looking for?

As much as it sounds perfect, that JS runtime is inside the browser. If I have some JS code in my server process, how do I get that code into a browser process? Can I even run a web browser on a server without some sort of desktop environment?

The only project I know of where someone has taken JS outside of a browser is node-js. That might be the answer but I have written programs using node-js that load and save files. If this is the answer then I’d need to know how to configure the runtime to run the way I want.

“Play the game, fight the fight, but what’s the point on a beautiful night? Arm in arm, hand in hand. We all stand together.”

Is there an answer?

I began this post expressing my suspicion that the solution is to write my own runtime, designed from first-principles to run in a default-deny mode. I still wonder if that’s the case. I hope someone will read this post and maybe comment with the unknown option on the Python runtime that does exactly what I want.

In the meantime, I have another post in the works with my thoughts on how this runtime and programming language could work. I hope I can skip it.

Gronda-Gronda.

Picture Credits
📸 “Snow Scot” by Peeja. (With permission.)
📸 “Meeting a Robot” by my anonymous wife. (With permission)
📸 “Great Dane floppy ears” by Sheila Sund. (Creative Commons)
📸 “Fun with cling film” by Elizabeth Gomm. (Creative Commons)
📸 “Rutabaga Ball 2” by Terrence McNally. (Creative Commons)
📸 “Nice day for blowing the cobwebs off” by Jurassic Snark. (With permission.)

(And just in case advocating for your brother to be made King of Spain is treason or something, I don’t actually want to do that. It was a joke.)

My adventure into self web-hosting (Part 1)

If you had asked twenty-something me how he thought forty-something me would be hosting his website, he’d have predicted I had a rack of small servers in my attic, as part of a grid-computing business. (That’s what we called “cloud” computing back then.)

He’d have been disappointed to find out I’m using a shared web-hosting service, but that may change.

“The end of the day, remember the way, we stayed so close to the end, we’ll remember it was me and you ’cause we are gonna be…”

Over the Cliff

It all started when my article, Data-Mining Wikipedia for Fun and Profit made it to the top of Hacker News and stayed there for three hours. I was careful to try to not overburden the system by switching on an HTML cache. This way, visitors would only be served up static files without the server having to run the PHP code or talk to the database. Despite that, the server went down and I had to post a sheepish comment with a link to a mirror.

It was clear I was out-growing my current web-host. Despite my precautions, it couldn’t handle being popular for a few hours. Not only that, I’m a software developer and I wanted to develop software. The only practical choice on this service was PHP and I had long decided that life was too short for that.

I started looking at VM services as the natural next step on the ladder, but it was a chance discussion, again on Hacker News, that gave me an idea.

Clifford Stoll: “a heavy load on my raspberry-pi web server told me something was happening…”
Me: “your web server is a Raspberry PI, and it’s holding up while being on the HN front page?”
CS: “Hi Bill, Yep. Cloudflare is out front, so the actual load on the rasp-pi is mitigated by their content-delivery network.”

Suddenly, the idea of hosting a web server in my attic became real again. Reality had long since taught me that residential ISPs were no good for serious web hosting – but if there was a service that could deal with the bulk of GET requests and it could cover the occasional outage on my side from its cache, that’d change everything.

“Can you deal with my GET requests?”

Tunnelling

At the time, that Raspberry-Pi web server was on his residential ISP with a public IP address. That arrangement wouldn’t work for me as my own ISP didn’t allow their customers to run services like that. However, in that same comment thread, the very CTO of Cloudflare (John Graham-Cumming) mentioned to him that they had a new service that allowed their customers to VPN out to Cloudflare, making such port-forwarding shenanigans a thing of the past.

(As a not-quite declaration of bias, Cloudflare are on my list of companies I would like to work for should my current day-job come to an end. I am not (yet) an employee of Cloudflare and they’re not paying me to write this in any case. By the time you come to read this, that might have changed.)

This service is completely free. While I like not having to pay for things, it does make me a little nervous. This particular service isn’t going to be injecting ads into my site and I do understand how the free tier fits into their business model. But still, I’ve been burnt by free services suddenly disappearing before and you get no sympathy if you’ve become dependent on them. I kind of wish I could give them a few pounds each month, just in case.

Leaving such concerns to one side, I had a plan. Acquire a server and install it into one of the slots on my IKEA KALLAX unit the TV is sitting on. Plug it into my ISP’s router and once that’s running, install a web server along with the VPN software. I’ll finally be in charge of my very own web server, just like the twenty-something me thought I’d be.

“If I get to know your name, well I could trace your private number, baby. All I know is that to me, you look like you’re lots of fun. Open up your loving arms, I want some, want some. You spin me right round, baby, right round, like a record, baby, right round…”

Quiet!

I had acquired a second-hand PC for this purpose but once I got it home it was way too noisy. I needed a machine I could leave switched on 24/7 in the lounge where we watch TV. My server would have to be really quiet.

I also considered a Raspberry Pi, the same hardware Clifford Stoll used, but I wasn’t going to only be running a few WordPress instances. I had an idea I wanted to develop and I’d need a database with plenty of space for that to work. An SD card and maybe some USB storage wouldn’t cut it.

I’m not in any particular hurry to buy it as I still want to plan some more before the new machine starts taking up space. It was while I was reading reviews for various machines that I had the craziest of crazy ideas.

“And as we sit here alone, looking for a reason to go on. It’s so clear that all we have now are our thoughts of yesterday. La, la la la…”

It comes with Windows

Any PC I could buy is going to come with Windows pre-installed and fully licensed. I was always going to replace it with a variety of Linux, but I wondered, why not keep the copy of Windows?

Before you all think I’ve gone insane, there are a few benefits to doing it this way. I use Windows a lot for my day job so I’m familiar with its quirks and gotchas. Even though there’s a dot-net for Linux, my development machine runs Windows, so there would be fewer surprises if the production machine ran the same OS as the development machine. For the handful of WordPress sites I wanted to run, there were docker images available. Finally, because it won’t be directly connected to the scary internet, I wouldn’t have to panic when there’s an update.

But even as I’m writing this, I feel I’m going to regret doing it this way. I just know I’ll be writing part six of this series and it’ll be all about installing Linux on that server machine because there’s just one stupid thing I couldn’t get working on Windows. We shall see.

A foreshadowing?

Join me for part 2 of this series, where I’ll be experimenting with getting WordPress running from a Docker container. Wish me luck.

Picture Credits:
📸 “Kee-kaws”, by me.
📸 “Duke”, by my anonymous wife.
📸 “Haven Seafront, Great Yarmouth”, by me.
📸 “Quiet Couple” by Judith Jackson. (CC)
📸 “Blisworth Canal Festival, 2019”, by me.

Data-Mining Wikipedia for Fun and Profit

It all started after watching one too many videos narrating the English monarchy, all starting from King William Ⅰ in 1066 as if he’s the first king of England. This annoys me as it completely disregards the handful of Anglo-Saxon kings of England who reigned before the Normans.

They’re Kings of England. If you’re going to make a list of the Kings of England, then you should include the Kings of England.

It was this that made me want to make a particular edit to both the King Alfred and Queen Elizabeth pages on Wikipedia, acknowledging each as related to the other. But what is their relationship, and through whom?

I went to the page for Queen Elizabeth Ⅱ and started following the Mother/Father links until I found my way to King Alfred, mostly going through the other kings of England. I counted 36 generations, but was there a shorter or even longer route?

Sounds like a job for some software!

Gâteau Brûlé.

Scanning Wikipedia

We have the technology.

  • Visual Studio 2019 and C#.
  • RestSharp, a library for downloading HTML.
  • HtmlAgilityPack, a library for parsing and extracting data from HTML.

With these libraries downloaded from nuget, I was able to write some very quick and dirty code that would download the HTML for the Wikipedia page of Queen Elizabeth II, storing the HTML in a cache folder to save re-downloading it.

Once the HTML is downloaded (or read from the cache), HtmlAgilityPack can be called upon for the task of pulling items of data from the HTML. For example, the person’s full name, which is always the page’s only <H1>…</H1> element, can be extracted using one line of code:

// "html" is the page's HtmlAgilityPack document, loaded from the
// downloaded (or cached) HTML.
string personName =
    html
    .DocumentNode
    .Descendants()
    .Where(h => h.Name == "h1")
    .Single()
    .InnerText;

I used HtmlAgilityPack and LINQ in a similar way to pull out the Mother and Father for each person. The code would look for the info-box <TABLE>, then look inside for a <TH> with the text “Mother” or “Father”. It would then take a few steps backwards to look for the <TR> that the text is a part of and finally pull out all the links it can find inside.
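
My actual code was quick and dirty, but the shape of that query was something like this sketch. (The “infobox” class is how Wikipedia marks those tables; take the details with a pinch of salt.)

// Find the info-box table, find the <TH> reading "Mother", step back
// to its <TR>, then collect the links inside that row.
var motherLinks =
    html
    .DocumentNode
    .Descendants("table")
    .Where(t => t.GetAttributeValue("class", "").Contains("infobox"))
    .SelectMany(t => t.Descendants("th"))
    .Where(th => th.InnerText.Trim() == "Mother")
    .SelectMany(th => th.Ancestors("tr").First().Descendants("a"))
    .Select(a => a.GetAttributeValue("href", ""))
    .ToList();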

With the links to Queen Elizabeth’s mother and father, the code would add those links to a queue and the top-level would pull the next link and continue until the links run out.

Calm down!

This section was added after initial publication.

I would hope that people don’t need to be told to be considerate, but please be considerate.

Before I started on this project, I checked Wikipedia’s robots.txt file. This told me that my project was acceptable, quoth: “Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please.”

The article pages were exactly what I wanted. My code was already fairly low speed as it was all in a single thread. Nonetheless, I added a short delay after each download once I had worked the kinks out. I also set the User-Agent text to include my email address and phone number so Wikipedia server admins could raise an alarm with me personally if necessary.

As I was running my code in Visual Studio’s debug mode, I could leave the code running unattended (once I had observed it over the first hundred or so) with some breakpoints to stop everything until I could return to inspect what happened.

The most important breakpoints were on the examination of the response from Wikipedia. If the response was anything other than a 200/OK response (after redirects) or anything other than HTML, I wanted my code to stop dead until I could inspect what happened. Even if it happened overnight, I still want that response object in memory.

In the end, the bulk of the download took two days in a number of bursts. I’ll be sending a modest donation to the Wikimedia Foundation in thanks for accommodating my bizarre projects.

“She’s just a girl who says that I am the one…”

I made the decision here to only include people with an info-box. Extracting someone’s parents from free English text was a step too far. If you’re not notable enough to have an info-box with your parents listed, you’re not notable enough for this project. (Although I did find a couple of people who didn’t have a suitable info-box surprisingly early in the process. Rather than hack in an exception, I edited Wikipedia to include those people’s parents in their info-box, copying the link from elsewhere in the text.)

While that got me out of a small hole, more annoying was when the info-box listed “Parents” or “Parent(s)” instead of Mother and Father. I wanted to track matrilineal and patrilineal lines, so it was a little annoying to just have an individual’s parents with no clear indication of which one is which. I coded it so that if there’s only one link, assume it is the father. If there are two links, assume the father is the first one.

Because patriarchy.

“Also known as…”

Another issue was that some of the pages changed names. RestSharp would dutifully follow HTTP redirects, but I’d end up storing a page with one name but having a different name internally. This happened right away as the page for Queen Elizabeth links to her mother as “Elizabeth_Bowes-Lyon“, but once you follow the link, you end up at “Queen_Elizabeth_The_Queen_Mother“.

The HTML included a <LINK> tag named the “canonical reference”, so I could pull that out and use it as the primary key in my data structure. To keep the link between child and parent, it collects the aliases as they are detected and a quick reconciliation loop corrects the links after the initial loop completes.
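
Pulling that canonical link out was another short HtmlAgilityPack query, along these lines (a sketch, as before):

// Extract the canonical URL from the page's <LINK rel="canonical"> tag.
string canonicalUrl =
    html
    .DocumentNode
    .Descendants("link")
    .Where(n => n.GetAttributeValue("rel", "") == "canonical")
    .Select(n => n.GetAttributeValue("href", ""))
    .Single();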

King Alfred, also known as The Muffin Man.

From Alfred to Elizabeth.

Once I had a complete set of Wikipedia pages cached, the next step was to build a tree with all of the parental connections that lead from King Alfred to Queen Elizabeth. I knew that some non-people had crept in because someone’s parents would be listed as “(name) of (town)”, but that didn’t bother me as those towns wouldn’t have a mother or father listed and those loose ends would be discarded.

I wrote some code to walk the tree of connections. It started from Queen Elizabeth and recursively walked to each mother and father node. If a chain ended on King Alfred, the complete chain would be added to the list of chains.

With this reduced set in place, I churned through the nodes and generated a GraphViz file. For those who don’t know about it, this is an app for producing graphs of connected bubbles. You tell it what bubbles you want and how they are connected, and it automatically lays them out.

At this point, I was expecting a graph that would be mainly tall and thin and it would appear right here in this article. While family trees do grow exponentially, I wasn’t including every single relationship, only those that connect both of two individuals. If I were graphing the relationships between myself and a distant ancestor, I’d expect a single line, each parent handing over to their child. There would be a few bulges when third-or-so cousins marry. There, an individual’s two children would split off into separate lines, eventually reuniting with one ever-so-slightly inbred individual.

Yeah, that’s not what I got. This is the SVG file GraphViz generated for me. If you follow this link and are faced with a blank screen, scroll right until you find the King Alfred node. Then zoom out.

Aristocrats…

(The bubbles are all clickable, by the way.)

Count the Generations.

The graph was interesting but this wasn’t the primary objective of this exercise. I wanted to write “He is the n-times great-grandfather of his current successor Queen Elizabeth.” on King Alfred’s Wikipedia page.

But what’s the n? I already had a collection of all the chains between the two, so I just had to loop through them to find the longest and shortest. The longest chain has 45 links and the shortest chain has 31 links.

King Alfred is a 42-times great-grandfather of Queen Elizabeth Ⅱ.

(And also 28 times-great-grandfather. And everything in between.)

Here’s the simplified graph showing only those lines with exactly 45 links.

All the parental chains from Alfred to Elizabeth that have exactly 45 links.

“Let’s talk about sex.”

Earlier, I mentioned being annoyed that some info-boxes listed two parents instead of a mother and a father, requiring me to make assumptions that fathers are more likely to be included and put first, because these are aristocrats and society is quite patriarchal.

I still wanted to data-mine into matrilineal lines, so to check on those assumptions, I pulled out all of the people linked only in a “Parents” line of the info-box and checked they were all in order. The fathers all had manly names and the mothers all had womanly names. Seemed fine. But just to be sure, I queried my data structure for any individual that was listed as both a mother and a father, expecting that to happen from two different children’s pages.

There were several. Not only that, the contradicting links came from the same page. Someone apparently had the same individual as both his father and mother. Expecting to see the same person linked twice or a similar variety of quirk, I was surprised to see what should have been a very simple info-box to process.

Screen-shot of info-box for Duke Charles Louis Frederick of Mecklenburg

This person has an info-box with two individuals, each unambiguously listed as Father and Mother. Why was my code somehow interpreting the mother as the same individual as the father?

Investigating, I discovered that not only was Adolphus listed as someone’s mother, his actual mother was skipped over entirely. My data-structure simply didn’t have an entry for her.

To try and work out what was going on, I added a conditional breakpoint and looked as my code dutifully added her name to the queue of work, as well as later on when it was taken off the queue. The code downloaded her page as it disappeared into the parser. Yet the response that came back was that she was already accounted for. I beg to differ!

What I hadn’t done was click on her link. She didn’t have her own page, only a redirect to her husband’s page. Apparently, the only notable thing she had done, according to history, was marry her husband.

I later found a significant number of these links where a woman’s name is just a redirect to her husband. If the patriarchy isn’t going to allow me to rely on Mother/Father links as a sign of an individual’s parental role, investigating matrilineal lines will have to wait.

“We call our show, The Aristocrats!”

Acknowledgements and Notes

If you’d like to do your own analysis, I’ve saved the data I extracted into a JSON file you can download. I make no promises about its accuracy or completeness or indeed anything about the file. I’ve even hidden the word “Rutabaga” in there, just to make it clear how potentially inaccurate it is.

I showed a friend an earlier version of the chart and he wondered if I could do it better in Python. Maybe, but equally maybe not. This isn’t the C# of the early 2000s we’re dealing with. HtmlAgilityPack and LINQ combined can do very clever queries to extract data from web pages, often in single lines of code. Maybe there’s a Python component to do the same, I don’t know.

Rather than install GraphViz myself, I found online GraphViz did the job admirably and I’m grateful to them for hosting it. I’m also grateful to my friend Richard Heathfield for telling me about it several decades ago, back when I was thinking about building my own version control system. (Ah, to be young.)

RestSharp is a very nice component for downloading web content for processing. It flattens all the quirks of using the dot-net standard library directly and wraps it all up in a simple and consistent interface.

Oh, and here’s that Wikipedia edit, in all its glory. It was reverted around three minutes later by another editor but never mind.

Update: Hacker News discussion. Also, I am grateful to Denny Vrandečić for his analysis in response to this piece. I’ll be posting a more extensive response to all these soon.

Picture Credits:
📸 “Another batch of klutz” by “makeshiftlove”.
📸 “King Arthur statue in Winchester ” by “foundin_a_attic”.
📸 “</patriarchy>” by “Gaelx”.
📸 “Banana Muffins” by Richard Lewis.
📸 “River Seine” by Irene Steeves.

POP3 – Goodbye to numeric message IDs!

This is the second post in my series describing a number of extensions to the POP3 protocol. The main one is a mechanism to refresh an already opened connection to allow newly arrived messages to be downloaded, which I’ve described on a separate post. This one is a lot simpler in scope but if we’re doing work in this protocol anyway, this may as well come as a package.

I am grateful to the authors of POP4 for the original idea.

This post is part of series on POP3. See my project page for more.

Enumeration

To recap, when a client wishes to interact with a mailbox, it first needs to send a UIDL command to retrieve a list of messages in the form of pairs of message-id integers and unique-id strings. (I’ve written before how UIDL really should be considered a required command for both client and server.)

C: UIDL
S: +OK Unique-ids follow...
S: 1 AAA
S: 2 AAB
S: 3 AAJ
S: .

The numeric message-ids are only valid for the lifetime of this connection while the string unique-ids are persistent between connections. All of the commands (prior to this extension) that deal with messages use the numeric message-ids, requiring the client to store the UIDL response so it has a map from unique-id to message-id.
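
Building that map from the UIDL response is simple enough. A sketch in C#, assuming the multi-line response body has already been split into lines such as “1 AAA”:

// Build the unique-id to message-id map from a UIDL response body.
var messageIdByUniqueId =
    uidlLines
    .Select(line => line.Split(' ', 2))
    .ToDictionary(parts => parts[1], parts => int.Parse(parts[0]));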

This extension allows the client to disregard the message-ids entirely, modifying all commands that have a message-id parameter (RETR, TOP, DELE, LIST, UIDL) to use a unique-id parameter instead.

“UID:”

If the server lists UID-PARAM in its CAPA response, the client is permitted to use this alternative form of referencing a message. If a message-id parameter to a command is all numeric, the server will interpret that parameter as a numeric message-id as it always has done. If the parameter instead begins with the four characters “UID:”, the parameter is a reference to a message by its unique-id instead.

C: DELE 1
S: +OK Message #1 (UID:AAA) flagged for deletion.
C: DELE UID:AAB
S: +OK Message #2 (UID:AAB) flagged for deletion. 

(The POP4 proposal used a hyphen to indicate the parameter was a unique-id reference. I decided against adopting this as it could be confused for a negative number, as if numeric message-ids extended into the negative number space. A prefix is a clear indication we’re no longer in realm of numeric identifiers and may allow other prefixes in future.)

If a client has multiple connections to a single mailbox, it would normally need to perform a UIDL command and store the response for each connection separately. If the server supports unique-id parameters, the client is permitted to skip the UIDL command unless it needs a fresh directory listing. Additionally, the client is able to use multiple connections without having to store the potentially different unique-id/message-id maps for each connection.

RFC 1939 requires that unique-ids are made of “printable” ASCII characters, 33 to 126. As the space (32) is explicitly excluded, there is no ambiguity where a unique-id parameter ends, either with a space (such as with TOP) or at the end of the line.
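
Server-side, telling the two forms apart is a simple test. A sketch in C#, where MessageRef is a made-up type for illustration:

using System;
using System.Linq;

// A message parameter is either all-numeric (a connection-scoped
// message-id) or carries the "UID:" prefix (a persistent unique-id).
record MessageRef(int? MessageId, string UniqueId);

static MessageRef ParseMessageParameter(string param)
{
    if (param.Length > 0 && param.All(c => c >= '0' && c <= '9'))
        return new MessageRef(int.Parse(param), null);
    if (param.StartsWith("UID:", StringComparison.Ordinal))
        return new MessageRef(null, param.Substring(4));
    throw new FormatException("Unrecognised message reference: " + param);
}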

Not-Found?

If a requested unique-id is not present, the server will need to respond with a “-ERR” response. To allow the client to be sure the error is due to a bad unique-id rather than any other error, the error response should include a [UID] response code. (The CAPA response should also include RESP-CODES.)

C: RETR UID:ABC
S: -ERR [UID] No such message with UID:ABC.
C: RETR UID:ABD
S: -ERR [SYS/PERM] A slightly more fundamental error.

It should be noted that a [UID] error might not necessarily mean the message with this unique-id has been deleted. If a new message has arrived since this particular connection opened, the server may or may not be ready to respond to requests for that message. A client should only make the determination that a message has gone if it can confirm it with either a new or refreshed connection.

Extensions yet to come

I’ve pondered whether this extension, once it’s written up in formal RFC language, should modify any future extensions that use message-id parameters. Suppose next year, someone writes a new RFC without having read mine that adds a new command “RUTA” that rutabagas the message specified on its command line.

(What? To rutabaga isn’t a verb? Get that heckler out of here!)

The wording could be: “Any command available on a service that advertises this capability in its CAPA response, that accepts a message-id parameter that is bounded on both sides by either the space character or CRLF, and normally only allows numeric values in this position, MUST allow a uid-colon-unique-id alternative in place of the message-id parameter.”

(In other words, this capability only changes commands where a unique-id with a prefix can unambiguously be distinguished from a numeric message-id.)

My inclination is for the RFC defining this capability to exhaustively list the commands it modifies: just the ones we know about (RETR, TOP, DELE, LIST and UIDL). I would add a note that strongly encourages authors of future extensions to allow UID: parameters as part of their standards. If someone does add a RUTA command without such a note then, strictly speaking, the client shouldn’t try to use a UID: parameter with the RUTA command, but probably will.

I’m on the fence. What do you think?

Restrictions

RFC 1939, which defines POP3, makes a couple of allowances with the UIDL command that would make UID: parameters problematic. A server is allowed to reuse unique-ids in a single mailbox, but only if the contents of two messages are identical. A server is also allowed to reuse a unique-id once the original message using that unique-id has been deleted.

Since these allowances would introduce a complication to which message is being referenced, any server advertising this capability (in RFC language) MUST NOT exercise these two allowances. If a server advertises UID parameters, it is also promising that its unique-ids really are unique.

Fortunately, all mail servers I’ve looked at can already make this promise; either they use a hash but add their own “Received:” header, or they assign an incrementing ID to each incoming message.

billpg.com

Resources
New extension: Mailbox Refresh Mechanism
New extension: Delete Immediately
Prototype POP3 Service

POP3 – A Commit and Refresh Mechanism

POP3 is a popular protocol for accessing a mailbox full of email messages. While small devices have moved their mail reading apps to IMAP and proprietary protocols, POP3 remains the preferred protocol for moving email messages between big servers where a no-frills, download-and-delete system is preferred.

A problem with this protocol is embodied in this question: “How often should we poll for new messages?” There’s a non-trivial overhead to connecting. Poll too quickly and you overload the system. Poll too far apart and messages take too long to arrive.

Over my head!

To recap, let’s take a look at what needs to happen every time a POP3 client wants to check for new messages.

  • The client and server handshake TCP as the underlying connection.
  • The client and server handshake TLS for security.
  • The client authenticates itself to the server.
  • The client finally gets to ask if there are any new messages.

“Oh, no new messages? Okay, I’ll go through all that again in five seconds.”

POP3 doesn’t have a way to avoid this continual opening and closing of connections, but it does have a mechanism to add extensions to the protocol. All it needs is for someone to write the new extension down and to develop a working prototype. Which I have done.

“I have no time for your prototype!”

billpg industries POP3 service

On my github account, you’ll find a prototype POP3 server that implements this extension. Download it, compile it, run it. Go nuts. The service is written using the “listener” model. You set it up to listen for incoming connections and it talks the protocol until you shut it down. The library deals with the complexities while requests for messages are passed onto your provider code.

You don’t need to write that provider code if you only want to try it out. I’ve written a basic Windows app where you can type in new messages into a form, ready for the client to connect and download them. If Linux is more your thing or you prefer your test apps to work autonomously, I’ve also written a command-line app that sits there randomly populating a mailbox with new messages, waiting for a client to come along and get them.

To Sleep, and Goodnight…

Now for the new extension itself. It comes in two parts: the SLEE command puts a connection to sleep and the WAKE command brings it back.

If a server normally locks a mailbox while a connection is open, then SLEE should release that lock. In fact, SLEE is defined to do everything that QUIT does, except actually close down the connection. Crucially, this includes committing any messages deleted with DELE.

During a sleeping state, you’re no longer attached to the mailbox. None of the normal commands work. You can only NOOP to keep the underlying connection alive, QUIT to shut it down or WAKE to reconnect with your mailbox.

If the server responds to WAKE with a +OK response, a new session has begun. The refreshed connection needs to be viewed as if it is a new connection, as if the client had QUIT and reconnected. The numeric message IDs from before will now be invalid and so the client will need to send a new STAT or UIDL command to update them.

In order to save the client additional effort, the server should include a new response code in the +OK response.

  • [ACTIVITY/NEW] indicates there are new messages in the mailbox that were not accessible in the earlier session on this connection.
  • [ACTIVITY/NONE] indicates there are no new messages this time, but it does serve as an indication that this server has actively checked and that it is not necessary for the client to send a command to check.
  • (No “ACTIVITY” response indicates the server is not performing this test and the client will need to send a STAT or UIDL command to retrieve this information.)

A server might give an error response to a WAKE command, which may include a brief error message. In this situation where a connection can’t be refreshed for whatever reason, a client might choose to close the underlying connection and open a new one.

The error might include the response code [IN-USE] to indicate that someone else is connected to the mailbox, or [AUTH] to indicate that the credentials presented earlier are no longer acceptable.

Note that the SLEE command is required to include a commit of DELE commands made. If the client does not want the server to commit message deletes, it should send an RSET command first to clear those out.

How would a client use this?

To help unpack this protocol extension, here is a description of how the process would work in practice with a server that implements this extension.

The client connects to the server for the first time. It sends a CAPA request and the response includes SLEE-WAKE, which means this connection may be pooled later on.

S: +OK Welcome to my POP3 service!
C: CAPA
S: +OK Capabilities follow...
S: UIDL
S: RESP-CODES
S: SLEE-WAKE
S: .

The client successfully logs in and performs a UIDL which reveals three messages ready to be downloaded. It successfully RETRs and DELEs each message, one by one.

C: USER me@example.com
S: +OK Send password.
C: PASS passw0rd
S: +OK Logged in. You have three messages.
C: UIDL
(Remainder of normal POP3 session redacted for brevity.) 

Having successfully downloaded and flagged those three messages for deletion, the client now sends a SLEE command to commit those three DELE commands sent earlier. The response acknowledges that the deleted messages have finally gone and the connection has entered a sleeping state.

C: SLEE
S: +OK Deleted 3 messages. Sleeping.

The client can now put the opened connection in a pool of opened connections until needed later. It is not a problem if the underlying connection is closed without ceremony in this state, but it may be prudent for the pool manager to periodically send a NOOP command to keep the connection alive and to detect if any connections have since been dropped.

C: NOOP
S: +OK Noop to you too!

Time passes and the client wishes to poll the mailbox for new messages. It looks in the pool for an opened connection to this mailbox and takes the one opened earlier. It sends a WAKE command to refresh the mailbox.

C: WAKE
S: +OK [ACTIVITY/NONE] No new messages.

Because the server supports the ACTIVITY response code and the client recognized it, the client knows immediately that there is nothing left to do. It sends a second SLEE command to put the connection right back into the sleeping state.

C: SLEE
S: +OK Deleted 0 messages. Sleeping.

(Incidentally, the client is free to ignore the “ACTIVITY” response code and instead send a STAT or UIDL command to make its own conclusion. It will need to do this anyway if the server does not include such a response code.)

More time passes and the client is ready to poll the mailbox again. As before, it finds a suitable opened connection in the pool and it sends another WAKE command.

C: WAKE
S: +OK [ACTIVITY/NEW] You've got mail.

Having observed the notification of new mail, the client sends a new sequence of normal POP3 commands.

This was all within one connection. All the additional resources needed to repeatedly open and close TCP and TLS are no longer needed.
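
In client code, the polling loop collapses into something like this sketch. (Pop3Connection and its methods are hypothetical stand-ins for whatever client library you have; only the SLEE/WAKE flow comes from this proposal.)

// Poll a mailbox over a single pooled connection, sleeping between polls.
async Task PollLoopAsync(Pop3Connection conn, TimeSpan interval)
{
    while (true)
    {
        await Task.Delay(interval);
        string wake = await conn.SendCommandAsync("WAKE");
        if (wake.Contains("[ACTIVITY/NEW]"))
            await DownloadAndDeleteAllAsync(conn);  // the normal RETR/DELE cycle
        await conn.SendCommandAsync("SLEE");  // commit deletes, back to sleep
    }
}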

billpg.com

I’ll be doing the job of turning all this informal text into a formal RFC on my github project.

Picture Credits:
📷 Keyboard Painting Prototype by Rodolphe Courtier.

POP3 – UIDL is a required command!

RFC 1957 observes, discussing mail reading software that implements the popular POP3 protocol: “two popular clients require optional parts of the RFC. Netscape requires UIDL, and Eudora requires TOP.”

This reads like a complaint, but this tells me that Netscape’s mail reader (which these days is called Thunderbird) is well designed.

The rot started with RFC 1939, the standard for this protocol. This document specifies that UIDL is optional. This was a mistake. Without UIDL, the protocol is not reliable. I write this in the hope of persuading you that UIDL should not only be considered a requirement for a POP3 server, but that any client software that doesn’t require UIDL should not be trusted. I’m looking at you, Eudora!

What is UIDL and how does it fit into POP3?

UIDL is the “directory listing” command in POP3. When a client issues this request, the server responds with a list of “unique-id” strings that may as well be considered file names.

Opening a POP3 connection, authenticating and performing a “directory listing”.

Each unique-id is paired with a numeric id, starting from 1. The other commands to download and delete messages all use these numeric ids. Each time the client reconnects, it will need to repeat the UIDL command so it knows which numeric ids refer to which messages.

For something as fundamental as a directory listing, it seems odd for that to be optional.

Without UIDL, the client needs to fall back onto those numeric message ids alone. Instead of UIDL, the STAT command returns the number of messages in a mailbox. With that, the client can loop from 1 to n, downloading and deleting each one, leaving the mailbox empty once they have all been downloaded. As POP3 is explicitly designed for download-and-delete operation and not keeping the messages on the server, you might consider that UIDL is not necessary. So let us follow that road where we don’t have UIDL.

Living in a world without UIDL.

Operating POP3 without UIDL only works in an ideal world. If you had 100% reliable connections to the server then you might get away with it. Reality tells us the world is not ideal.

Let’s think about the step of deleting a message once you’ve downloaded it. You might think that DELE is the request to delete messages you’ve downloaded (or don’t want), but the request to actually delete messages is QUIT.

The client flags the messages to delete with DELE, but those deletes aren’t committed until the client later issues a QUIT request. If the connection stops before a QUIT, the server has to forget about those DELE commands and the messages all have to remain in the mailbox for when you reconnect. This is by design as you wouldn’t want your messages deleted if your client is in an unstable environment that can’t keep a connection open.

Consider though, what would happen if the underlying connection was dropped just as the client issued a QUIT request. You sent the request but no response came back.

Download and delete a single message, but the connection fails a critical point.

What happened? We don’t know. We can’t know. There are three reasonable possibilities…

  • The QUIT command never arrived at the server. The server just saw the connection drop.
  • The server couldn’t process the delete and responded with an error, which got lost.
  • The server successfully deleted the messages, but the response got lost.

You asked for some messages to be deleted, but you don’t know if your instruction was processed or not. The only way to find out is to reconnect (when you can) and see if the messages you asked the server to delete has gone or not.

Let’s say that time has passed and the client is finally able to reconnect to the server again. Last time, the client downloaded a single message and may or may not have deleted it. Now we’ve reconnected we find a single message in the mailbox. Is this the one we deleted before or a new one that’s arrived in the interim? A handy directory listing would be real useful right about now!

This is why I would mistrust any mail reading software that didn’t require that a mail server implements UIDL. Messages might get downloaded twice or wrongly deleted if the wrong assumptions are made.

“Come back!”

The alternatives to UIDL are all unreasonable.

If the above doesn’t convince you that UIDL is necessary, this section answers the counter-arguments I anticipate. Nuh huh!

(If you are already convinced and you don’t want to read my responses to anticipated arguments, you can skip this section.)

“That scenario you describe won’t ever happen in reality.”

Stage one: Denial.

Where is this perfect world where connections don’t stop working at the worst possible time? Where database updates happen instantly? I want to live there!

Think about what a server needs to do to process a QUIT command. Many flagged messages will need to be modified in an atomic transaction such that they won’t be included next time. Indexes will need to be updated and the dust needs to settle before the server can send its acknowledgement. During this time, the underlying TCP connection will be sitting there idle, looking just like a timeout error.

“We wouldn’t have a problem if mail servers were better engineered!”

Stage two: Anger.

If your requirements of a mail server include underlying connections over the public internet that never fail, I think your requirements are a little unreasonable.

“So I occasionally see two copies of a message in my mailbox. Big whoop!”

Stage three: Bargaining.

If that started happening in software I was using, I’d file a bug report.

“There are other ways POP3 can resolve this issue.”

Stage four: Depression.

Alas, all of the alternatives that POP3 provides are unreasonable.

You could use the response to LIST as a fall-back? This command requests the size in bytes of each message. Most messages are long enough that their sizes will usually differ, but this isn’t reliable. Messages are often going to have the same size as others just by accident.

You could use TOP to retrieve just the header and extract something from that to track messages? The problem there is that no single header is a reliable identity. Two adjacent messages might have the same date or the same subject. The closest candidate for a suitable identity is Message-ID, but this is generated by the sender, who might not include it or might reuse IDs. If we’re relying on the POP3 server to add missing ones or modify duplicates provided by a sender, we’re back to relying on optional features.

You could use the TOP response and hash the entire header? This could work except message headers can change. I first saw this when experimenting with a mail server and observed that if I connected to a mailbox using IMAP, it would leave IMAP’s version of a unique identity in the header which wasn’t there before. As well as that, anti-spam systems might re-examine a mailbox’s contents and update the anti-spam or anti-virus headers. Any of these changes would look like a new message.
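For illustration, here’s a sketch of that header-hashing idea with poplib’s TOP command and hashlib, with the fragility noted in a comment:

   import hashlib
   import poplib

   mailbox = poplib.POP3_SSL("pop.example.com")  # placeholder hostname
   mailbox.user("alice")
   mailbox.pass_("hunter2")

   message_count, _ = mailbox.stat()
   for numeric_id in range(1, message_count + 1):
       # "TOP n 0" fetches the headers and zero lines of the body.
       response, header_lines, octets = mailbox.top(numeric_id, 0)
       digest = hashlib.sha256(b"\r\n".join(header_lines)).hexdigest()
       # If the server or an anti-spam filter rewrites any header, this
       # digest changes and the message looks brand new.
       print(numeric_id, digest)

   mailbox.quit()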

(As well as all that, TOP is itself an optional command, just like UIDL.)

You could download the entire message again and ignore it if you already have it? This would be the ultimate fall-back. While I’ve seen headers change, the message body seems to be immutable. It’s still an unreasonable situation: we’re downloading the whole message again, just because the server chose not to implement a simple directory-listing command.

Am I certain that the message body is immutable? No, not at all. If someone commented that mail server XYZ updates messages in the form of a MIME attachment, I wouldn’t be at all surprised.

Update – A digression on the Message-ID header

(Added 28/Jan/2021)
I am grateful to commenter “theamk” on Hacker News, who responded when I shared this post. To my dismissal of Message-ID as a means of de-duplication, they noted that the RFC standards require Message-IDs to be generated as unique.

I have experienced senders who have broken the protocol, sending many different messages with the same Message-ID. I do not dispute that those senders were in the wrong, only that the POP3 server is not in a reasonable position to correct the situation.

If the server actively corrected the situation and replaced the reused Message-ID header with its own unique value, the message would no longer be a faithful reproduction of the message as sent, which would further damage any scope for auditing.

If the server discarded or rejected the message with a reused Message-ID, it would open up means for an attacker to predict the Message-ID a legitimate sender is going to use and send a message with that ID first, causing the legitimate sender’s message to be lost. There’s nothing stopping a sender from using someone else’s Message-ID pattern. (Maybe senders should use only unpredictable strings, but wishing it so won’t make it happen.)

This is all to say nothing of the situation where the messages served up don’t have any Message-ID at all, which I’ve seen happen with messages exchanged within the local server only. (i.e. not routed over the public internet’s mail servers.) None of the small number of services inside the box, from the original composer to the POP3 delivery agent, supplied a Message-ID when it was missing, so the message turned up with the basic To/From/Subject/etc headers and a Received header, but no Message-ID.

Acceptance?

Because the alternatives are so unreasonable, I consider UIDL a requirement for handling POP3. Servers that don’t implement UIDL are bad servers. Clients that can work without UIDL are unreliable.

Still not convinced? Please leave a comment where you saw this piece posted.

“I’ve seen the future! I’ve seen the future! I’ve seen the future and it’s now!”

IMAP does it wrong.

The other popular mail-reading protocol is IMAP. In contrast to POP3’s download-and-delete model, IMAP’s model is that messages stay on the server and are only downloaded when the client wishes to read them. This model enables mail readers on low-storage devices such as smartphones.

With IMAP, the IDs are restricted to numeric values that always go upwards, in contrast to the free-for-all “any printable ASCII except spaces” allowed by POP3. While this may be nice for the client, requiring a single source of incrementing ID numbers complicates matters for anyone wishing to implement an IMAP server using a distributed database as a back-end.

But the worst thing about IMAP’s message identity system is that the standard permits the server to discard any IDs it has assigned by updating a mailbox’s UIDVALIDITY property. If this value ever changes, it is a signal to the client that any unique IDs it may have remembered are no longer valid.

A client needs a reliable way to identify messages between connections so it can recover from an unknown state. It does not need servers with a license to be unreliable.

If a mail server that implements IMAP wants any respect from me, it should document that its UIDVALIDITY value is fixed and will never change, and that the unique-ids it generates are reliable.

POP3 does it wrong too.

If I’m going to criticize IMAP for flaws in its unique ID system, I should address flaws in POP3’s system too, having spent most of this article praising it.

Quoth RFC 1939: “The server should never reuse an unique-id in a given maildrop,” (good) “for as long as the entity using the unique-id exists.” (no!)

Consider that worst case scenario. The client flags a single message to be deleted and finally issues a QUIT command to complete the transaction. The server successfully processes the request but the response to the client is lost. As far as the server is concerned, the message is gone and there’s no problem, but as far as the client knows, the continued existence of that message is unknown.

Now consider a new message arrives on the mail server and because the RFC says it can, it assigns the same unique ID to this new message as the one that was just deleted. The client eventually reconnects and requests the list of unique IDs and finds the ID of the message it wanted to delete is still there. It doesn’t know the server used its right to reuse unique IDs and that this is actually a new message!

Now, I’ve never seen a mail server actually reuse a unique ID. The clever people who have developed mail servers in the real world seem to understand that reusing IDs is not something you ever want to do, even if the RFC says you can.

RFC 1939 also says, “this specification is intended to permit unique-ids to be calculated as a hash of the message. Clients should be able to handle a situation where two identical copies of a message in a maildrop have the same unique-id.”

Unique IDs don’t have to be unique? Ugh.

This allowance only applies to identical messages. In reality, messages are never identical. After bouncing around the internet and going through various anti-spam and anti-virus servers, messages do accumulate a frightening number of Received: headers left behind from each intermediate hand-over. Each one with a time-stamp and its own ID number. Any one of these is enough to produce a distinct hash.

Picture Credits. (All Creative-Commons licensed.)
“Listening to Radio Karnali” by “BBC World Service”.
“List 84” by “Weisbaden 2010”.
“The Time of Sunset” by Joy Sarah Nawati.
“Future” by “Legosz”.
“PuTTY screen-shots” by me.

Why I willingly bought a Windows Phone

Without shame or apology, I use a Windows Phone. A bright orange Lumia 630. I purchased it with my own money. No-one pushed me to it or chose it for me. It was entirely my decision.

But why?!

Phones

My story starts in 2012 when I had outgrown my aging Symbian phone. After considering a number of options, I purchased an Android-based Samsung Galaxy S2.

I had considered an iPhone at the time, but the main reason I didn’t was that I’d have to buy into the Apple ecosystem, which just wasn’t for me. My primary computer platforms were Windows based and moving to iPhone would be a big culture shock. My Samsung instead fitted into that world quite neatly and I’d remain happy with my choice for years.

Stage Fright!

In 2015, a security vulnerability (known as Stagefright) was found in many versions of Android, including the one on my phone. All it would take was for someone to send me a malicious text message in the night and my phone would be taken over.

Not to worry, new phones had already been fixed and I was sure it would only be a matter of time before that same fix would be pushed out to older phones like mine. Every day for a few weeks, I’d go into the phone’s check-for-updates system to see if a fix was available. Every day, there wasn’t. I’d call tech support to ask when (not if) a fix would become available. “Soon” was always the infuriatingly non-specific answer, occasionally along with the subtle suggestion that maybe I should buy a new handset instead.

Finally, I just couldn’t take it any more and gave up. My phone, despite being only three years old, was considered too old to be updated. The risk of keeping it switched on, waiting for a drive-by attacker, was giving me too much stress. I switched the phone off and put it away, never to be used again.

Normally, there will come a natural time with each phone I use when I start to feel it is time to upgrade, having simply outgrown the old one. When that happens, I keep using the old one while I take my time to consider my choices. This time was different.

It was clear to me now that the Android ecosystem had a problem. Security vulnerabilities were not being taken seriously by the handset makers who would rather I just purchased a new device instead. If I had bought a new Android phone back then, I’d be supporting that attitude with my cash!

Choices

Having lost trust in Android, I was left choosing between Apple and Microsoft. At first, I wasn’t even considering Windows Phone, having had bad experiences with the platform some ten years earlier. But faced with an iPhone as my only remaining choice, I was willing to give the new Windows Phone a try.

Trying out a Lumia 630, I was pleasantly surprised. The tile concept was a welcome relief from the “Space Invader” style rows-of-icons that dominate the rest of the market. Suitably impressed with the whole package I ended up buying one and I’ve not looked back. (Except to write this.)

The lack of apps for this platform is a little annoying, but I get by. I have instant-messaging, a podcast player, a weather tile on the home screen and a few others. For everything else, I use a number of “M Dot” websites. (m.facebook.com, m.youtube.com, etc.)

The Future

How long, after having purchased a smartphone, is it reasonable to expect support in the form of security updates? Back when “Stagefright” happened, I found the answer for the Android ecosystem was 1½ years. That’s just way too short in my book.

My Lumia 630 is around two years old as I write this and I’ve just installed an update that fixes the WPA2 “KRACK” bug. If I had purchased another Android based phone back in 2015, would I now have an update for this new bug? (Or, would I be back down the shops spending more money to enrich the handset makers who are laughing at the chump that I am…)

While I’m not planning on replacing my phone any time soon, it’s likely I’ll feel I’ve outgrown it a couple of years down the road, especially as Microsoft have announced they will not be actively developing the platform any more, except for security updates. When that day comes, I hope Android will have taken a tip from Microsoft on how to do updates right.

Picture Credits
Microsoft Lumia 630 running Podcast Lounge. By me, ironically enough, using an iPhone.
Tension, 91/365 by Matt Harris.
Future by “Legosz”.
(Pictures are Creative Commons licensed.)

Is your API broken?

“Welcome to the Example Rutabaga Company. We’ve got a simple REST API for all your rutabaga needs!”

Indeed, it is simple…

   POST https://rutabaga.example.com/Order/ HTTP/1.1
   Content-Type: application/json

   {"Quantity": 5800,
    "Quality": "Tasty!",
    "DeliverTo": "123 Fake Street, New Orleans"}

Send this and you’ll either get an error or an “OK” response with a tracking ID inside. Later, you’ll get several thousand tasty rutabagas in the post. What could go wrong?

Everything.

Schrödinger’s Response

From the client’s point of view, there’s a clear action to take depending on the response code.

  • 200, log the tracking ID.
  • 5xx, try again later.

But what if there’s no response? Perhaps your friendly HTTP client library code has thrown an exception because the connection has broken down. These errors are unavoidable, especially when the client is on a mobile device. What should we do in this situation?

You could try again later? But hang on, this violates the thing that makes POST different from GET and PUT. (GET and PUT are designed to be safely repeatable, “idempotent” in the jargon, but POST requests are express calls to take action.)

You might reason that the first POST request failed, so you’re not actually repeating anything, but aren’t you? There are two possibilities when you get an error from any sort of network request.

  A. The request was lost on the way and the remote server did not handle the request.
  B. The request arrived and was handled, but the response to the client was lost.

If A, we’re fine to repeat the POST. No problem.
If B, the remote server is already in the process of shipping a truckload of rutabagas to you and has no idea the response got lost. Repeat that request and you’ll end up with two truckloads of rutabagas.

But this is the point: the client has no way of knowing whether it’s A or B. The only entity that knows is the server, and we can’t talk to it.

For a surprising number of APIs I’ve written client code for, that’s the end of the story. The API simply has no reliable way for the client to find out what happened.

How does your API handle this situation? Is your API broken?

Opening the box

One way an API designer could resolve this issue is to provide a way to look up the order history.

This is probably what you’d do if (say) you were shopping online and your internet connection died just as you hit the Complete Purchase button. Once you got back online, you’d check to see if the order was in the system before repeating the order.

Sounds simple? This would work, but be careful, for alas, this approach has lots of caveats. Fortunately, none of them are really insurmountable.

Beware of false duplicates

Say you’re in this worst case scenario and your link to the server has just been restored. Your code dutifully downloads the list of outstanding orders and finds one for 5800 rutabagas. Job done?

Wait! Was that your order? Maybe the account holder deliberately made another identical order from a different machine. We don’t know – We can’t know.

This can be resolved by ensuring the client has the opportunity to supply its own way to identify the initial request – perhaps with a client-supplied ID – and allowing for a lookup later on.
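Here’s a sketch of the client’s side, in Python with the third-party requests library. The ClientRequestId field and the lookup URL are hypothetical, standing in for whatever your API provides:

   import uuid
   import requests

   # The client picks its own unique ID before sending anything. A
   # correctly-generated GUID needs no co-ordination with anyone else.
   request_id = str(uuid.uuid4())
   order = {
       "ClientRequestId": request_id,  # hypothetical field
       "Quantity": 5800,
       "Quality": "Tasty!",
       "DeliverTo": "123 Fake Street, New Orleans",
   }

   try:
       response = requests.post("https://rutabaga.example.com/Order/",
                                json=order, timeout=30)
       response.raise_for_status()
   except requests.RequestException:
       # No usable response - we don't know if the order was placed.
       # Ask the server, using the ID only we could have generated.
       check = requests.get("https://rutabaga.example.com/Order/ByClientId/"
                            + request_id, timeout=30)  # hypothetical endpoint
       if check.status_code == 404:
           pass  # The order never arrived; it is safe to repeat the POST.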

How long should we keep that ID around?

Expire ID records too quickly and a client that’s been offline for a prolonged amount of time will not be able to resynchronize. Store the IDs forever and that would be a waste of space.

You may have a figure in mind that’s reasonable. If not, add an occasional reconciliation of expired IDs to your API.

Who chooses the ID?

The client should be able to freely choose an ID. You may be looking at your database and thinking there’s a field supplied by the client that already has a no-duplicates constraint. But if those values come from a source external to the client, the client won’t be able to control the uniqueness of those important values. That external entity might very well be feeding identical records into the system through different channels, and the client won’t know if the duplicate it found was its own or someone else’s.

Whose ID is it anyway?

Make sure the client has a clear space from which to select IDs. We can’t have multiple users all counting from 1 because you’ll get collisions very quickly. A GUID would work, as long as it is generated correctly. Maybe if the API requires that the client log in first, the server could track IDs on a per-user basis, but not all APIs require a log-in or pre-registration.

Avoid colliding with prior attempts still being processed.

Consider this: A client attempts to send a request to a server, but the connection fails with a time-out error. Thirty seconds later, the client asks the server if that prior request made it, to which the server answers “No”. Time to repeat that first attempt?

But wait! That first attempt timed out because the server was unexpectedly busy and has only just started dealing with your first request.

You can mitigate this (probably rare) scenario by making sure the server will return an error to the second POST request. Almost all databases allow any field or combination of fields to have a uniqueness constraint, so the error will happen by itself if this scenario ever plays out.
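A minimal sketch of the server’s side of that, using SQLite here, though any database with unique constraints behaves the same way (the table and column names are made up for illustration):

   import sqlite3

   db = sqlite3.connect(":memory:")
   db.execute("""CREATE TABLE orders (
                     client_request_id TEXT UNIQUE,  -- one row per client ID
                     quantity INTEGER)""")

   def insert_order(client_request_id, quantity):
       try:
           with db:  # commit on success, roll back on error
               db.execute("INSERT INTO orders VALUES (?, ?)",
                          (client_request_id, quantity))
           return "created"
       except sqlite3.IntegrityError:
           # A second POST with the same ID lands here, whether it's a client
           # retry or a prior attempt the server is still working through.
           return "duplicate"

   print(insert_order("abc-123", 5800))  # created
   print(insert_order("abc-123", 5800))  # duplicate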

Do you have a ticket?

There’s another protocol that works in a similar way but puts the server in control of the IDs, at the cost of requiring two separate phases. (The actual request could be carried along with either the first or the second phase.)

The first phase has the client asking the server for an ID while the second phase has the client committing to complete the transaction with that ID.

This protocol does require that once the client begins phase two, it has committed to not returning to phase one for this transaction. The client must also store that ID and be ready to use it once the connection has been restored. Similarly, the server needs to agree that it only starts processing a transaction once the second-phase request has arrived.

This two-phase approach covers for failures at any step along the conversation, so long as the client and server stick to the agreement.

  • If the first request is lost, there’s no problem in repeating the first phase.
  • If the first response is lost, the server will have allocated an ID that will never be committed, but will be left indefinitely in an uncommitted state. (A later occasional reconciliation of orphaned IDs would be useful here.)
  • If the second request is lost, the client can later repeat the commitment of the transaction after checking its state using the ID it received in the first phase.
  • If the second response is lost, the client can later check the state of the transaction using the ID and see that it is already committed.
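To make the agreement concrete, here’s a sketch of the client’s half of the dance, again in Python with requests. The /Order/Begin/ and /Order/Commit/ endpoints and the response shape are hypothetical:

   import json
   import requests

   BASE = "https://rutabaga.example.com"

   def save_to_durable_storage(transaction_id, order):
       # Stand-in: a real client writes this somewhere that survives a crash.
       with open("pending-" + transaction_id + ".json", "w") as f:
           json.dump(order, f)

   def place_order(order):
       # Phase one: ask the server for an ID. Safe to repeat until it works.
       while True:
           try:
               begin = requests.post(BASE + "/Order/Begin/", timeout=30)
               begin.raise_for_status()
               transaction_id = begin.json()["Id"]  # hypothetical shape
               break
           except requests.RequestException:
               continue  # a real client would back off between attempts

       # We now have an ID. From here on, we are committed to it and must
       # never go back to phase one for this transaction.
       save_to_durable_storage(transaction_id, order)

       # Phase two: commit. Also safe to repeat, because the server agrees
       # to act only once per ID.
       while True:
           try:
               commit = requests.post(BASE + "/Order/Commit/" + transaction_id,
                                      json=order, timeout=30)
               commit.raise_for_status()
               return transaction_id
           except requests.RequestException:
               continue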

This protocol has a similar caveat to the earlier plan: how long should the server keep track of used ID numbers? The server will be left with IDs that will never be committed, as well as committed IDs that the client might still need to check up on later. Again, you may wish to come up with reasonable time limits or allow for a reconciliation of IDs later on.

While this protocol might be considered more complicated because of the two phases of conversation, there are fewer caveats and fewer opportunities for things to go wrong. This is my personal favorite.

Do I really need to do this?

As I write this, I’m also working on a small web service that uses a REST API with POST requests, yet I’m taking none of the advice I offer on this page. Why not? Simply that the cost of the resources allocated by this API-to-be is so close to zero that making the effort to implement the API robustly is just not worth it in this particular case.

But consider, even if you’re not transmitting invoices worth thousands of dollars, do you really want duplicates turning up?

Picture Credits
“Rutabagas” by Dale Calder
“Barney the cat” by Bill P. Godfrey (me).
“Rutabaga 2” by Dolan Halbrook
“Commit no nuisance” by Pat Joyce

I need a good podcast catcher (and a bit of a rant)

I listen to podcasts on my daily commute. These are radio shows that can be downloaded over the internet and listened to later. However, to keep up with a weekly show, I’d have to – every week – visit the show’s website and manually download the latest episode. That would get real tedious real fast. To resolve the tedium for us all, the podcast catcher app was invented.

Podcast catchers allow me to list all the shows I want to listen to. Every day or so, the app automatically checks each show on the list to see if there are any new episodes for me. If it finds any, it downloads them and plays them for me.

Currently, I use Google’s ‘Listen’ app, but that service is about to be closed down along with the imminent closure of Google Reader. I need to replace it. I’ve downloaded a handful of alternative apps, but they all lacked a feature I find essential. I remain a little flabbergasted that any podcast app out there does it any other way.

“She smoothes her hair with automatic hand and puts a record on the gramophone.”

My daily commute is ~45 minutes of driving each way, so for me, a good player needs an Auto-Play mode. When one show finishes, another should start playing right away. There are very few places I could safely pull over, and having to push buttons while I’m driving is right out.

But not just any Auto-Play mode. Oh no. All the apps I tried had an Auto-Play mode, but they all did it so very badly.

Ask yourself – When a show finishes playing and Auto-Play is switched on, which show from the list of unplayed shows should your app select to play next?
   A. The one that’s been waiting in the queue longest.
   B. The one that appears next in the list when sorted by episode title.

Did you pick A or B? Sorry, they’re both wrong, and yet these were the only options available on an awful lot of podcast apps.

The right answer is to play the one the user has queued up next. The “In the order I want” sort criterion. No really, who is actually asking for the order of play-back to be strictly enforced? Would anything else, perhaps, offend your sense of politeness?

   “You want to listen to the latest Cognitive Dissonance show? But what about this episode of Hanselminutes? It has been waiting patiently in line and this is its turn to be played.”
   “I say! That would be jolly impolite of me. Don’t want to hurt the feelings of those audio files. Pip pip!”

“I sat upon the shore, fishing with the arid plain behind me. Shall I at least set my lands in order?”

With Google Listen, new episodes join the listening queue, but I can arrange them in the order I like. If I’m just not in the mood for the next episode in line, I’ll select another episode that I do want to listen to and bring it to the top using the ‘Move to the top of queue’ button.

Once I’m happy with my selection of the next hour or so’s worth of stuff at the top of the queue, I hit play and drive off. As the first show finishes, it’s taken off the queue and the next episode I had queued up starts playing, all without any interaction.

The few alternative apps I downloaded did not offer this. It seems such a simple thing, and yet I can’t fathom the insanity of not being able to control the playing order.

If one, settling a pillow by her head should say, “That is not what I meant at all.”

Some people reading this, I’m sure, are thinking “He wants a playlist manager”.

To manage a playlist, you’d need to first create a playlist and give it a name. Then you’d need to add shows to the list and save it. Then, once it’s played, you’d need to delete that playlist and start a new one.

No. That’s just another level of insanity. All I want is a button on each episode labelled ‘Move to the top of the queue’. That’s it. If I have to perform some ritual every day to create a new playlist or whatever before I can get that button, I’m not going to be happy. Life is too short for pointless ritual.

Maybe if your UI is so user-friendly that the ritualistic parts of your playlist manager just disappear, that’s fine, but that’s not what I’ve seen out there.

“Oh, I have to choose a name for this new playlist. Why not just pick a random name for me? I’m only going to delete it in an hour’s time anyway.”

So there is my plea. Does anyone know of a podcast app for Android phones that implements its Auto-Play mode… correctly? I will happily pay a reasonable subscription fee for good quality software.

If you’re an app developer and your podcast app does it correctly, please feel free to use this page’s comments for some free publicity. On the other hand if your app doesn’t do it right, please treat this page as a bug report.

Picture credits:
“Day 30.06 Voices on the radio!” by Frerieke on Flickr.
“Listening to Radio Karnali” by the BBC World Service.
The section titles were borrowed from The Waste Land and The Love-Song of J. Alfred Prufrock, both by T.S. Eliot.