Good and Bad URLs

Designing for permanence

This essay discusses permanent URLs. In many cases, it is appropriate to design your URLs so that they will still be good years from now. This article is based in part on Cool URIs Don’t Change by Tim Berners-Lee (the guy who invented the World Wide Web), although I’m taking a somewhat different approach, focusing on the more practical aspects of today while planning for tomorrow. (Oh, just to be clear: in the context that Tim uses it, “URI”, a Uniform Resource Identifier, means the same thing as a URL. Strictly speaking, the two terms are not interchangeable, since all URLs are URIs but not all URIs are URLs. A lot of what’s said here could also apply to pure URIs, though.)

A permanent URL is not necessary if you don’t expect your website to be around for long, or if you’re running some small personal homepage and you really don’t mind if all your links break in the future. But if you’re running a serious website, you’ll want to design your URLs to last. Permanent URLs are also generally not necessary for media other than text or hypertext. After all, you typically don’t expect (or even allow) an external website to link to an image or sound on your server, so the URLs for them don’t need to be built to last. Anything you expect to be linked to externally should have a good URL, though.

(Note: When I migrated this site to Jekyll, I did change the URLs of my images. I found that there was indeed a reason to keep the image links the same, if I wanted to: Google Images. So if you care about Google Images, don’t change the URL of your images.)

By the way, I’ll disagree with Tim Berners-Lee on one point: that a URL should ideally never change. I think it’s OK to let a URL change sometimes, even in nearly ideal circumstances. But if a URL changes, the old URL must redirect to the new one (or at least be a page that links to the new one). If the new URL also changes, another redirect needs to be put in place, and you may also want to update the first redirect to point directly to the newest location, because some clients give up after following a handful of redirects in a row (the original HTTP/1.1 specification suggested a limit of five). Obviously, it’s easier if you rarely let your URLs change in the first place.
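To make the "point the first redirect directly at the new place" advice concrete, here is a minimal sketch in Python. It assumes your redirects are kept in a simple mapping from old path to new path; the function name flatten_redirects and the example paths are made up for illustration and are not taken from any particular server.

```python
def flatten_redirects(redirects):
    """Return a copy of `redirects` where every old URL points at its final target."""
    flattened = {}
    for start in redirects:
        seen = {start}
        target = redirects[start]
        # Follow the chain until we reach a URL that is not itself redirected.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target!r}")
            seen.add(target)
            target = redirects[target]
        flattened[start] = target
    return flattened


# Hypothetical paths: the page moved twice, so the oldest URL needs two hops.
redirects = {
    "/chess.html": "/games/chess.html",
    "/games/chess.html": "/games/chess/",
}
print(flatten_redirects(redirects))
# {'/chess.html': '/games/chess/', '/games/chess.html': '/games/chess/'}
```

After flattening, every old URL reaches the current page in a single hop, however many times the page has moved.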

What makes a URL good?

A good URL is completely boring. It tells you exactly what you need to know, no more, no less. As you’re probably aware, URL stands for Uniform Resource Locator. Let’s dissect the meaning of this phrase, to better understand what we’re up against.

Uniform refers to a universal standard. All URLs follow the same basic format, with each component having a basic meaning.

Resource refers to the thing being located. A resource is often a file, but it doesn’t need to be. It could be the result of a database query, for example. It’s better to think of “resources” as abstract things separate from files. We’ll return to this point in a moment.

Locator refers to the means of finding the resource. This manifests itself in the form of a domain name, like www.example.com, and a pathname, such as /games/chess/.
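If you want to see these components for yourself, Python’s standard urllib.parse module will split a URL into them. This is only an illustration using the example domain and pathname from above.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.com/games/chess/")

print(parts.scheme)  # http
print(parts.netloc)  # www.example.com  (the domain name)
print(parts.path)    # /games/chess/    (the pathname)
print(parts.query)   # empty string: this URL has no query string
```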

So, a URL is simply a means of finding a resource. That’s all it should do. Now let’s look at some bad URLs. Seeing what not to do will make the idea of keeping it simple clearer. The first place I went to is microsoft.com, since they’re notorious for having badly designed URLs. Here’s the URL (as of this writing; no doubt it will change) for the download page for MSN Messenger:

http://www.microsoft.com/downloads/details.aspx?
FamilyID=0b88ccbf-4c52-4347-aa71-87184a13ac1c&DisplayLang=en

Look at that monstrosity! Heck, I had to split it up into two lines so that it didn’t run off the edge of the page (at some resolutions). We’re fine until details.aspx. The .aspx signifies nothing to the user and does not belong (see below for more on the matter of filename extensions). The FamilyID string means nothing to the user and just makes the URL long. The DisplayLang=en part isn’t as bad, because the fact that the resource is in English is actually relevant to the user. Finally, the query string is ugly, and it signals that a script is being used, which is irrelevant to the user. A better URL would look something like this:

http://www.microsoft.com/downloads/MSN_Messenger_7.5/Windows

It’s informative, to the point, and doesn’t contain any crap. It clearly, unambiguously, and succinctly states that this is the place to download MSN Messenger 7.5 for Windows. Notice that I have /Windows after the MSN_Messenger_7.5 bit. It isn’t a huge issue, but it makes more sense to me to put the name of the application before the platforms, because the application itself is more important than what platform it happens to run on. Doing it the other way still gets the point across, though.

This is just the way the URL is presented to the user. It doesn’t have to have any relation at all to the way things are done under the hood. In fact, how things are done under the hood is irrelevant to the user, so such details should not get in the way of a good URL.
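Here is one way that separation could look, sketched in Python. To be clear, this is not how Microsoft’s site works; the lookup table and the function name internal_target are hypothetical. The point is only that the server can translate the clean public path into whatever ugly internal address it likes, without the user ever seeing it.

```python
from urllib.parse import urlencode

# Hypothetical table: public path -> parameters the internal script wants.
PUBLIC_TO_INTERNAL = {
    "/downloads/MSN_Messenger_7.5/Windows": {
        "FamilyID": "0b88ccbf-4c52-4347-aa71-87184a13ac1c",
        "DisplayLang": "en",
    },
}


def internal_target(public_path):
    """Return the internal address behind a public path, or None if unknown."""
    params = PUBLIC_TO_INTERNAL.get(public_path)
    if params is None:
        return None
    return "/downloads/details.aspx?" + urlencode(params)


print(internal_target("/downloads/MSN_Messenger_7.5/Windows"))
# /downloads/details.aspx?FamilyID=0b88ccbf-4c52-4347-aa71-87184a13ac1c&DisplayLang=en
```

If the script behind the page is ever replaced, only the table changes. The public URL stays put.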

Some of the worst URLs in the 1990s were also some of the most common: GeoCities URLs. They have since scrapped this system, having realized how silly it was, but originally their URLs looked something like this (to pick a random example):

http://www.geocities.com/Area51/Zone/6338/

“Area51” designated a science-fiction-themed website, but the name isn’t very informative. “Zone” is a meaningless subclassification, as if to designate a neighborhood, and the number 6338, which specifies the name of that particular website as opposed to all the others under “Area51/Zone”, has no meaning at all. None of this indicates in the least that this is the personal homepage of a man named George Seto, or what one might find there. On the other hand, an alternative URL for the same page is:

http://www.geocities.com/george_seto.geo/

which is perfect except for the .geo part, since it’s George Seto’s website. (It wouldn’t be a good name if the webpage wasn’t actually about George Seto and just happened to be created by him, however.) Aside from that, the only way to improve the URL further would be to get a domain name so that the webpage could move in the future without breaking links, but this isn’t a big issue most of the time, especially for a personal page like George’s. (Update: GeoCities is now dead. Supposing George wanted to keep his website, I guess he’d have to move after all!) Perhaps a good compromise would be george-seto.geocities.com, since it’s more to the point that it’s a website of its own that happens to be hosted by GeoCities, rather than a part of the GeoCities website itself.

Filename extensions

Almost every website on the web uses names such as index.php, wiki.pl, about.html, and so on. When we see a name like this, we think nothing of it. Isn’t it the way everybody does it? But I’ve come to question this convention.

The problem is that it bundles interface and implementation. If you’re not a software engineer, you may be unfamiliar with this distinction, so let me explain. An interface is how something is presented to its user. The implementation is how it’s actually done by the computer. In the case of URLs, the interface is the URL itself, and the implementation is the filename on the disk system, as well as details such as what markup language that file was written in.

Now I’ll explain why they should be separate. Suppose you have a website written in pure HTML, so your URLs end in names like index.html, about.html, pizza.html, and so on. A few years later, you’ve learned some PHP and you decide, wow, PHP is really cool! You want to convert all your pages to PHP, but all your URLs have .html in them. You have a few options:

  1. Change all the .html files to .php, update the links manually, and let all the links from other websites break, to the frustration of users and other webmasters. This is probably the most common choice for amateur websites.
  2. The same as above, but create redirects from the .html files to the .php files. If you can use rewrite rules to take care of this for you, this isn’t so bad. But if you’re going to use a rewrite rule anyway, why not use it to drop the filename extension?
  3. Keep the .html extension even though the file is no longer pure HTML. This is a bad idea if you do it by simply processing all .html files with PHP, since that means you won’t be able to use any other technologies. If you use a rewrite rule so that the .html extension maps to a .php file, that’s not so bad, but again, if you’re using a rewrite rule anyway, why not drop the extension?
  4. Drop the filename extensions, preferably while creating redirects.
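The fourth option, dropping the extension while redirecting old links, can be sketched in a few lines of Python. This assumes your server lets you run some code (or an equivalent rewrite rule) on each request; the function name clean_path and the set of legacy extensions are my own choices for illustration.

```python
import posixpath

# Assumed set of extensions that used to appear in this site's URLs.
LEGACY_EXTENSIONS = {".html", ".htm", ".php"}


def clean_path(path):
    """Return the extensionless path to redirect to, or None if `path` is already clean."""
    root, ext = posixpath.splitext(path)
    if ext.lower() not in LEGACY_EXTENSIONS:
        return None
    return root


print(clean_path("/pizza.html"))  # /pizza
print(clean_path("/about.php"))   # /about
print(clean_path("/about"))       # None
```

When clean_path returns a path, the server would answer the old URL with a permanent redirect (HTTP status 301) to that path. When it returns None, the request is served as usual.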

Some people will argue that the .html extension makes sense no matter the underlying technology, because what the browser receives in the end is HTML. I still think that is an unimportant detail. For example, if you use XHTML, then the .html extension becomes technically incorrect, since XHTML is not HTML. (Some people use the same extension, but I feel that because HTML and XHTML use fundamentally different underlying technologies—SGML and XML, respectively—they should not be treated as though they were the same.) In the future, the page could even be written in something else entirely.

If you had only used the extension .html internally, and you always left it off your URLs, then there would be no problem. You just rename the .html files to .php, configure the server if necessary, and all links, external and internal, still work. The only time a link won’t work is if some guy had added .html to your URL when he shouldn’t have, and that’s his fault, not yours.

Furthermore, consider that the language your file was written in is irrelevant. Whether it’s written in HTML, SHTML, PHP, ASP, Perl, or something else entirely, the only thing the browser sees is pure HTML. The browser doesn’t care that your neat little webpage was written in PHP. The guy using the browser doesn’t care, either. What about people linking to the page? Nope, they just care that the URL works. So when you put .html or .php at the end of the filename in the URL, you’re putting something in the URL that really has nothing to do with what the user sees. That’s bad. Since that little bit that has nothing to do with what the user sees is also subject to change, that’s worse.

Configuring the server

Chances are your server isn’t configured to allow you to refer to something without its filename extension. Certainly, you should keep the filename extensions in the internal file structure. They’re useful there, because they tell you something you need to know: this file was written in PHP and not HTML. That’s not relevant to people browsing your site, but it’s relevant to the webmaster and to the server. So you need to configure the server to allow you to just hide the filename extension in the URL. Apache offers two ways of doing this: content negotiation (simply use “Options MultiViews” in your .htaccess or httpd.conf), and rewrite rules.
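To show the general idea of what the server has to do, here is a rough sketch in Python. It is not how Apache implements MultiViews or rewrite rules; it just illustrates looking for a file on disk whose name is the URL path plus one of a few known extensions. The function name resolve and the extension list are assumptions of mine.

```python
from pathlib import Path

# Assumed preference order when more than one file could match.
CANDIDATE_EXTENSIONS = [".php", ".html"]


def resolve(document_root, url_path):
    """Find the file on disk that backs an extensionless URL path, or None."""
    root = Path(document_root).resolve()
    base = (root / url_path.lstrip("/")).resolve()
    # Refuse the root itself and any path that escapes it (e.g. via "..").
    if root not in base.parents:
        return None
    for ext in CANDIDATE_EXTENSIONS:
        candidate = base.with_name(base.name + ext)
        if candidate.is_file():
            return candidate
    return None
```

With a document root containing pizza.php, a request for /pizza would resolve to that file. If you later rewrite the page as pizza.html, the same URL keeps working without any change to your links.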

What if you can’t?

I understand that you may have your webspace on a server that you can’t configure this way. In that case, I guess you have to either live with the “evil” filename extensions (not the end of the world, as well over 99% of the web does it), or find a server that allows it. For a small personal page on a service such as GeoCities, you’re probably not going to care about whether or not your URLs are always valid, so don’t worry. But for more serious webpages, I actually recommend the server switch—not because of the filename extension issue, but because it’s nice to be on a server that lets you configure it any way (or almost any way) you need it.

It’s your website anyway

I’m not out on a crusade to get everybody to use clean URLs. I realize I don’t run the web, and it’s probably a good thing that I don’t. If you’re running a website, what you do with it is generally your business. I am not a “URL nazi.” But I do hope that I have convinced you that clean URLs are good, whether or not you’ll actually use them.