Mojibake: Preventing Alien Characters
What’s mojibake? What are “alien characters”? As you probably guessed, “mojibake” is a Japanese word, and it means “changed characters”. For example, here’s the word written in Japanese: 文字化け. Chances are that you saw Japanese characters, question marks, or boxes. You’ll see the first if you have a Japanese font installed, and one of the other two if you don’t. But if there were a mojibake problem, it would look more like æ–‡å—åŒ–ã‘, or other such gibberish, no matter what fonts you have installed. This was, and sometimes still is, a particularly common problem for speakers of some languages, such as Japanese. Some Japanese people joke that mojibake might be messages from the dead or from aliens. Mojibake certainly looks alien to human eyes! Oh…and by the way, if your website only uses English, don’t think for a minute that this stuff doesn’t apply to you. It does. We’ll get to that in a second.
Mojibake is sometimes the fault of the web browser (if the web browser is very old), but it’s almost always the fault of the webmaster, who failed to specify an encoding for the webpage. The only time no encoding is necessary is when you use pure ASCII: only unaccented letters, numbers, spaces, and basic punctuation. Anything like accent marks or even curly quotes is strictly off-limits unless you use markup like &eacute; to write é, which is a valid way of doing things, of course. But even if you write only in English and normally stick to ASCII, you might end up copying and pasting something from another webpage that uses “café” instead of “caf&eacute;”, or something that uses curly quotes, and you may end up with the problem. What’s worse, it might look fine on your computer and not somebody else’s, so you won’t even be able to spot the problem by previewing the webpage. So how do you prevent this problem?
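Before getting to prevention, it may help to see how the gibberish comes about. Here’s a minimal Python sketch, assuming the text was saved as UTF-8 and then read as Windows-1252 (or the other way around with ISO-8859-1). The function name misread is made up for illustration; the codec names are Python’s own.

```python
# Sketch: text saved in one encoding but read back as another.

def misread(text, saved_as, read_as):
    """Encode text with one encoding, then decode the bytes with another."""
    raw = text.encode(saved_as)
    # errors="replace" substitutes U+FFFD for bytes the wrong encoding cannot map
    return raw.decode(read_as, errors="replace")

# UTF-8 bytes read as Windows-1252: each non-ASCII character turns into several.
print(misread("café", "utf-8", "cp1252"))      # cafÃ©
print(misread("it’s", "utf-8", "cp1252"))      # itâ€™s

# ISO-8859-1 bytes read as UTF-8: the lone accented byte is invalid.
print(repr(misread("café", "iso-8859-1", "utf-8")))  # 'caf\ufffd'

# Pure ASCII comes through unchanged, which is why it never shows the problem.
print(misread("cafe", "utf-8", "cp1252"))      # cafe
```

Note that nothing is wrong with the bytes themselves; the damage comes entirely from guessing the wrong encoding when reading them.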
The HTTP header is an important but often-overlooked method, so I’m going to cover it first. A lot of people think they can’t do anything about this. If you’re one of them, keep reading: you might be wrong! Simply put, the webserver is supposed to specify the encoding in the HTTP header, for example:
Content-Type: text/html; charset=ISO-8859-1
(If you don’t know what encoding your page uses, and you’re in the Americas or Western Europe, it’s probably ISO-8859-1.) First, your server might already be configured to do this correctly; just check the headers to see if an encoding is specified. If you don’t know how, just go to a website like web-sniffer.net, type in the URL, and look at the response headers. You’re looking for a line that says something like
Content-Type: text/html; charset=ISO-8859-1
If it’s there, including the charset (and the charset is the one you want), great! Skip to the next section. If it just says “text/html” without the charset part, or there’s no Content-Type at all, it’s not set up and you have to do it yourself.
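If you’d rather check the header from a script than from a website, here’s a rough Python sketch using only the standard library. The function names are made up for illustration, and fetch_declared_charset assumes the server answers HEAD requests, which not every server does.

```python
from email.message import Message
from urllib.request import Request, urlopen

def charset_from_content_type(value):
    """Return the charset named in a Content-Type value, lowercased, or None."""
    if not value:
        return None  # no Content-Type header at all
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()

def fetch_declared_charset(url):
    """Send a HEAD request and return the charset the server declares, if any."""
    with urlopen(Request(url, method="HEAD")) as response:
        return charset_from_content_type(response.headers.get("Content-Type"))

print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # None
```

A result of None from fetch_declared_charset corresponds to the “not set up” case: either the charset part or the whole Content-Type header is missing.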
If all your webpages have the same encoding (they probably do, or you’d have had to deal with this much sooner), then simply configure the server to serve this by default. What? You don’t have access to the server config file? Well, that’s not always a problem! Many web hosts that run the Apache server let you use .htaccess files to specify configuration settings. If your host doesn’t, I’d recommend that you switch to one that does, because they’re very useful. If you don’t know if you can use .htaccess files, it’s pretty easy to find out. Simply create a file named .htaccess (on Windows, you may have to rename it to “.htaccess” at the command line, since Windows Explorer may refuse filenames starting with a period), add the line
AddDefaultCharset ISO-8859-1
(or whatever encoding you use), and put it in your document root, then check the HTTP headers again. If the header is still wrong, just delete the .htaccess file; it’s not gonna work. If it’s right, skip to the next section.
If that doesn’t work and you use a language such as PHP, you may be able to set the header there. Simply add the line
header("Content-Type: text/html; charset=ISO-8859-1");
(this code must be executed before outputting any text), do a quick sanity-check on it in your web browser to make sure the page still works without errors, then check the headers again. If the header is still wrong, well, nothing’s gonna make it work, sorry.
If none of this works and your headers have no encoding specified, proceed to the next section. If they specify an encoding other than the one you want, you’re going to have to give up and use the server’s encoding for your pages (or stick to ASCII). Proceed to the next section, using the same encoding for the meta tag that the HTTP header has.
Even if you pass the encoding in the HTTP header, you need the correct meta tag in your code. (You definitely need it if you didn’t pass the encoding in the header!) At the top of the <head> of your document, put this:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
(Insert a space and a slash before the closing > if you’re using XHTML.) If you send the correct HTTP headers, the browser will skip over this meta tag, but it still needs to be there because the user may save the document to the hard drive. When the user opens the file again, guess what? There are no HTTP headers at all.
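Since the meta tag and the HTTP header should name the same encoding, it can be handy to compare them automatically. The following is a minimal Python sketch, not a full HTML encoding sniffer: it only looks for the http-equiv form of the tag shown above, and every name in it (charset_of, MetaCharsetFinder, meta_charset, declarations_agree, PAGE) is made up for illustration.

```python
from email.message import Message
from html.parser import HTMLParser

def charset_of(content_type):
    """Return the charset in a Content-Type value, lowercased, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

class MetaCharsetFinder(HTMLParser):
    """Remember the charset of the first <meta http-equiv="Content-Type"> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Also reached for XHTML-style <meta ... /> via handle_startendtag.
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            self.charset = charset_of(attributes.get("content"))

def meta_charset(html):
    finder = MetaCharsetFinder()
    finder.feed(html)
    return finder.charset

def declarations_agree(header_value, html):
    """True if the page has a meta charset and no HTTP header contradicts it."""
    header = charset_of(header_value)
    meta = meta_charset(html)
    return meta is not None and header in (None, meta)

PAGE = """<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Example</title></head><body>caf&eacute;</body></html>"""

print(meta_charset(PAGE))                                               # iso-8859-1
print(declarations_agree("text/html; charset=ISO-8859-1", PAGE))        # True
print(declarations_agree("text/html; charset=UTF-8", PAGE))             # False
```

A False result is the situation to fix first: the header and the meta tag disagree, so the page may look different when served from the site than when opened from a saved copy.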