ISO-8859-1, UTF-8, Character Sets, Encoding, Movable Type and You

During normal use of Movable Type you probably never have to worry about character sets or encoding.  But when you migrate your installation to a new server, or add a Japanese, French or German guest blogger, things can get tricky.  It is then worth knowing a thing or two about strange characters, accents, ümlauts and all the interesting ways in which they can fail to display.
Character sets and encodings
First, a little explanation about character sets and encodings.  As you may or may not know, computers represent almost all data as numbers.  This text you are reading?  Nothing more than a bunch of numbers in a specific order sitting on some webserver somewhere.  So you might be wondering why you are seeing a bunch of letters right now.

This is where 'encoding' comes in.  An encoding system is basically a table mapping numbers to letters.  A primitive cypher like a = 1, b = 2, c = 3, and so on... can be regarded as an encoding system, even though it is very simple and limited to just a few characters.
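
To make the idea of a lookup table concrete, here is a minimal Python sketch of the toy cypher described above.  The names `TOY_TABLE`, `toy_encode` and `toy_decode` are made up for this illustration, and the table only covers the lowercase letters a to z.

```python
import string

# Toy encoding table from the text: a = 1, b = 2, c = 3, ...
TOY_TABLE = {letter: number
             for number, letter in enumerate(string.ascii_lowercase, start=1)}
# The same table, read in the other direction (number -> letter).
REVERSE_TABLE = {number: letter for letter, number in TOY_TABLE.items()}


def toy_encode(text):
    """Turn a string of lowercase letters into a list of numbers."""
    return [TOY_TABLE[ch] for ch in text]


def toy_decode(numbers):
    """Turn a list of numbers back into text using the same table."""
    return "".join(REVERSE_TABLE[n] for n in numbers)


print(toy_encode("cab"))       # [3, 1, 2]
print(toy_decode([3, 1, 2]))   # cab
```

Real encodings work on the same principle, only with much bigger tables.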

For various historic reasons, a number of different encoding systems have been used over the years to represent the various characters that make up the languages of the world so they could be displayed by computers.  Some of them are fairly well known, and their names may sound familiar: ASCII, ISO-8859-1 (also known as Latin-1), UTF-8...

For the same historic reasons most of these systems encode the basic letters and punctuation marks of the Roman alphabet in exactly the same way.  So if you are a computer and you want to 'translate' a string of numbers into text, no matter which of these encodings you use the result will look fairly readable as long as the text in question only contains basic Roman letters, numbers and punctuation marks.
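
The overlap is easy to check.  The short Python sketch below (the helper name `encoded_forms` is an invention for this example) encodes the same plain text with three encodings and prints the resulting numbers, which should be identical as long as the text sticks to basic letters and punctuation.

```python
ENCODINGS = ["ascii", "iso-8859-1", "utf-8"]


def encoded_forms(text, encodings=ENCODINGS):
    """Return {encoding name: list of byte values} for the given text."""
    return {name: list(text.encode(name)) for name in encodings}


forms = encoded_forms("Hello!")
for name, numbers in forms.items():
    # Every encoding yields the same numbers for this plain text.
    print(name, numbers)
```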

Where it can go horribly wrong
If a text contains more than just these characters, things get interesting.  Almost all the other characters are represented by different numbers in the different encoding systems.  So if you use the wrong system to decipher a string of numbers you will end up with the wrong characters on your screen.  Or you might even run into numbers that are not in your encoding table at all, leaving you unsure about what to display.

It is in such cases that your computer starts acting funny: either random-looking garbage characters are displayed in your text wherever you were expecting accented characters, ümlauts and other assorted special signs, or you see little squares, question marks or other symbols indicating the computer is not sure what to display there.
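
Both failure modes can be reproduced in a few lines of Python.  This is only an illustration of the mechanism, using a sample word chosen for this example; it is not taken from Movable Type itself.

```python
original = "ümlaut"

# The same character becomes different numbers in different encodings.
utf8_bytes = original.encode("utf-8")          # ü -> two bytes: 0xC3 0xBC
latin1_bytes = original.encode("iso-8859-1")   # ü -> one byte: 0xFC

# Case 1: UTF-8 data read with the ISO-8859-1 table gives garbage characters.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)    # Ã¼mlaut

# Case 2: ISO-8859-1 data read as UTF-8 hits a number that is not valid there.
# errors="replace" substitutes U+FFFD, the "unknown character" symbol.
unknown = latin1_bytes.decode("utf-8", errors="replace")
print(unknown)    # the first character is shown as a replacement symbol
```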

Encoding and Movable Type

By default, Movable Type, which is written in the Perl scripting language, uses the UTF-8 character set for everything.  UTF-8 can be used to represent almost any written language that exists, so it works for all kinds of scripts and languages.

There are a number of places in a Movable Type-based system where a different encoding can be configured, leading to interesting results if there are mismatches:
  • The default character set reported by the webserver
  • The publishing character set in Movable Type's configuration file
  • The character set of the database
Webserver character set
First, a little side trip to explain about webservers: webservers basically are programs running on big computers (servers) that deal with requests from programs on other computers (i.e. browsers).  A browser will ask: "can I have file X please?" and the server will send it so the browser can display it.  But the server will precede its reply with a number of 'headers' containing information about the file that is about to be transmitted: how big is it, what type of file is it... and what encoding is it in.

For CGI scripts running on the server the story is a little bit different but similar: here it is the script that generates the headers first, which the webserver then passes on.  The administrative interface of Movable Type is such a script, and this is why it sends out its own headers (including encoding information) when you request it with your browser.
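
As a rough sketch of what such a script has to produce, the Python function below builds CGI-style output: a header line, a blank line, and then the page itself.  The function name `cgi_response` and the sample page are invented for this example, and this is not how Movable Type itself is implemented.  The point is that the charset named in the header and the encoding used for the body have to be the same.

```python
def cgi_response(html, charset="utf-8"):
    """Build CGI output: header line, a blank line, then the encoded body."""
    header = f"Content-Type: text/html; charset={charset}\r\n\r\n"
    # The body is encoded with the same charset that the header announces.
    return header.encode("ascii") + html.encode(charset)


response = cgi_response("<p>Grüße</p>", charset="utf-8")
print(response)
```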

The browser will then use the proper encoding table matching the encoding reported in the header to make sure the right characters end up on the screen.  At least, that is the theory.  Sometimes the webserver is set up to always report one particular encoding for every file whose encoding it doesn't know, while in reality the files might very well be encoded differently.
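
The encoding usually travels in the Content-Type header.  Here is a small Python sketch of reading it back out, using the standard library's `email.message.Message` class to do the parsing; the helper name `charset_from_content_type` is made up for this example.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter in lowercase,
    # or None when the header does not name one.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # None
```

When no charset is reported at all, the browser has to guess, which is another way to end up with the wrong characters.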

Symptoms of this: the administrative interface of Movable Type looks fine, as do all the entries in the edit entry screen.  But the published versions contain weird characters when you view them in your browser.  

(Hint: most browsers have a 'Character Set' option in one of the menus.  The character set reported by the webserver for the page you are looking at will always be selected automatically in the list that appears when you open this menu.  But if you pick a different character set, your browser will interpret the page as if the encoding had been reported differently.  If you can make the page look OK by picking another character set, it means that you found the 'real' character set.)
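
The same experiment can be done outside the browser.  The sketch below decodes one set of raw bytes with a couple of candidate encodings so you can compare the results; the function name `decode_candidates` and the sample bytes are assumptions made for this example.

```python
def decode_candidates(raw_bytes, candidates=("utf-8", "iso-8859-1")):
    """Decode the same bytes with each candidate; None marks a failed decode."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw_bytes.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Bytes as they might sit in a published file (here: "café" in ISO-8859-1).
sample = b"caf\xe9"
for name, text in decode_candidates(sample).items():
    print(name, "->", text)
```

Keep in mind that ISO-8859-1 assigns a character to every possible byte, so decoding with it never fails outright.  You still have to look at the result and judge whether it reads correctly, just as with the browser menu.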

How to fix it: in the configuration file of your webserver, change the default character set that is being reported.  For the widely used Apache webserver, this means adding or changing the AddDefaultCharset option in the httpd.conf file, for example by adding this line:

AddDefaultCharset UTF-8

Note: this assumes you know what you are doing and that you have the proper rights to edit the webserver configuration.

Movable Type's PublishCharset configuration directive
If changing settings on your webserver is not possible, at least you can make Movable Type publish its output files in the character set that the webserver claims all files to be in.  Simply add a line to the mt-config.cgi file in the folder where Movable Type is installed.  The line should look something like this (with your encoding instead of iso-8859-1 if you need a different one):

PublishCharset iso-8859-1

What does this PublishCharset directive do?  It will make the Movable Type admin interface appear encoded in the specified character set, with the proper headers sent along so your browser knows how to display it.  More importantly, this also means all entries that you submit through this interface will now also have this encoding.  So this is how things will end up in the database and (hopefully) on the published pages.

Database
Of course, your database can also have settings for the encoding that is to be used.  If this doesn't match the data that is being sent to it, interesting things may also happen.  So make sure your database encoding is set to match the other two as well, just to be on the safe side.
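
Comparing the three settings by eye is error-prone because the same encoding goes by several names (ISO-8859-1, latin-1, latin1...).  Here is a hedged Python sketch that normalises the names with the standard `codecs.lookup` function before comparing them.  The function names and the sample values are made up, and it relies on Python's own alias table: a name Python does not recognise raises a LookupError, so names specific to your database may need translating by hand first.

```python
import codecs


def canonical_name(charset):
    """Map aliases such as 'latin-1', 'ISO-8859-1' or 'utf8' to one name."""
    return codecs.lookup(charset).name


def all_match(webserver, publish_charset, database):
    """True when the three settings name the same encoding."""
    names = {canonical_name(c) for c in (webserver, publish_charset, database)}
    return len(names) == 1


print(all_match("UTF-8", "utf-8", "utf8"))            # True
print(all_match("UTF-8", "iso-8859-1", "utf8"))       # False
print(all_match("ISO-8859-1", "latin-1", "latin1"))   # True
```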

In conclusion
  • Make sure the encoding specified as default by the webserver, the encoding set up in Movable Type and the encoding of the database all match.
  • Ideally, you should use UTF-8 as it is becoming more and more of a standard on the web.
  • Once everything matches, you can clean up any items that still look odd by using the search-and-replace function in Movable Type.
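
For the common case where UTF-8 text was misread as ISO-8859-1 (an 'ü' showing up as 'Ã¼', for instance), the damage can often be undone mechanically by reversing the wrong step.  The Python sketch below only illustrates the idea and is not a Movable Type feature; the function name `repair_mojibake` is made up, and it only handles this one kind of mix-up.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as ISO-8859-1.

    Returns the input unchanged when the round trip is not possible,
    which usually means the text was not damaged in this particular way.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake("Ã¼mlaut"))   # ümlaut
print(repair_mojibake("ümlaut"))    # ümlaut (already fine, left alone)
```

Try anything like this on a copy of your data first.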
