Wednesday, August 27, 2014

Unicode, UTF-8 and character encodings: What every developer should know

UPDATED 8/27/2014 to expand, clarify and expound certain concepts better

If you haven’t already read the excellent article by Joel Spolsky entitled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”, then you definitely should. However, I think it makes a lot more sense if you understand a few basics first. If you’re like me, you might not quite “get it” after reading through that article the first time or two; I certainly didn’t.

The Basics

For many years I was always confused by the whole mess. I didn’t understand exactly what Unicode is or what character encodings are all about or what the big deal was with UTF-8. Turns out it’s all pretty simple (sort of). I won’t dive into the history since Joel covers that pretty well in his article and I’m not nearly as qualified. Instead, I’ll just try to explain the basics as simply as I can.

First, it’s important to understand what a character set is. A character set is just a set of symbols (many of which you may recognize as letters and punctuation) and a code (number) that represents each of those symbols. You may be familiar with the ISO 8859-1 character set since it’s a pretty common ASCII based character set: its first 128 codes (0-127) map the same letters and symbols to the same numbers as ASCII, and the remaining codes up to 255 add characters used in Western European languages.

Unicode itself is just a character set - one that’s backward compatible with the first 128 common ASCII character codes (0-127), meaning it maps those same 128 symbols to the same numbers as every other ASCII based character set. Each character-to-number mapping is known as a code point. Unicode is maintained by the Unicode Consortium, whose Unicode Technical Committee meets quarterly since there are always new characters being added to the character set. The character set itself says nothing of how these code points will actually be represented in binary. There are a lot of ways of representing them.

A method of translating numeric code points into binary is known as a character encoding. You’ve probably seen things like U+00A3 or U+00E8 (the pound sign £ and the letter è, respectively). These are Unicode code points. A character encoding is a systematic way of representing these code points in memory, and the most widely used character encoding is called UTF-8. As Joel says in his article, it makes no sense to have a string without knowing its character encoding - you can’t really read it. How are you supposed to translate a set of bytes back to the code points they represent if you don’t know the method used to translate them to binary in the first place? You can try to guess the encoding in many cases, but that just makes life hard and it’s far from fool-proof.
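For example (a minimal sketch, assuming the PHP source file itself is saved as UTF-8), the same code point turns into different bytes under different encodings, which is exactly why the bytes alone aren’t enough:

$pound = "£"; // U+00A3; in a UTF-8 file this literal is the two bytes 0xC2 0xA3

echo bin2hex($pound), "\n";                                             // c2a3 - the UTF-8 encoding
echo bin2hex(mb_convert_encoding($pound, 'ISO-8859-1', 'UTF-8')), "\n"; // a3 - the ISO 8859-1 encoding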

Understanding all of this comes into play as a developer in two primary ways:

  1. You need to make sure that when you are sending strings around you are also sending the character encoding. As a web developer this can be done in the headers when a document is served, but also with a <meta> tag in the <head> of your HTML document (see the sketch after this list). This allows odd characters to be displayed correctly by the browser.

  2. If you are working in a language that isn’t aware of multi-byte characters (I’m looking at you, PHP!), then you can run into a whole slew of problems trying to work with strings: everything from referencing a character in a string (in PHP $str[3] represents the 4th byte in the string, not the 4th character), to simply determining string length (echo strlen("ÿ"); will output 2 in a UTF-8 encoded script, since it takes 2 bytes to represent ÿ, a.k.a. U+00FF).
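To illustrate the first point, here is a minimal sketch of declaring UTF-8 for a page served by PHP. The header has to be sent before any other output; the <meta> tag acts as a fallback declaration in the markup itself:

// Declare the document's character encoding, assuming UTF-8 throughout.
header('Content-Type: text/html; charset=utf-8');

echo '<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Encoding example</title>
  </head>
  <body>Characters like £, è and ÿ will now display correctly.</body>
</html>';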

This isn’t something you can afford to just ignore. Whenever you’re dealing with user input in PHP you cannot assume that one character == one byte. Every time you try to measure or manipulate a string you need to use the mbstring functions or something like the Portable UTF-8 library for PHP; otherwise you run the risk of breaking characters in your strings which leads to all sorts of weird errors (like when trying to json encode, for example) and funny characters.
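Here is a quick sketch of the difference (again assuming the source file is saved as UTF-8), using PHP’s byte-oriented functions next to their mbstring counterparts:

$str = "naïve"; // the ï (U+00EF) takes two bytes in UTF-8

echo strlen($str), "\n";                    // 6 - counts bytes, not characters
echo mb_strlen($str, 'UTF-8'), "\n";        // 5 - counts characters

echo $str[3], "\n";                         // a lone continuation byte, not a character
echo mb_substr($str, 3, 1, 'UTF-8'), "\n";  // "v" - the actual 4th character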

UTF-8 Encoding

I think it’s important to understand how UTF-8 encoding actually works because it will give you a better idea of what character encoding actually means. UTF-8 is the most widely used character encoding, and I’ll even say the best one you could choose for your website or software in general (unless you’re dealing with a legacy system that is already built to communicate with a different character encoding). I say this because:

  1. UTF-8 can represent every Unicode code point - every number in the full set of characters the character set currently comprises.
  2. It is compatible with ASCII (e.g. the ASCII representation and the UTF-8 representation for all code points from 0-127 are identical - ASCII is valid UTF-8).
  3. It’s pretty conservative on space compared to some of its sibling character encodings. There is no fixed character length and no minimum length greater than one byte: a UTF-8 encoded character can be anywhere from 1-4 bytes, depending on the character. In normal English text the space required will usually be only 1-2 bytes per character.
  4. UTF-8 does not require a byte order mark, since its byte order is unambiguous (a UTF-8 BOM exists but is unnecessary and generally discouraged; see http://en.wikipedia.org/wiki/Byte_order_mark and http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark)

To truly understand how UTF-8 encoding works, I have found no better explanation than this table found on Wikipedia and the explanation following it. It does an excellent job. I would summarize the encoding methodology as follows:

  1. For single byte (mostly ASCII) characters, the high-order bit is always 0 for backward compatibility with ASCII.
  2. For multi-byte characters, the first byte begins with two to four 1’s (indicating the total number of bytes the character will use), followed by a 0.
  3. The rest of the bits in the first and following (continuation) bytes are filled with the bits representing the code point, except that each of the continuation bytes begin with 10.

So for the example above (U+00A3), we first take the decimal value of hex A3, which is 163, and figure out what that is in binary, which happens to be 10100011. You’ll notice that it takes 8 binary digits to represent. Since the first byte of a single-byte UTF-8 character must begin with 0 for ASCII compatibility, those 8 bits won’t fit in one byte, so we are going to need two bytes to represent it. Since it’s now a multi-byte character, the first byte will begin with two 1’s to indicate how many bytes it will use, then a 0, followed by as many of the bits of the code point as we can fit.

However, 5 of the 16 bits in those two bytes are used by the encoding format itself, which leaves 11 bits to fill with the code point, so we pad it with leading 0’s (00010100011). The first byte therefore looks like:

11000010

The next byte is a continuation byte which will always begin with 10 followed by as many of the bits of the code point as will fit. So the second byte is:

10100011

If you’re still confused, this example might help. Armed with this knowledge you can now easily take any Unicode code point and write out the UTF-8 binary encoding of the character. Easy as that!
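If you’d rather see those rules written out in code, here is a minimal sketch of the same methodology in PHP. It’s purely for illustration (in practice mb_convert_encoding() and friends handle this for you), and the function name is just something I made up for this example:

// Encode a single Unicode code point into its UTF-8 byte sequence,
// following the 1-4 byte patterns described above.
function encodeCodePointToUtf8($codePoint) {
    if($codePoint <= 0x7F) {
        // Single byte: high-order bit is 0, identical to ASCII
        return chr($codePoint);
    } elseif($codePoint <= 0x7FF) {
        // Two bytes: 110xxxxx 10xxxxxx (11 payload bits)
        return chr(0xC0 | ($codePoint >> 6))
             . chr(0x80 | ($codePoint & 0x3F));
    } elseif($codePoint <= 0xFFFF) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx (16 payload bits)
        return chr(0xE0 | ($codePoint >> 12))
             . chr(0x80 | (($codePoint >> 6) & 0x3F))
             . chr(0x80 | ($codePoint & 0x3F));
    }

    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 payload bits)
    return chr(0xF0 | ($codePoint >> 18))
         . chr(0x80 | (($codePoint >> 12) & 0x3F))
         . chr(0x80 | (($codePoint >> 6) & 0x3F))
         . chr(0x80 | ($codePoint & 0x3F));
}

echo bin2hex(encodeCodePointToUtf8(0xA3)); // c2a3 - the two bytes we just worked out by hand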

Common Problems

This is based solely on my own experience, but I just want to cover a couple common problems I’ve run into and how to fix them. In PHP my general recommendation is to use the excellent forceutf8 library written by Sebastián Grignoli, and just pass all your arrays or strings through its toUTF8() function.

Double encoded strings

You can get some pretty weird characters showing up, even when you set your page character encoding correctly and think you’ve encoded all your strings correctly, if you are double encoding your strings. I’ve seen this a few times particularly when developing JSON APIs. If some strings are already UTF-8 encoded when you build your result set and then you end up (re)encoding the whole thing before sending, you could end up with some double encoded strings.
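Here is a minimal sketch of how that happens. The string is already UTF-8, but it gets treated as ISO-8859-1 and “converted” to UTF-8 a second time (assuming, as before, a UTF-8 encoded source file):

$name = "é";                                                 // UTF-8 bytes 0xC3 0xA9
$double = mb_convert_encoding($name, 'UTF-8', 'ISO-8859-1'); // wrong: it was already UTF-8

echo $name, "\n";            // é
echo $double, "\n";          // Ã© - each original byte was treated as its own character
echo bin2hex($double), "\n"; // c383c2a9 - four bytes where there should be two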

To fix this, the solution is simply to check each string within your potentially nested result data individually and see if it’s already encoded properly, and only encode the strings that aren’t. My first recommendation is to just pass your string or array to the toUTF8() function in that forceutf8 library I mentioned above. Otherwise, I wrote this up as a simple solution before I ever found that library:

function utf8EncodeArray($array) {
    // Accept a bare string too, by wrapping it in an array
    if(is_string($array)) {
        $array = array($array);
    }

    foreach($array as $key => $value) { 
        // Cast objects to arrays so their properties get checked as well
        if(is_object($value)) { 
            $value = (array) $value;
        }

        if(is_array($value)) { 
            // Recurse into nested arrays
            $array[$key] = utf8EncodeArray($value);
        } elseif(is_string($value) && ($encoding = mb_detect_encoding($value)) != 'UTF-8' && $encoding != 'ASCII') { 
            // Only convert strings that aren't already UTF-8 (or plain ASCII, which is valid UTF-8)
            $array[$key] = mb_convert_encoding($value, 'UTF-8', $encoding);
        }
    }

    return $array;
}
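Hypothetical usage, assuming $result is the (possibly nested) data you’re about to return from your API:

$result = utf8EncodeArray($result);
echo json_encode($result); // safe to encode now that every string is valid UTF-8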

Of course, if you’re already dealing with a string that’s been double (or more) encoded then it needs to be fixed. In that case, the forceutf8 library I mentioned above has a handy little fixUTF8() function you can call that will try and repair the string.

Broken encoding

Another problem I’ve run into is where UTF-8 encoded characters get truncated because of PHP’s naive handling of strings with multi-byte characters. This first became a major problem for me when trying to json_encode() a string with a broken UTF-8 character. It just completely fails. In this case you have to find the character and then either remove it or try to fix it. In reality you probably won’t be able to figure out what the character was supposed to be so you can either replace it with a placeholder character (question mark?) or just add something valid for the truncated byte(s). Once again though, that magical forceutf8 library will fix it for you if you don’t care too much and just want to make sure your broken strings aren’t breaking things.
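Here is a minimal sketch of that failure mode (on PHP 5.5 and later, where json_last_error_msg() is available):

// The string ends with 0xC3, the first byte of a two-byte UTF-8 sequence
// whose continuation byte has been truncated.
$broken = "caf" . "\xC3";         // should have been "café" (é = 0xC3 0xA9)

var_dump(json_encode($broken));   // bool(false)
echo json_last_error_msg(), "\n"; // typically "Malformed UTF-8 characters, possibly incorrectly encoded"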

Conclusion

I hope this was able to fill in some of the gaps for those who may have been a little lost. In the end, it’s not that hard of a concept to understand. It makes a big difference when you actually understand character encodings and you really shouldn’t write code without a basic understanding. Now go forth and write better code!
