Unicode and Character Sets
I have a lot of developer friends who are still confused about the idea of character sets. The internet is a global phenomenon; in today’s world, every developer must understand character sets if they are to create applications that work around the world.
What are character sets?
Let’s start with the basics. Every character of a string is stored as some binary number in your computer. A character set is a map from those numbers to the characters you actually recognize. For example, in the ASCII character set, the code for ‘a’ is 97.
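If you want to see this mapping for yourself, here is a minimal Python sketch. The variable names are just for illustration; `ord` and `chr` are Python built-ins that convert between a character and its number.

```python
# Look up the number behind a character, and the character behind a number.
code_for_a = ord("a")    # character -> number
char_for_97 = chr(97)    # number -> character

# The raw bytes of an ASCII string are just those numbers in sequence.
ascii_codes = list("Hi!".encode("ascii"))

print(code_for_a, char_for_97, ascii_codes)
```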
Back in the day there was ASCII (well there were more before that, but let’s not go too far back in history). ASCII used numbers from 0-127 to map certain control characters (like newlines) and English letters and punctuation (a-z, 0-9, quotes etc). If you spoke English, then life was good.
But if you spoke a different language that used different characters (for example, accented characters), you were pretty much screwed.
To solve this problem, engineers in different parts of the world started to come up with different character sets. That is, they created character sets so they could represent glyphs relevant to their language.
In a pre-internet era, this wasn’t all too bad. But once people started sharing documents, the problem became all too clear. An American reading a report drafted in Russia would see a page of garbled text because the character set used on his American-bought computer was completely different from the character set used on the Russian-bought computer.
Nearly every common character set agrees on the characters from 0-127 (thankfully!). But the characters assigned to codes 128 and up differ greatly. For example, on some early PCs the code 219 represented an opaque block (used to create boxes and such in old apps). But on other computers that code represented a U with a circumflex instead.
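The code 219 example is easy to reproduce. The sketch below assumes that code page 437 stands in for the "early PC" character set and that Latin-1 (ISO-8859-1) stands in for the "other computers"; those two choices are mine, picked because Python ships codecs for both.

```python
raw = bytes([219])  # a single byte with the value 219

# Assumption: cp437 is the early PC character set, latin-1 is the other one.
as_cp437 = raw.decode("cp437")      # the solid block character
as_latin1 = raw.decode("latin-1")   # U with a circumflex

print(repr(as_cp437), repr(as_latin1))
```

Same byte, two different characters. Nothing in the byte itself tells you which reading is the right one.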
Further complicating the problem, the size of a code was sometimes different. Some Asian writing systems have thousands of characters. That means there are too many codes to fit in the usual 1 byte (0-255) that most computers were using. To overcome this problem, special character sets were created in which some characters took 1 byte, and other characters took 2 bytes.
This might not seem so bad, but think about it for a second. Even the most common tasks, like finding the length of a string, now become more difficult. You can’t simply count the number of bytes the string occupies because that would give you an erroneous number (you can’t know in advance how many characters take 1 byte and how many take 2). And even something like iterating through a string a letter at a time is no longer a simple matter of moving a memory pointer up by 1 byte.
Unicode
Now that the problem is sufficiently explained, let’s get to the solution.
Unicode was created as the one character set to rule them all. In Unicode, every character from every writing system it covers is mapped to a unique number.
Unicode determines which number, called a code point, represents which character. The standard way to write a code point is ‘U+[hex]’, where the number is given in hexadecimal. For example, the letter ‘a’ (97 in decimal) is U+0061.
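Here is a quick sketch of the notation in Python. The helper name `code_point_label` is hypothetical, made up for this example. Keep in mind that the number after ‘U+’ is hexadecimal, so decimal 97 is written as 0061.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    # Hypothetical helper: at least four hex digits, upper case.
    return f"U+{ord(ch):04X}"

labels = [code_point_label(ch) for ch in ["a", "€", "😀"]]
print(labels)
```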
But here’s where it gets a bit tricky. The Unicode standard only defines which numbers map to which characters. The actual means of storing the number is still up in the air.
Most people think that Unicode is limited to a maximum of 16 bits. But this is untrue: code points run from U+0000 to U+10FFFF, which leaves room for over a million characters, far more than 16 bits can hold. Unicode itself doesn’t care about storage; it only cares about mapping code points to characters. So there is plenty of room for every language in use today, with space to spare.
Encodings
So if Unicode only cares about code points, how does a computer store the actual data? This is done with a specific encoding.
So far we’ve been using the term ‘character set’ to mean two things: which characters are in the set and how they are represented. When representing text in a computer there are really two components:
- A so-called character repertoire, which defines which characters can be represented. ASCII’s character repertoire says we can represent English letters, numbers and punctuation (amongst others). Unicode, on the other hand, aims to cover the characters of every writing system.
- A character encoding, which states how the computer actually stores these characters. It defines how a sequence of bits is converted into the actual characters. With ASCII, we know that a character is always stored in 1 byte, and that ‘a’ is 97 and ‘b’ is 98 etc. With Unicode, there are multiple encodings.
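The split between repertoire and encoding can be sketched in a few lines of Python. The sample character ‘é’ and the variable names are my own; the codec names are the ones Python's standard library uses.

```python
ch = "é"  # one character, one code point: U+00E9

# Same code point, three different encodings, three different byte sequences.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-le")   # little-endian, no byte order mark
utf32 = ch.encode("utf-32-le")

# The character is outside ASCII's repertoire, so ASCII cannot store it at all.
try:
    ch.encode("ascii")
    in_ascii_repertoire = True
except UnicodeEncodeError:
    in_ascii_repertoire = False

print(utf8, utf16, utf32, in_ascii_repertoire)
```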
The most popular encoding you hear about today is UTF-8. UTF-8 uses 1 to 4 bytes to store each code point. For code points 0-127, 1 byte is used. As the code points get larger, the number of bytes needed to represent them increases, up to a maximum of 4 bytes.
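You can watch the byte count grow with a quick sketch like this one. The four sample characters are my own picks, one for each UTF-8 length.

```python
samples = ["a", "é", "€", "😀"]

# Number of bytes UTF-8 needs for each sample character.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s)")
```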
Since Unicode defines code points 0-127 the same as ASCII (for example, ‘a’ is code point 97, just like ‘a’ in ASCII is 97), and UTF-8 encodes code points 0-127 as a single byte, UTF-8 is directly backwards compatible with ASCII. For most English systems, this makes converting to UTF-8 very easy. For example, if you have a bunch of English web pages that only use ASCII characters, you could switch the character encoding header on your web server to ‘UTF-8’ without any other work. If your systems contain different languages (and thus, non-ASCII characters), then converting to UTF-8 will require other tools.
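The backwards compatibility is easy to check. This is just a sketch with a made-up sample string: pure ASCII text produces exactly the same bytes under both encodings.

```python
text = "Hello, world!"  # sample text containing only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

same_bytes = ascii_bytes == utf8_bytes

# Bytes written as ASCII can be read back as UTF-8 unchanged.
round_trip = ascii_bytes.decode("utf-8")

print(same_bytes, round_trip)
```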
When are character sets important?
In short: ALL THE TIME
There is no such thing as text without a character set. There always needs to be a way for a computer to convert the raw bits and bytes in a file to characters that humans can understand. Most of the time when you read about “plain text”, it means a file that only uses ASCII codes 0-127.
How a program decides which character set to use when decoding a file varies. For example, some browsers might guess by looking for common patterns of bytes. But most web sites these days will specify the character set right in the response headers, so a browser doesn’t need to guess.
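As a rough sketch of what reading the character set out of a response header looks like, the snippet below parses a Content-Type value with Python's standard `email.message.Message` class. The function name `charset_from_content_type` and the fallback to ‘utf-8’ are my own assumptions for illustration, not a description of how any particular browser behaves.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Pull the charset parameter out of a Content-Type header value."""
    # Hypothetical helper; the default is an assumption, not a standard rule.
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None if absent.
    return msg.get_content_charset() or default

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```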
If you have ever opened up a web page or a file and seen a bunch of ???’s, then it means the application is trying to decode the text data using an incorrect character set.
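Here is a minimal sketch of how that garbage comes about. The sample word and the two "wrong" character sets are my own choices: one wrong decoder silently produces the wrong characters, the other gives up and substitutes a replacement character for every byte it cannot interpret.

```python
original = "café"
data = original.encode("utf-8")   # the bytes as they were actually written

# Wrong guess #1: Latin-1 accepts any byte, so you get wrong characters.
as_latin1 = data.decode("latin-1")

# Wrong guess #2: ASCII cannot interpret the last two bytes, so each one
# is swapped for the replacement character U+FFFD.
as_ascii = data.decode("ascii", errors="replace")

# Right guess: the text comes back intact.
as_utf8 = data.decode("utf-8")

print(as_latin1, as_ascii, as_utf8)
```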
Conclusion
I hope you learned something about character sets today. In an upcoming article I will write about the role character sets and Unicode play in our applications, specifically with PHP.