PHP and Unicode with UTF-8

Now that I've already covered what Unicode is in another post, it's time to talk about actually using it. Today I'll talk about how to create PHP applications that correctly handle Unicode.

Why use Unicode

I covered this in my previous article on Unicode, but let's recap. The reason to use Unicode is so you can create multilingual applications.

Without Unicode you basically have two ways to serve a multilingual site. I'll briefly explain each here.

Encoded HTML Entities

The first way is to store all data in one character set (for example, ISO-8859-1 for English) but encode all other foreign characters as HTML entities. Most web developers have probably done this at one time or another -- for example, using &copy; for the copyright symbol. But you can also encode any character this way by specifying the Unicode code point like so: &#960; for π.
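To make the entity approach concrete, here's a rough sketch of how text could be turned into numeric entities in PHP. The helper name sketch_entity_encode() is made up for this example, and it assumes valid UTF-8 input, the iconv extension, and PHP 7 or later (for the "\u{...}" escapes). If mbstring is installed, mb_encode_numericentity() can do the same job.

```php
<?php
// Hypothetical helper: replace every non-ASCII character with &#NNN;.
// Assumes valid UTF-8 input and the iconv extension.
function sketch_entity_encode(string $utf8): string
{
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',               // with 'u', one match = one whole character
        function (array $match): string {
            // UTF-32BE stores the code point as a single big-endian 32-bit integer.
            $codePoint = unpack('N', iconv('UTF-8', 'UTF-32BE', $match[0]))[1];
            return '&#' . $codePoint . ';';
        },
        $utf8
    );
}

// U+03C0 is the Greek letter pi, U+00B2 is superscript two.
echo sketch_entity_encode("\u{03C0}r\u{00B2}"), "\n"; // &#960;r&#178;
```

The output is pure ASCII, which is why this trick works inside any single-byte character set.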

Common Uses:

Advantages

Disadvantages

Multiple Encodings

The second way is to store data in multiple encodings. Store English in ISO-8859-1, store Russian in ISO-8859-5 etc.

Common Use:

Advantages:

Disadvantages:

Using Unicode

Using Unicode suffers from none of the disadvantages described above. Indeed, the very reason Unicode was created was to overcome those problems.

The most commonly used encoding for Unicode is UTF-8. This variable-length encoding stores each Unicode character in 1 to 4 bytes. There are other encodings, like UTF-16 or the deprecated UTF-7 -- but we'll only be talking about UTF-8 today. In fact, UTF-8 will probably be all you ever need.
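To see the 1-to-4-byte range for yourself, here's a small sketch that prints how many bytes a few characters take up in UTF-8. The sample characters are my own picks, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Sample characters chosen for illustration; "\u{...}" escapes need PHP 7+.
$samples = [
    'A'         => "A",          // U+0041, plain ASCII
    'e-acute'   => "\u{00E9}",   // U+00E9
    'euro sign' => "\u{20AC}",   // U+20AC
    'emoji'     => "\u{1F600}",  // U+1F600
];

// strlen() counts bytes, so it reveals the encoded size of each character.
$byteLengths = array_map('strlen', $samples);

foreach ($byteLengths as $name => $bytes) {
    echo $name, ': ', $bytes, " byte(s)\n";
}
```

ASCII characters stay at one byte each, which is why plain English text looks identical in ASCII and UTF-8.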

Before we continue, I want to at least mention a few "problems" associated with using Unicode.

PHP and UTF-8

Declaring UTF-8

To start serving UTF-8 web pages, you need to send a content-type header. Web servers will automatically send a default header if you don't (which is why you may never have had to do this before). Most of the time the default header says the content is HTML and is encoded in ISO-8859-1. If you output HTML and your web pages are in English, this is almost always okay.

To start using Unicode, however, you need to either change the default encoding in the server config or send the header yourself from within PHP:

    header('Content-Type: text/html; charset=utf-8');

Also note that your HTML might contain an 'http-equiv' meta tag that declares the encoding. You should make sure it says UTF-8 there, too:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

A browser will always take the HTTP header over the one in the meta tag, but it doesn't hurt to be consistent.

Handling UTF-8

It's all well and good to declare your pages as UTF-8, but that's useless if your data isn't actually encoded in UTF-8. So let's talk about how to work with UTF-8 strings.

First of all, let me mention that browsers will, by default, submit forms in the same encoding as the page. So since you declared your pages as being encoded in UTF-8, user data should arrive in UTF-8 too. The real work comes with using PHP to validate and manipulate user input.

The Problem

PHP is pretty much completely unaware of UTF-8 except in some specific circumstances. Most of the time, PHP treats all text as a sequence of 1-byte characters, as if every string were ASCII or some other single-byte encoding. This has a number of consequences due to the way UTF-8 uses a variable-length encoding scheme, and PHP's lack of functionality to properly handle it.

PHP's string functions are made to work on 1-byte characters. So counting the length of a string is as simple as checking how many bytes the string consumes, and getting a specific character by offset is as simple as returning the byte at that offset. But stick in a UTF-8 encoded string, and now you have single characters that take up 2 (or 3 or 4) bytes. The result: PHP's string functions don't see one 2-byte character, they see 2 separate 1-byte characters (which is, of course, incorrect).

The solution is to write your own string functions that know all about UTF-8. You need to think about which functions will require characters to be 1 byte -- those are the dangerous ones. So things like strlen() or strpos(). Others, like explode(), only search for a series of bytes (which can be UTF-8) and use it -- so they are okay.
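Here's a minimal sketch of the difference between a byte-based function and a UTF-8 aware one. The helper sketch_utf8_strlen() is a made-up name for this example (it isn't from any library), and the "\u{...}" escape assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: count characters rather than bytes.
function sketch_utf8_strlen(string $str): int
{
    // With the 'u' modifier, '.' matches one whole UTF-8 character;
    // the 's' modifier lets it match newlines as well.
    $count = preg_match_all('/./us', $str);
    if ($count === false) {
        throw new InvalidArgumentException('Invalid UTF-8');
    }
    return $count;
}

$word = "na\u{00EF}ve";                 // "naïve": the "ï" takes 2 bytes

echo strlen($word), "\n";               // 6, because strlen() counts bytes
echo sketch_utf8_strlen($word), "\n";   // 5, the number of characters

// explode() only looks for a byte sequence, so a UTF-8 delimiter is fine.
$parts = explode("\u{00EF}", $word);    // ['na', 've']
```

Note how explode() needs no special treatment: the delimiter is just a sequence of bytes, and a valid UTF-8 character can never match in the middle of another one.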

Here's a list of some unsafe UTF-8 functions:

Libraries and Extensions

You're pretty insane if you want to go ahead and re-write all of the string functions yourself! Instead, there are a few ways you can safely handle UTF-8 strings without all the headbanging.

I like to use the phputf8 library in all my projects. It's simple and clean and it works!
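As a rough illustration of what such a library or extension buys you, here's a sketch that takes a substring by characters instead of bytes. It uses mbstring's mb_substr() when the extension is loaded and otherwise falls back to a simple split-and-slice. The wrapper name sketch_utf8_substr() is my own invention, and the fallback assumes valid UTF-8 and non-negative offsets.

```php
<?php
// Hypothetical wrapper: take $length characters starting at character $start.
function sketch_utf8_substr(string $str, int $start, int $length): string
{
    if (function_exists('mb_substr')) {
        // mbstring works in characters once it is told the encoding.
        return mb_substr($str, $start, $length, 'UTF-8');
    }
    // Fallback without mbstring: split into characters, then slice.
    // Assumes valid UTF-8 and non-negative $start and $length.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_slice($chars, $start, $length));
}

$city  = "Z\u{00FC}rich";                    // "Zürich"
$bytes = substr($city, 0, 2);                // cuts the 2-byte "ü" in half
$chars = sketch_utf8_substr($city, 0, 2);    // "Zü", both characters intact

echo bin2hex($bytes), "\n";                  // 5ac3
echo bin2hex($chars), "\n";                  // 5ac3bc
```

The plain substr() result ends in a dangling lead byte, which is exactly the kind of malformed output the next section is about.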

Well Formedness

It is possible for UTF-8 input to be ill-formed. This is because of the variable-length encoding. To remove any ambiguity when parsing UTF-8, certain bytes and byte sequences are invalid. These invalid ranges are a design feature: they make it impossible for the computer to mistake a two-byte character for two separate one-byte characters (or vice versa).

You don't need to really understand what I just said since it has to do with how UTF-8 is represented on your computer (read the Wikipedia article if you want to know more). All you need to understand is that UTF-8 text can be malformed, and thus, it's something you need to validate.

So when you are accepting form input, you should always make sure it's valid or else there's a possibility of various attacks. For example, imagine a UTF-8 byte sequence is "AB", but an attacker modifies the request so only the first byte "A" is sent. Due to the way UTF-8 is encoded, your computer knows there should be a second byte after "A". So when you blindly output the single "A" into your HTML page, it's possible the next byte (whatever it may be) is "eaten up". Imagine if the next character is a double-quote intended to close off an HTML attribute or something like that. That opens up the possibility of XSS.

One very simple way to check for well formedness is to use preg_match() with the 'u' modifier:

    if (strlen($str) && !preg_match('/^.{1}/us', $str)) {
        die('Invalid UTF-8');
    }

Since any invalid UTF-8 string will result in the match failing, all you need to do is match a single character.

If you're using phputf8, you can use the utf8_compliant() or utf8_is_valid() functions. utf8_compliant() uses the method described above, but will also pass 5 and 6 byte sequences which aren't technically valid UTF-8 (5 and 6 bytes are not unsafe, however). utf8_is_valid() will make sure the string is actually valid UTF-8 but will take longer.
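If you'd rather not sprinkle that preg_match() check all over your code, it can be wrapped in a small function. This is only a sketch: sketch_is_valid_utf8() is a name I made up, and it relies on the same 'u' modifier behaviour described above.

```php
<?php
// Hypothetical helper name; the check itself is the preg_match() trick.
function sketch_is_valid_utf8(string $str): bool
{
    // An empty pattern matches any string, including the empty one, so no
    // strlen() guard is needed. With 'u', PCRE validates the whole subject
    // first and preg_match() returns false on malformed UTF-8.
    return preg_match('//u', $str) === 1;
}

var_dump(sketch_is_valid_utf8("caf\u{00E9}"));  // bool(true): well formed
var_dump(sketch_is_valid_utf8("caf\xC3"));      // bool(false): truncated 2-byte sequence
```

The second call is the "only the first byte was sent" case from the attack example: the lead byte 0xC3 promises a second byte that never arrives.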

Regular Expressions

When using regex with UTF-8, you need to use the 'u' modifier. That's all it takes!

    if (preg_match('/myregex/u', $str)) {
        // the pattern matched
    }
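To show what the 'u' modifier actually changes, here's a short sketch comparing the same pattern with and without it. The test strings are my own examples, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
$char = "\u{00E9}";   // "é", two bytes in UTF-8

// Without 'u' the dot matches a single byte, so a 2-byte character is not
// "exactly one character" and the anchored pattern fails.
$withoutU = preg_match('/^.$/', $char);     // 0

// With 'u' the dot matches one whole UTF-8 character.
$withU = preg_match('/^.$/u', $char);       // 1

// 'u' also lets Unicode properties such as \p{L} (any letter) see
// multi-byte characters. The subject here is the Russian word "Привет".
$isWord = preg_match('/^\p{L}+$/u', "\u{041F}\u{0440}\u{0438}\u{0432}\u{0435}\u{0442}"); // 1

var_dump($withoutU, $withU, $isWord);
```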

Stick with ASCII when...

When possible. There are still a lot of places where Latin letters and numbers are the only valid input. For example, URLs and email addresses.

Also, working with ASCII will often perform better than using UTF-8, especially if you are using a PHP-based library like phputf8, which will be slower still than compiled libraries like mbstring at certain operations.
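One way to take advantage of this is to check whether a string is pure ASCII and, if so, use the fast byte-based functions on it. A minimal sketch, where sketch_is_ascii() and the sample address are made up for this example:

```php
<?php
// Hypothetical helper: true when the string contains only 7-bit ASCII bytes.
function sketch_is_ascii(string $str): bool
{
    // No 'u' modifier needed: any byte above 0x7F means non-ASCII content.
    return preg_match('/[^\x00-\x7F]/', $str) === 0;
}

$login = 'alice@example.com';

if (sketch_is_ascii($login)) {
    // Bytes and characters are the same thing here, so strlen() is safe.
    echo strlen($login), " characters\n";
}
```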

UTF-8 and MySQL

If you use MySQL out of the box, and PHP's mysql_* functions, then you can INSERT and SELECT data from MySQL and you're safe almost all the time. That is, you can INSERT a string of UTF-8 characters into a table that is marked as ISO-8859-1. This is because MySQL simply sees a string of 1-byte characters -- it doesn't actually know that your text is UTF-8.

You'll immediately see a problem if you use a tool like phpMyAdmin, though. phpMyAdmin will output the results to you in the same charset as the database table, so you'll end up seeing garbled text.

Two other problems will quickly become apparent:

  1. Collation. Collation is how a language is sorted. For example, in English we go from A-Z.
  2. String functions. Using MySQL's string functions may garble your strings if the table charset and the data charset are different.

What you should do is make sure your tables are encoded in UTF-8, and that the default collation is UTF-8, and that your MySQL client connection is being interpreted as UTF-8.

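To set the global character set and collation you can edit the MySQL configuration file:

    default-character-set=utf8
    default-collation=utf8_general_ci

This sets the charset to UTF-8 and the collation to case-insensitive UTF-8. Note that newer MySQL versions no longer accept these two names as server options; the equivalents under [mysqld] are character-set-server and collation-server.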

When creating a database or table, you can also supply these values:

    CREATE DATABASE `test` DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
    CREATE TABLE `test` (...) CHARACTER SET utf8 COLLATE utf8_general_ci;

Note that specific fields can also have their own charset and collation using the same syntax.

Once you connect to the MySQL server, you should issue a query to indicate that your queries will be in UTF-8:

    mysql_query("SET NAMES 'utf8'");

After that -- you're good to go.

Converting content to UTF-8

First, a note: Make sure you have enough storage for the new strings. As explained earlier, UTF-8 strings can possibly use up to 4 times the amount of storage space. This usually isn't much of a problem with content saved to disk. However, if you are storing content in a database, your fields like VARCHAR(255) may be much too small.

If you need to convert simple strings on the go, you should use iconv. For example, perhaps you need to convert an email message from Windows-1251 (common with Russian) to UTF-8:

    $content = iconv('Windows-1251', 'UTF-8', $old_content);
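Here's that conversion as a small runnable sketch, which also shows the storage growth I mentioned in the note above. The sample bytes are my own example (the Russian word "Привет" in Windows-1251), and I'm assuming your iconv build knows the Windows-1251 charset name.

```php
<?php
// "Привет" encoded as Windows-1251: one byte per Cyrillic letter.
$old_content = "\xCF\xF0\xE8\xE2\xE5\xF2";

// iconv() returns false when the conversion fails, so check the result.
$content = iconv('Windows-1251', 'UTF-8', $old_content);
if ($content === false) {
    throw new RuntimeException('Conversion failed');
}

echo strlen($old_content), " bytes before\n";   // 6
echo strlen($content), " bytes after\n";        // 12: two bytes per letter in UTF-8
```

Cyrillic text doubles in size here, which is the kind of growth to keep in mind when sizing your storage.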

If you are using MySQL, you can convert entire tables of content quite easily by issuing ALTER TABLE queries. Note that your existing charset and collation set on the tables need to be correct or else the translation will garble your text! (Going back to my note above about how MySQL won't know what charset you INSERT text in).

    ALTER TABLE `table` CHARACTER SET utf8 COLLATE utf8_general_ci;
    ALTER TABLE `table` CHANGE `column` `column` [TYPE] CHARACTER SET utf8 COLLATE utf8_general_ci;

Converting UTF-8 content into another character set

Sometimes you might need to convert content back into another set. For example, if you've decided to switch to using UTF-8 but your app still communicates with a legacy system that only understands ISO-8859-1.

Again, you should use iconv which is perfectly suited for this kind of thing:

    $content = iconv('UTF-8', 'ISO-8859-1', $old_content);

Note that the characters represented in UTF-8 may not be able to be represented in the character set you want to convert to. For example, if your content contains the euro sign (€), ISO-8859-1 has no such character.

You have two options: You can either transliterate the characters that can't be converted or simply discard them.

Transliteration is the process of replacing the character with one that looks sort-of like it. For example, "é" might be replaced with "e" when the target character set has no accented letters. To achieve this, you append "//TRANSLIT" to the out-charset:

    $content = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $old_content);

On the other hand if you just want to get rid of the offending characters, simply append "//IGNORE" to the out-charset.
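Putting the pieces together, a conversion helper might try a strict conversion first and only fall back to transliteration when needed. This is a sketch under my own assumptions: sketch_to_latin1() is a made-up name, and the substitutes that //TRANSLIT picks depend on the iconv implementation installed on your system, so I'm not showing any transliterated output here.

```php
<?php
// Hypothetical helper: convert UTF-8 to ISO-8859-1, transliterating
// whatever does not fit.
function sketch_to_latin1(string $utf8): string
{
    // Strict conversion first. iconv() returns false (and raises a notice)
    // if a character has no ISO-8859-1 equivalent, hence the @ operator.
    $strict = @iconv('UTF-8', 'ISO-8859-1', $utf8);
    if ($strict !== false) {
        return $strict;
    }

    // Fall back to transliteration; the replacement characters chosen
    // here vary between iconv implementations.
    $loose = @iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $utf8);
    if ($loose === false) {
        throw new RuntimeException('Conversion to ISO-8859-1 failed');
    }
    return $loose;
}

$latin1 = sketch_to_latin1("caf\u{00E9}");   // "café" fits in ISO-8859-1
echo strlen($latin1), "\n";                  // 4: every character is now one byte
```

Swapping "//TRANSLIT" for "//IGNORE" in the fallback would drop the offending characters instead of approximating them.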

Conclusion

I hope you've learned a bit about using UTF-8 with PHP. It's really not all that bad once you take in all of the information -- definitely not as scary as you thought it'd be, I bet!
