The UTF-8 character encoding, XML and PHP5

Posted on 11.07.06

In this article you will learn a little about how character sets, XML and PHP5 work together. We will also look more closely at the most common character sets and encodings.

UTF-8

First, a little background so that you can tell these two terms apart:

  • character set
  • character encoding

Data is stored on disk as bits. To get a character out of a bit sequence the computer:

  1. first reads a number of bits,
  2. decodes the bits to get a code point and
  3. maps the code point to a character.

For instance:

  1. the fetched bit sequence 1100001
  2. could be decoded into code point 97
  3. which maps to the character a using the Unicode character set.

It’s also common to express code points in hex. For instance, the Arabic character ف has code point 641 in hex, which is 1601 in decimal.
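The three steps and the hex notation can be tried out in a few lines of PHP. This is only a minimal sketch using built-in functions; the variable names are made up for illustration, and chr() is used for the last step only because code point 97 lies in the ASCII range.

```php
<?php
// Steps 1 and 2: read the bit sequence and decode it to a code point.
$bits = '1100001';
$codePoint = bindec($bits);              // 97

// Step 3: map the code point to a character (fine for ASCII values).
$char = chr($codePoint);                 // "a"

// The same code point can be written in hex or in decimal.
$hex = dechex(1601);                     // "641"
$dec = hexdec('641');                    // 1601

echo $bits, ' -> ', $codePoint, ' -> ', $char, "\n";
echo sprintf('U+%04X', $dec), ' = ', $dec, "\n";   // U+0641 = 1601
```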

It should now be clear that a character encoding maps bits to code points, while a character set maps code points to characters. Unicode has several encodings, whereas ISO-8859-1 (Latin-1) has only one. UTF-8 is one of the Unicode encodings. Unicode and Latin-1 assign the same characters to code points 0 through 255, and UTF-8 produces the same bytes as Latin-1 for code points 0 through 127. From code point 128 and up the two encodings differ: Latin-1 still uses a single byte, while UTF-8 uses two or more.
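To make the difference concrete, here is a small sketch that encodes one code point both ways. The helper latin1ByteToUtf8() is hypothetical and written only for this illustration; it handles code points 0 through 255 and nothing else. Real code would normally rely on an extension such as iconv or mbstring instead.

```php
<?php
// Hypothetical helper: encode one Latin-1 code point (0-255) as UTF-8.
function latin1ByteToUtf8($codePoint)
{
    if ($codePoint < 128) {
        // Below 128 both encodings use the same single byte.
        return chr($codePoint);
    }
    // 128-255 take two bytes in UTF-8: 110xxxxx 10xxxxxx.
    return chr(0xC0 | ($codePoint >> 6)) . chr(0x80 | ($codePoint & 0x3F));
}

$latin1 = chr(0xE5);                  // "å" as a single Latin-1 byte
$utf8   = latin1ByteToUtf8(0xE5);     // "å" as two UTF-8 bytes
$ascii  = latin1ByteToUtf8(97);       // "a" is identical in both

echo bin2hex($latin1), "\n";          // e5
echo bin2hex($utf8), "\n";            // c3a5
echo bin2hex($ascii), "\n";           // 61
```

The character å has code point 229 (E5 hex) in both character sets, yet it ends up as different bytes on disk depending on the encoding.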

XML and PHP5

Use the following XML declaration:

<?xml version="1.0" encoding="ISO-8859-1"?>

to tell your software that your XML file is encoded as Latin-1. The default is UTF-8, so if you omit the encoding attribute (and the file does not start with a UTF-16 byte order mark) you are in effect declaring that the file is UTF-8 encoded.
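As a rough sketch of how the declaration is used in PHP5, the snippet below parses a Latin-1 document with SimpleXML. It assumes the SimpleXML extension is enabled (it normally is); the element name and variable names are invented for the example. The underlying libxml parser reads the declared encoding and hands the text back to PHP as UTF-8.

```php
<?php
// A Latin-1 encoded document: "å" is the single byte 0xE5.
$xmlLatin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
           . "<name>\xE5</name>";

// The parser uses the encoding from the XML declaration.
$doc = simplexml_load_string($xmlLatin1);

// Text coming out of SimpleXML is UTF-8, whatever the input encoding was.
$text = (string) $doc;

echo bin2hex($text), "\n";   // c3a5
```

If the declaration were missing, the parser would treat the lone E5 byte as malformed UTF-8 and report an error.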

Opera, Firefox and Internet Explorer all know how to display Unicode characters encoded as UTF-8. The problem is that web servers such as Apache often send a Latin-1 charset in the Content-Type header by default. Browsers then believe they are reading Latin-1 and display characters that are encoded differently in the two encodings, such as å, incorrectly. The following PHP code, which must run before any output is sent, changes the header so that browsers know they are receiving UTF-8:

header('Content-Type: text/html; charset=utf-8');
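To see what a browser shows when that header is missing, the following sketch imitates a reader that interprets UTF-8 bytes as Latin-1. The function readAsLatin1() is hypothetical and exists only for this demonstration: it treats every byte as one Latin-1 code point and re-encodes the result as UTF-8 so that it can be printed.

```php
<?php
// Hypothetical helper: treat every byte as a Latin-1 code point and
// return the text a Latin-1 reader would see, re-encoded as UTF-8.
function readAsLatin1($bytes)
{
    $out = '';
    for ($i = 0; $i < strlen($bytes); $i++) {
        $cp = ord($bytes[$i]);
        $out .= $cp < 128
            ? chr($cp)
            : chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return $out;
}

$utf8 = "\xC3\xA5";               // "å" encoded as UTF-8
$seen = readAsLatin1($utf8);      // what a Latin-1 reader displays

echo bin2hex($seen), "\n";        // c383c2a5, which displays as "Ã¥"
```

The two bytes of å are shown as two separate characters, which is the familiar garbled output on pages served with the wrong charset.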
