Character encodings by example

Posted on 20.03.07

In this text I assume you have a basic understanding of character sets. If you need to brush up on that area, see the suggested readings at the end. You can use the code charts at http://www.unicode.org/charts/ to look up the code points of the characters we will use. I use Microsoft Calculator to convert between hexadecimal, decimal and binary numbers.

In this example we will write a Java program that writes bytes to text files. The files will contain characters encoded in ISO-8859-1 (Latin-1) and in UTF-8, an encoding of the Unicode character set. We will use Microsoft Notepad to view these files.

The Java program

The program below takes a file name and binary strings as arguments. Each binary string represents one byte in the text file, so use at most 8 digits per string: out.write writes only the low 8 bits of the value, so anything larger is silently truncated. If you want to, you can modify the main method to read decimal numbers instead.

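import java.io.FileOutputStream;

public class Bits {

    public static void main(String[] args) throws Exception {
        // args[0] is the output file name; every following argument is one
        // byte written as a binary string, for example 1000001.
        try (FileOutputStream out = new FileOutputStream(args[0])) {
            for (int i = 1; i < args.length; i++) {
                int bits = Integer.parseInt(args[i], 2);
                // Echo the byte in binary, decimal and hexadecimal.
                System.out.print("Byte " + i + ": " + args[i] + " ");
                System.out.print(bits + " ");
                System.out.println(Integer.toHexString(bits));
                // write(int) writes only the low 8 bits of its argument.
                out.write(bits);
            }
        } // try-with-resources closes the file, even if parsing fails
    }
}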
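
The text above mentions switching to decimal numbers and warns about strings longer than 8 digits. As a minimal sketch of both ideas, the helper below parses an argument in a given radix and rejects values that do not fit in one byte. The names ByteArgs and parseByte are my own for illustration; they are not part of the Bits program.

```java
// Hypothetical helper for illustration; not part of the original Bits program.
class ByteArgs {

    // Parses text in the given radix (2 for binary, 10 for decimal) and
    // checks that the value fits in a single unsigned byte (0..255).
    static int parseByte(String text, int radix) {
        int value = Integer.parseInt(text, radix);
        if (value < 0 || value > 255) {
            throw new IllegalArgumentException(text + " does not fit in one byte");
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseByte("1000001", 2)); // binary input, prints 65
        System.out.println(parseByte("197", 10));    // decimal input, prints 197
    }
}
```

In Bits you could call such a helper in place of Integer.parseInt(args[i],2), so that a 9-digit string fails loudly instead of writing a truncated byte.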

Example 1

Find the character A in the Basic Latin code chart. It has the hexadecimal code point 0041, which is 1000001 in binary (65 in decimal). Let’s write a byte with this value to the text file example1.txt:

java Bits example1.txt 1000001
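
If you do not have a calculator at hand, the conversion between hexadecimal, binary and decimal can also be done in Java. This is only a small sketch; the class name CodePointInfo and the method describe are made up for this example.

```java
// Hypothetical helper for illustration; the names are not from the original text.
class CodePointInfo {

    // Formats a code point as "U+XXXX <binary> <decimal>".
    static String describe(int codePoint) {
        return String.format("U+%04X %s %d", codePoint,
                Integer.toBinaryString(codePoint), codePoint);
    }

    public static void main(String[] args) {
        System.out.println(describe('A')); // prints U+0041 1000001 65
    }
}
```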

If you open the file with Microsoft Notepad you will see that it contains the letter A, as expected. If you pick Save As from the menu you can see that Notepad suggests the ANSI encoding. What Notepad calls ANSI is the system’s default Windows code page (Windows-1252 on Western European systems), which, like Latin-1, is an 8-bit extension of ASCII. ANSI and Latin-1 have differences, but we will only use characters that are encoded the same way in both. So just think of ANSI as Latin-1 throughout the text.

Example 2

Click the Save As option in the Notepad menu and save example1.txt as example2.txt with the UTF-8 encoding. Take a look at the file properties of the newly created file. Notice that it has a size of 4 bytes, but if you open it again it still contains the lonely letter A. The 3 new bytes are located at the beginning of the file and form the Byte Order Mark (BOM) for the UTF-8 encoding: EF BB BF in hexadecimal. For this file the BOM is the only way for Notepad to know that it is a text document encoded in UTF-8 and not Latin-1, because the character A is encoded the same way in both encodings. Let’s create example2.txt manually:

java Bits example2.txt 11101111 10111011 10111111 1000001
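
To check from Java whether a file starts with these three bytes, something like the sketch below should do. BomCheck and hasUtf8Bom are names I picked for the example; the program expects the file name as its only argument.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; the names are not from the original text.
class BomCheck {

    // Returns true if the data starts with the UTF-8 BOM bytes EF BB BF.
    // Java bytes are signed, so each one is masked with 0xFF before comparing.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + ": " + data.length
                + " bytes, UTF-8 BOM: " + hasUtf8Bom(data));
    }
}
```

Running java BomCheck example2.txt should report 4 bytes and a BOM, while example1.txt should report 1 byte and no BOM.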

Example 3

Let’s try a character that is encoded differently in the two encodings. Below we make three text files with the Norwegian letter Å: the first encoded in Latin-1, the second in UTF-8 with a BOM, and the last in UTF-8 without the BOM. The Latin-1 chart tells us that Å has the hexadecimal code point 00C5. If we convert it using Microsoft Calculator we get the binary string 11000101 or the decimal number 197.

java Bits example3-1.txt 11000101
java Bits example3-2.txt 11101111 10111011 10111111 11000011 10000101
java Bits example3-3.txt 11000011 10000101
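
The byte values used in these commands can be double-checked by letting Java do the encoding. The sketch below uses String.getBytes with the standard ISO_8859_1 and UTF_8 charsets and prints every byte as 8 binary digits. EncodeDemo and toBinary are names I made up for the example. Note that getBytes does not add a BOM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; the names are not from the original text.
class EncodeDemo {

    // Encodes text with the given charset and returns the bytes as
    // space-separated groups of 8 binary digits.
    static String toBinary(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros so every byte shows all 8 digits.
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00C5 is the letter Å; the escape avoids depending on the
        // encoding of this source file.
        String text = "\u00C5";
        System.out.println("Latin-1: " + toBinary(text, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:   " + toBinary(text, StandardCharsets.UTF_8));
    }
}
```

The Latin-1 line should show the single byte 11000101, and the UTF-8 line the two bytes 11000011 10000101.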

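Notice that Notepad recognizes the UTF-8 encoded Å in example3-3.txt even though we left out the BOM; without a BOM it has to guess the encoding from the bytes themselves. Also notice that we need two bytes to encode Å in UTF-8. Some characters are even encoded in four bytes in UTF-8. The first byte, 11000011, starts with 110, which marks a two-byte sequence and leaves the payload bits 00011. The second byte, 10000101, starts with 10, which marks a continuation byte and leaves the payload bits 000101. Joined together, the payload bits 00011 000101 are 11000101, the code point 197. 197 refers to the letter Å in both the Unicode and the Latin-1 character set.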
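
How the two bytes combine into 197 can be sketched in a few lines of Java. This is a simplified illustration, not a full UTF-8 decoder: it only handles two-byte sequences and does not reject overlong encodings. Utf8TwoByte and decode are names I chose for the example.

```java
// Hypothetical helper for illustration; handles two-byte sequences only.
class Utf8TwoByte {

    // Decodes a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into its code point.
    static int decode(int first, int second) {
        boolean leadOk = (first & 0b11100000) == 0b11000000;          // starts with 110
        boolean continuationOk = (second & 0b11000000) == 0b10000000; // starts with 10
        if (!leadOk || !continuationOk) {
            throw new IllegalArgumentException("not a two-byte UTF-8 sequence");
        }
        int high = first & 0b00011111;  // 5 payload bits from the first byte
        int low = second & 0b00111111;  // 6 payload bits from the second byte
        return (high << 6) | low;
    }

    public static void main(String[] args) {
        int codePoint = decode(0b11000011, 0b10000101);
        System.out.println(codePoint);                       // prints 197
        System.out.println(Integer.toHexString(codePoint));  // prints c5
    }
}
```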

Now I hope you have a better understanding of how characters are encoded. If you want to know more you could take a look at the suggested readings. I personally recommend the XML in a Nutshell book. It’s almost everything you need.

Suggested readings
