Jan 5, 2012

Java: Character Encoding in char[] and byte[]

We were dealing with an issue at work where some emails going to Japanese customers were coming through as gibberish. So I went into the code and noticed that, at one point in the processing, the email body gets converted to a byte array and then back to a String. Suspecting that was the problem, I found myself wondering about Java's byte[] and the more familiar char[]. java.lang.String has methods that convert into both (getBytes() and toCharArray()), so what's the difference between the two underlying primitive types?

The more I thought about it, the more it made sense that the byte array was going to be the issue. At first glance (especially for those of us in the west), a char would seem to be essentially the same as a byte. But that was before I gave more consideration to the incredibly complex and amazing world of character encoding. A byte is by definition 8 bits, so it can hold only 256 distinct values, which limits any one-byte-per-character set to 256 characters. The primitive data type char on the other hand (and I had to look this up) is a single 16-bit Unicode character (strictly speaking, a UTF-16 code unit). It is two bytes long, and 16 bits is enough to represent 65,536 unique values - plenty for, say, the everyday Japanese character set.
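To make the size difference concrete, here is a minimal sketch. The class name CharVersusByte, the SAMPLE constant and the encodedLength helper are made up for illustration, the sample text is just the word "日本語" written with Unicode escapes, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharVersusByte {
    // Hypothetical sample: three Japanese characters (U+65E5, U+672C, U+8A9E).
    static final String SAMPLE = "\u65E5\u672C\u8A9E";

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        System.out.println("bits in a byte: " + Byte.SIZE);        // 8
        System.out.println("bits in a char: " + Character.SIZE);   // 16
        System.out.println("chars in sample: " + SAMPLE.length()); // 3
        System.out.println("UTF-8 bytes: "
                + encodedLength(SAMPLE, StandardCharsets.UTF_8));    // 9
        System.out.println("UTF-16BE bytes: "
                + encodedLength(SAMPLE, StandardCharsets.UTF_16BE)); // 6
    }
}
```

The char count stays the same no matter what, while the byte count depends entirely on which encoding was asked for.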

To test this theory, I wrote a little program that read a string of Japanese text from a file (encoded as UTF-16), then converted it to both a byte[] and a char[], then converted those back to two Strings and compared them against the original.

// Needs java.io.BufferedReader, java.io.FileInputStream and java.io.InputStreamReader.
public void testJapaneseCharactersInIsolation() throws Exception {
    // jap.txt contains a sentence or two of Japanese text.
    FileInputStream file = new FileInputStream("c:\\jap.txt");
    // Without an explicit charset the reader would fall back to the
    // platform default, so we specifically tell it to read UTF-16.
    InputStreamReader isr = new InputStreamReader(file, "UTF-16");
    BufferedReader reader = new BufferedReader(isr);

    String japanese = reader.readLine();
    reader.close();

    // char[] round trip: no encoding is involved.
    char[] chars = japanese.toCharArray();
    String japaneseChars = new String(chars);

    // byte[] round trip: no charset given, so the platform default
    // encoding is used in both directions.
    byte[] bytes = japanese.getBytes();
    String japaneseBytes = new String(bytes);

    if (japanese.equals(japaneseChars)) {
        System.err.println("chars equals original");
    } else {
        System.err.println("chars fails");
    }
    if (japanese.equals(japaneseBytes)) {
        System.err.println("bytes equals original");
    } else {
        System.err.println("bytes fails");
    }
}

And the output:

chars equals original
bytes fails
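
That result depends on whatever the default encoding of the machine happens to be. To take the platform out of the picture, the same byte[] round trip can be sketched with the charset named explicitly. The class and method names here are hypothetical, the sample text is the word "日本語" in Unicode escapes, and ISO-8859-1 is only a stand-in for "some 8-bit default"; I'm not claiming it is what our servers actually use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Encode the text to bytes and decode it again with the same charset,
    // then report whether the result still matches the original.
    static boolean survivesRoundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        String decoded = new String(encoded, charset);
        return text.equals(decoded);
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E"; // hypothetical sample text

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println("UTF-8: "
                + survivesRoundTrip(japanese, StandardCharsets.UTF_8));      // true

        // ISO-8859-1 has no Japanese characters; getBytes() substitutes '?'.
        System.out.println("ISO-8859-1: "
                + survivesRoundTrip(japanese, StandardCharsets.ISO_8859_1)); // false
    }
}
```

The damage happens on the way out: once getBytes() has swapped a character for '?', no amount of care on the decoding side brings it back.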

Internally, we're actually using ByteArrayOutputStream without specifying any sort of encoding on the toString(), so the effect is as if we were just doing String foo = new String(bar.getBytes()) and expecting foo to equal the String bar. That holds when the platform default encoding can represent every character in the text (UTF-8 can), but not when the default is an 8-bit character set with no room for Japanese. Of course, Sun (now Oracle) was smart enough to assume people like me would want to use ByteArrayOutputStream for Japanese characters, and toString accepts an encoding name. The next step for me is to write some business logic that specifies the proper encoding for the language in use (or, preferably, find a single encoding that will work for everyone).
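
As a rough sketch of what that fix might look like (not our production code: the class and method names are invented, and picking UTF-8 as the one encoding for everyone is my assumption), the idea is to name the same charset when the bytes go into the ByteArrayOutputStream and again when they come back out through toString(String):

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class BufferedBodyDemo {
    // Push an email body through a ByteArrayOutputStream and read it back,
    // using UTF-8 explicitly in both directions.
    static String roundTripThroughBuffer(String body) {
        byte[] encoded = body.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(encoded, 0, encoded.length);
        try {
            // toString() with no argument would use the platform default.
            return buffer.toString("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 unavailable", e);
        }
    }

    public static void main(String[] args) {
        String body = "\u65E5\u672C\u8A9E"; // hypothetical sample text
        System.out.println(body.equals(roundTripThroughBuffer(body))); // true
    }
}
```

What matters is that both ends agree on the encoding; UTF-8 is just a convenient choice because it covers every character a Java String can hold.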

By the way - Joel Spolsky is much smarter than me and has written a great article educating developers who introduce bugs by only supporting 8-bit character encoding. It's called "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" http://www.joelonsoftware.com/articles/Unicode.html.