Java-Supplementary Characters and UTF-16 Encoding


In the past, all Unicode characters fit in 16 bits, the size of a Java char (2 bytes), because their values ranged from 0 to FFFF (0 to 65,535). When the unification effort started in the 1980s, a fixed-width 2-byte code was more than sufficient to encode all characters used in all languages in the world, with room to spare for future expansion, or so everyone thought at the time. In 1991, Unicode 1.0 was released, using slightly less than half of the available 65,536 code values. Java was designed from the ground up to use 16-bit Unicode characters, which was a major advance over other programming languages that used 8-bit (1-byte) characters.

Unfortunately, over time, the inevitable happened. Unicode grew beyond 65,536 characters, primarily due to the addition of a very large set of ideographs used for Chinese, Japanese, and Korean. Now, the 16-bit char type is insufficient to describe all Unicode characters. We need a bit of terminology to explain how this problem has been resolved in Java since Java SE 5.0. A code point is a code value that is associated with a character in an encoding scheme. In the Unicode standard, code points are written in hexadecimal and prefixed with U+, such as U+0041 for the code point of the letter A. Unicode code points are grouped into 17 code planes. The first code plane, called the basic multilingual plane (BMP), consists of the “classic” Unicode characters with code points U+0000 (0) to U+FFFF (65,535). Sixteen additional planes, with code points U+10000 (65,536) to U+10FFFF (1,114,111), hold the supplementary characters.
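The plane arithmetic described above can be sketched in a few lines of Java. This is only an illustration: the class name CodePointPlanes and the helper plane() are made up for this example, while Character.isValidCodePoint, Character.isSupplementaryCodePoint, and Character.charCount are standard methods available since Java SE 5.0.

```java
class CodePointPlanes {
    // Hypothetical helper: each plane holds 0x10000 (65,536) code points,
    // so the plane number is the code point divided by 0x10000.
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0xFFFF, 0x10000, 0x28117, 0x10FFFF};
        for (int cp : samples) {
            // charCount reports how many 16-bit char values the code point needs.
            System.out.printf("U+%04X plane=%d supplementary=%b chars=%d%n",
                    cp, plane(cp),
                    Character.isSupplementaryCodePoint(cp),
                    Character.charCount(cp));
        }
    }
}
```

Code points in plane 0 (the BMP) need one char; everything in planes 1 to 16 needs two.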

The UTF-16 encoding is a method of representing all Unicode code points in a variable-length code. The characters in the basic multilingual plane are represented as 16-bit values (2 bytes), called code units. The supplementary characters are encoded as consecutive pairs of code units. Each of the values in such an encoding pair falls into a range of 2048 unused values of the basic multilingual plane, called the surrogates area: U+D800 (55296) to U+DBFF (56319) for the first code unit (the high surrogate), and U+DC00 (56320) to U+DFFF (57343) for the second code unit (the low surrogate). This is rather clever, because you can immediately tell whether a code unit encodes a single character or whether it is the first or second part of a supplementary character. For example, the CJK (Chinese, Japanese, and Korean) ideograph 𨄗 has code point U+28117 (164119) and is encoded by the two code units 0xD860 (55392) and 0xDD17 (56599).
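As a rough illustration of how a single code unit reveals its role, the sketch below classifies each char of a short string. The class name CodeUnitKind and the method classify() are invented for this example; Character.isHighSurrogate and Character.isLowSurrogate are standard methods of the Character class.

```java
class CodeUnitKind {
    // Hypothetical helper: reports which of the three ranges a code unit falls into.
    static String classify(char unit) {
        if (Character.isHighSurrogate(unit)) {
            return "high surrogate";   // 0xD800 to 0xDBFF
        }
        if (Character.isLowSurrogate(unit)) {
            return "low surrogate";    // 0xDC00 to 0xDFFF
        }
        return "BMP character";        // any other 16-bit value
    }

    public static void main(String[] args) {
        // "A" followed by the supplementary character U+28117: three code units in total.
        String text = "A" + new String(Character.toChars(0x28117));
        for (int i = 0; i < text.length(); i++) {
            char unit = text.charAt(i);
            System.out.printf("0x%04X -> %s%n", (int) unit, classify(unit));
        }
    }
}
```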


UTF-16 Encoding Algorithm

To encode U+28117 (164119, 𨄗) in UTF-16:

  1. Subtract 0x10000 (65536) from the code point, leaving 0x18117 (98583).
  2. For the high surrogate, shift right by 10 (divide by 0x400 (1024)), then add 0xD800 (55296), resulting in 0x0060 (96) + 0xD800 (55296) = 0xD860 (55392).
  3. For the low surrogate, take the low 10 bits (the remainder of dividing by 0x400 (1024)), then add 0xDC00 (56320), resulting in 0x0117 (279) + 0xDC00 (56320) = 0xDD17 (56599).
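The three encoding steps can be written out by hand as follows. This is a minimal sketch for illustration, not how the JDK implements it; the class Utf16EncodeSketch and its encode() method are hypothetical names, and in real code Character.toChars does the same job.

```java
import java.util.Arrays;

class Utf16EncodeSketch {
    // Hypothetical helper: encodes a supplementary code point as a surrogate pair.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("Not a supplementary code point: " + codePoint);
        }
        int offset = codePoint - 0x10000;                 // step 1
        char high = (char) (0xD800 + (offset >> 10));     // step 2: top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));    // step 3: low 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = encode(0x28117);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // prints "D860 DD17"
        // Cross-check against the standard library method.
        System.out.println(Arrays.equals(pair, Character.toChars(0x28117))); // prints "true"
    }
}
```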


UTF-16 Decoding Algorithm

To decode the surrogate pair 0xD860, 0xDD17 back to U+28117 (164119, 𨄗):

  1. Take the high surrogate 0xD860 (55392) and subtract 0xD800 (55296), then multiply by 0x400 (1024), resulting in 0x0060 (96) * 0x400 (1024) = 0x18000 (98304).
  2. Take the low surrogate 0xDD17 (56599) and subtract 0xDC00 (56320), resulting in 0x0117 (279).
  3. Add these two results together to get 0x18117 (98583), and finally add 0x10000 (65536) to get the decoded code point, 0x28117 (164119).
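The decoding steps translate to Java just as directly. Again this is only a sketch under the assumption that the input is a well-formed surrogate pair; the class Utf16DecodeSketch and its decode() method are hypothetical names, and Character.toCodePoint is the standard equivalent.

```java
class Utf16DecodeSketch {
    // Hypothetical helper: decodes a surrogate pair into a supplementary code point.
    static int decode(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("Not a valid surrogate pair");
        }
        int top = (high - 0xD800) * 0x400;   // step 1: restore the top 10 bits
        int bottom = low - 0xDC00;           // step 2: restore the low 10 bits
        return top + bottom + 0x10000;       // step 3: add back the offset
    }

    public static void main(String[] args) {
        int codePoint = decode((char) 0xD860, (char) 0xDD17);
        System.out.printf("U+%X (%d)%n", codePoint, codePoint); // prints "U+28117 (164119)"
    }
}
```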


Supplementary Character Handling Methods

The Character class wraps the char data type. In J2SE 5.0, many methods were added to the Character class to support supplementary characters. The static toChars(int codePoint) method converts the specified character (Unicode code point) to its UTF-16 representation, stored in a char array. If the specified code point is a BMP (Basic Multilingual Plane, or Plane 0) value, the resulting char array holds a single char with the same value as codePoint. If the specified code point is a supplementary code point, the resulting char array holds the corresponding surrogate pair. The static toCodePoint(char high, char low) method converts the specified surrogate pair to its supplementary code point value and returns it.



Program Source

public class Javaapp {
    public static void main(String[] args) {
        // Code point -> UTF-16 code units (a surrogate pair for U+28117).
        char[] getchar = Character.toChars(164119);
        System.out.println("164119 -> " + String.valueOf(getchar));

        // Build the same surrogate pair by hand: 0xD860 (55392) and 0xDD17 (56599).
        char[] setchar = new char[2];
        setchar[0] = 55392;
        setchar[1] = 56599;
        System.out.println("164119 -> " + String.valueOf(setchar));

        // Surrogate pair -> code point.
        int getcodepoint = Character.toCodePoint(setchar[0], setchar[1]);
        System.out.println("𨄗 -----> " + getcodepoint);
    }
}
