UTF-8 Encoding

Since the original 16-bit form of Unicode (UCS-2, the form Java's char type was designed around) encodes every character in exactly two bytes, it is a fairly simple encoding. The first two bytes of a file are the first character, the next two bytes are the second character, and so on. This makes parsing such data relatively simple compared to schemes that use variable-width characters. The downside is that it is far from the most efficient encoding possible. In a file containing mostly English text, the high byte of almost every character will be 0, so these zero bytes can occupy as much as half of the file. If you're sending data across the network, 16-bit Unicode data can take twice as long to transmit as the same text in a one-byte encoding.
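
As a rough illustration of that overhead, the following sketch encodes a short English string in big-endian 16-bit form and counts how many of the high bytes are zero. The class and method names are hypothetical, chosen only for this example; `StandardCharsets.UTF_16BE` is used because it writes two bytes per character here without a byte-order mark.

```java
import java.nio.charset.StandardCharsets;

class HighByteCount {
    // Total number of bytes the text occupies in big-endian 16-bit form.
    static int encodedLength(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Counts the two-byte units whose high-order (first) byte is zero.
    static int zeroHighBytes(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        int zeros = 0;
        for (int i = 0; i < bytes.length; i += 2) {
            if (bytes[i] == 0) {
                zeros++;
            }
        }
        return zeros;
    }

    public static void main(String[] args) {
        String text = "Hello, world";
        System.out.println(encodedLength(text) + " bytes in total");
        System.out.println(zeroHighBytes(text) + " of them are zero high bytes");
    }
}
```

For this 12-character ASCII string the encoded form is 24 bytes long, and 12 of those bytes are zero.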

A more efficient encoding can be achieved for files composed primarily of ASCII text by encoding the more common characters in fewer bytes. UTF-8 is one such format. The variant Java uses, often called modified UTF-8, encodes the non-null ASCII characters in a single byte, characters between 128 and 2047 and the ASCII null in two bytes, and the remaining characters in three bytes. (Standard UTF-8 encodes the null character in a single byte; the two-byte null is specific to Java's variant.) In theory this encoding can expand a file by 50% relative to the two-byte form, since a character that needed two bytes may now need three. Because most text files contain primarily ASCII, however, in practice it is almost always a substantial saving. For this reason, Java uses this form of UTF-8 for string literals, identifiers, and other text data in compiled byte code.
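
One way to observe these byte counts is through `DataOutputStream.writeUTF`, which is documented to write the same modified UTF-8 format, preceded by a two-byte length. The sketch below is only an illustration: the class name and the `payloadLength` helper are hypothetical, and the helper subtracts the length prefix so that only the character bytes are counted.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ModifiedUtf8Sizes {
    // Number of bytes writeUTF produces for the text itself,
    // excluding the two-byte length prefix it always writes first.
    static int payloadLength(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked for simplicity.
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2;
    }

    public static void main(String[] args) {
        System.out.println(payloadLength("A"));       // ASCII letter: 1 byte
        System.out.println(payloadLength("\u0000"));  // ASCII null: 2 bytes
        System.out.println(payloadLength("\u00E9"));  // character 233: 2 bytes
        System.out.println(payloadLength("\u20AC"));  // character 8364: 3 bytes
    }
}
```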

To better understand UTF-8, consider a typical Unicode character as a sequence of 16 bits. Each ASCII character except the null character (each character between 1 and 127) has its upper nine bits equal to 0. Therefore, it's easy to encode an ASCII character as a single byte: just drop the high-order byte, leaving a byte of the form 0xxxxxxx.

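[Figure: UTF-8 single-byte encoding]

Now consider characters between 128 and 2047. These all have their top five bits equal to 0. These characters are encoded into two bytes, but not in the most obvious fashion. The 11 significant bits of the character are broken up: the upper five bits follow the prefix 110 in the first byte, and the lower six bits follow the prefix 10 in the second byte.

[Figure: UTF-8 two-byte encoding]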

The remaining characters have values between 2048 and 65,535. Any or all of the bits in these characters may take on either value, 0 or 1. Thus, they are encoded in three bytes: the first byte is 1110 followed by the top four bits, the second byte is 10 followed by the middle six bits, and the third byte is 10 followed by the low six bits.

[Figure: UTF-8 three-byte encoding]

Within this scheme, any byte beginning with a 0 bit must be a single-byte ASCII character between 1 and 127. Any byte beginning with the three bits 110 must be the first byte of a two-byte character. Any byte beginning with the four bits 1110 must be the first byte of a three-byte character. Finally, any byte beginning with the two bits 10 must be the second or third byte of a multibyte character.
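
The three layouts and the leading-bit rules can be sketched in a few lines of Java. This is a minimal illustration, not the code the Java runtime itself uses: the class `Utf8Sketch`, its method names, and the labels returned by `classify` are all hypothetical. Following the scheme described here, the null character is given the two-byte form.

```java
class Utf8Sketch {
    // Encodes one 16-bit character using the one-, two-, or three-byte layout.
    static byte[] encode(char c) {
        if (c >= 1 && c <= 127) {
            return new byte[] { (byte) c };              // 0xxxxxxx
        } else if (c <= 2047) {                          // includes the null character
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),                // 110xxxxx: upper five bits
                (byte) (0x80 | (c & 0x3F))               // 10xxxxxx: lower six bits
            };
        } else {
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),               // 1110xxxx: top four bits
                (byte) (0x80 | ((c >> 6) & 0x3F)),       // 10xxxxxx: middle six bits
                (byte) (0x80 | (c & 0x3F))               // 10xxxxxx: low six bits
            };
        }
    }

    // Classifies a byte by its leading bits.
    static String classify(byte b) {
        int bits = b & 0xFF;                             // treat the byte as unsigned
        if ((bits & 0x80) == 0x00) return "single";        // 0.......
        if ((bits & 0xC0) == 0x80) return "continuation";  // 10......
        if ((bits & 0xE0) == 0xC0) return "lead2";         // 110.....
        if ((bits & 0xF0) == 0xE0) return "lead3";         // 1110....
        return "invalid";                                // not produced by this scheme
    }

    public static void main(String[] args) {
        for (char c : new char[] { 'A', '\u00E9', '\u20AC' }) {
            StringBuilder line = new StringBuilder();
            for (byte b : encode(c)) {
                line.append(String.format("%02X(%s) ", b & 0xFF, classify(b)));
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

For the letter A (65) this yields the single byte 41; for character 233 it yields C3 A9; and for character 8364 it yields E2 82 AC. For every character other than null and the surrogate range, these bytes match what `String.getBytes(StandardCharsets.UTF_8)` produces.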
