
Java-ASCII and Unicode

ASCII Character Set

Early computers and programming languages were created mainly by English-speaking programmers in countries where English was the native language. They developed a standard mapping between code points 0 through 127 and the 128 commonly used characters in the English language (such as A–Z). The resulting character set/encoding was named the American Standard Code for Information Interchange (ASCII). Because ASCII is a seven-bit character set, each ASCII character can easily be represented as a single byte, signed or unsigned. Thus, it is natural for ASCII-based programming languages to equate the character data type with the byte data type. In these languages, such as C, the same operations that read and write bytes also read and write characters.

Unfortunately, ASCII is inadequate for almost all non-English languages. It does not contain any of the thousands of other characters that are used to read and write text around the world. Because a byte can represent a maximum of 256 different characters, developers around the world started creating different character sets/encodings that encoded the 128 ASCII characters but also added extra characters to meet the needs of languages such as French, Greek, and Russian. Many of these character sets are still used today, and much existing data is encoded in them.

The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) worked to standardize these 8-bit character sets/encodings under a joint umbrella standard called ISO/IEC 8859. The result is a series of substandards named ISO/IEC 8859-1, ISO/IEC 8859-2, ISO/IEC 8859-3, and so on. For example, ISO/IEC 8859-1 (also known as Latin-1) defines a character set/encoding that consists of ASCII plus the characters used in most Western European countries. Similarly, ISO/IEC 8859-2 (also known as Latin-2) defines a character set/encoding covering Central and Eastern European countries.

However, these character sets are still inadequate for many needs. For example, most character sets/encodings only allow you to create documents in a combination of English and one other language (or a small number of other languages). You cannot, for example, use an ISO/IEC character set/encoding to create a document using a combination of English, French, Turkish, Russian, and Greek characters. This and other problems are being addressed by an international effort that has created and is continuing to develop Unicode.

Unicode

Unicode is an international effort to provide a single character set that everyone can use. Unicode supports the characters needed for English, Arabic, Tamil, Cyrillic, Greek, Devanagari (Hindi), and many others. Java was one of the first programming languages to explicitly address the need for non-English text. It does this by adopting Unicode as its native character set. All Java chars and strings are given in Unicode. At the time of Java’s creation, Unicode required 16 bits. Thus, in Java char is a 16-bit (two-byte) type. The range of a char is 0 to 65,535. There are no negative chars. Since Java is designed to allow programs to be written for worldwide use, it makes sense that it would use Unicode to represent characters. The use of Unicode is somewhat inefficient for languages such as English, German, Spanish, or French, whose characters can easily be contained within 8 bits. But such is the price that must be paid for global portability.
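The 16-bit range of char can be confirmed with the constants defined in java.lang.Character, and a char can hold any code point in that range, English or not (the class name below is illustrative):

```java
public class CharRange {

    public static void main(String[] args) {
        // char is an unsigned 16-bit type: 0 through 65,535.
        System.out.println((int) Character.MIN_VALUE); // 0
        System.out.println((int) Character.MAX_VALUE); // 65535

        // A char can hold any value in that range, including non-English text.
        char greek = 'Ω'; // Greek capital omega, code point 937
        System.out.println((int) greek); // 937
    }
}
```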


The first 128 Unicode characters (characters 0 through 127) are identical to the ASCII character set. Character 32 is the ASCII space; therefore, 32 is the Unicode space. Character 33 is the ASCII exclamation point, so 33 is the Unicode exclamation point, and so on. The next 128 Unicode characters (characters 128 through 255) have the same values as the equivalent characters in the ISO/IEC 8859-1 (Latin-1) character set, a slight variation of which is used by Windows. Latin-1 adds the various accented characters, umlauts, cedillas, upside-down question marks, and other characters needed to write text in most Western European languages. The first 128 characters in ISO/IEC 8859-1 are identical to the ASCII character set. Values beyond 255 encode characters from various other character sets.
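Because a char converts directly to its integer code point, the ASCII and Latin-1 overlap described above is easy to verify (the class name below is illustrative):

```java
public class UnicodeOverlap {

    public static void main(String[] args) {
        // The first 128 Unicode code points match ASCII exactly.
        System.out.println((int) ' '); // 32, the space character
        System.out.println((int) '!'); // 33, the exclamation point

        // Code points 128 through 255 match ISO/IEC 8859-1 (Latin-1).
        System.out.println((int) 'é'); // 233, e with acute accent
        System.out.println((int) '¿'); // 191, inverted question mark
    }
}
```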

Characters in Java are indices into the Unicode character set. They are 16-bit values that can be converted into integers and manipulated with the integer operators, such as the addition and subtraction operators. A literal character is written inside a pair of single quotes. All of the visible ASCII characters can be entered directly inside the quotes, such as 'a', 'z', and '@'. For characters that are impossible to enter directly, there are several escape sequences that allow you to enter the character you need, such as '\'' for the single-quote character itself and '\n' for the newline character. There is also a mechanism for directly entering the value of a character in octal (base-8) or hexadecimal (base-16) notation. For octal, use a backslash followed by a three-digit number. For example, '\143' is the letter 'c'. For hexadecimal, you enter a backslash-u (\u) followed by exactly four hexadecimal digits. For example, '\u0064' is the ASCII 'd'.
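The octal and hexadecimal escapes mentioned above, and the use of integer operators on chars, can be sketched in a few lines (the class name below is illustrative):

```java
public class CharEscapes {

    public static void main(String[] args) {
        // Octal escape: \143 is octal 143 = decimal 99 = 'c'.
        System.out.println('\143'); // c

        // Hexadecimal escape: \u0064 is decimal 100 = 'd'.
        System.out.println('\u0064'); // d

        // chars work with the integer operators; the result of
        // char arithmetic is an int, so cast it back to char.
        char c = 'a';
        System.out.println((char) (c + 1)); // b
    }
}
```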

Program


Program Source

public class Javaapp {
    
    public static void main(String[] args)  {
        
       char ch1 = 'A';      // character literal
       char ch2 = 65;       // decimal code point of 'A'
       char ch3 = '\101';   // octal escape: 101 (base 8) = 65
       char ch4 = '\u0041'; // hexadecimal escape: 41 (base 16) = 65
       System.out.println("ch1 : "+ch1);
       System.out.println("ch2 : "+ch2);
       System.out.println("ch3 : "+ch3);
       System.out.println("ch4 : "+ch4);
    }
}
