The Unicode Standard

 

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text characters and symbols from the world's writing systems. At the time of writing, the latest version of Unicode contains a repertoire of more than 110,000 characters covering 100 scripts and multiple symbol sets.

Within the Unicode Standard, every character has a unique reference number, called a code point, which allows the character to be represented consistently on any platform, in any program and in any language.
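
As a rough illustration of the unique reference number, the Python sketch below prints it in the conventional U+XXXX notation for a few sample characters. The helper name code_point_label and the sample characters are assumptions chosen for this example, not part of the standard.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character.

    Hypothetical helper for illustration; ord() is the built-in that
    returns a character's Unicode reference number as an integer.
    """
    return f"U+{ord(ch):04X}"


# Sample characters written as escapes so the source file's own
# encoding cannot change them: A, e-acute, euro sign, grinning face.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(ch, code_point_label(ch))

# chr() goes the other way, from reference number back to character.
euro = chr(0x20AC)
```

The same number identifies the character regardless of which encoding is later used to store it.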

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 and UTF-16. UTF stands for Unicode Transformation Format.

UTF-8 encodes each character in 1 to 4 bytes and can represent any character in the Unicode standard. UTF-8 is backwards compatible with ASCII and is the preferred encoding for e-mail and web pages.
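
The 1 to 4 byte range and the ASCII compatibility can be checked with a short Python sketch. The helper name utf8_length and the sample characters are assumptions made for this example.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs to encode the given text."""
    return len(ch.encode("utf-8"))


# One sample per byte length: A, e-acute, euro sign, grinning face.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(ch, utf8_length(ch), ch.encode("utf-8").hex())

# Backwards compatibility with ASCII: text made only of ASCII characters
# produces exactly the same bytes in both encodings.
ascii_bytes = "Hello".encode("ascii")
utf8_bytes = "Hello".encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes
```

Characters in the ASCII range take one byte each, which is why existing ASCII files are already valid UTF-8.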

UTF-16 is a variable-length character encoding, using one or two 16-bit code units per character, that is capable of encoding the entire Unicode repertoire. UTF-16 is used in major operating systems and environments, such as Microsoft Windows, Java and .NET.
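
The variable length of UTF-16 can be seen in a similar Python sketch. The helper name utf16_code_units is an assumption for this example, and the "utf-16-le" codec is used only so that no byte order mark is added to the output.

```python
def utf16_code_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for the given text.

    Uses the little-endian codec, which does not prepend a byte order
    mark, so the byte count divided by two is the code unit count.
    """
    return len(ch.encode("utf-16-le")) // 2


# A and the euro sign fit in one code unit; the grinning face
# (U+1F600) lies above U+FFFF and needs a pair of code units.
samples = ["A", "\u20ac", "\U0001F600"]

for ch in samples:
    print(ch, utf16_code_units(ch), ch.encode("utf-16-be").hex())
```

Code that assumes every character occupies exactly two bytes in UTF-16 will therefore mishandle characters beyond U+FFFF.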

HTML4 supports UTF-8. HTML5 supports both UTF-8 and UTF-16, but its default character encoding is UTF-8.

If an HTML5 web page uses a character set other than UTF-8, this should be declared in a <meta> tag, for example:

<meta charset="ISO-8859-1">
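
To suggest why the declaration matters, the Python sketch below decodes the same bytes under two different assumptions about their encoding. The sample word and the helper name decode_as are assumptions for this example; it models only the decoding step, not how a browser actually selects an encoding.

```python
def decode_as(data: bytes, encoding: str) -> str:
    """Decode raw bytes using the named character encoding."""
    return data.decode(encoding)


# The word "cafe" with an acute accent on the e, stored as UTF-8 bytes.
utf8_bytes = "caf\u00e9".encode("utf-8")

# Decoded with the correct encoding, the text comes back intact.
print(decode_as(utf8_bytes, "utf-8"))

# Decoded as ISO-8859-1, each byte becomes its own character, so the
# two-byte UTF-8 sequence for e-acute turns into two unrelated characters.
print(decode_as(utf8_bytes, "iso-8859-1"))

# The reverse mistake fails outright: the single ISO-8859-1 byte for
# e-acute (0xE9) is not a valid UTF-8 sequence at the end of the data.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
try:
    decode_as(latin1_bytes, "utf-8")
except UnicodeDecodeError as err:
    print("cannot decode as UTF-8:", err.reason)
```

A declared character set that matches the bytes actually served avoids both kinds of failure.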