Character Sets

A character set is a list of characters that may appear in a document, and a character encoding is a way of storing these characters on a computer as bits.
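
To make the distinction concrete, the following minimal sketch stores one character under two different encodings. Python is used purely for illustration, and the sample character is an arbitrary choice rather than anything taken from this page.

```python
# One character, two encodings: the character stays the same,
# only the bytes used to store it differ.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE (arbitrary sample character)

latin1_bytes = ch.encode("iso-8859-1")  # stored as a single byte
utf8_bytes = ch.encode("utf-8")         # stored as two bytes

print(latin1_bytes.hex(), utf8_bytes.hex())

# Reading the bytes back with the wrong encoding gives garbled text,
# which is why a document needs to declare its encoding.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)
```

The last two lines show what a reader would see if a UTF-8 document were interpreted as ISO-8859-1.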

 

When developing HTML documents you should declare the encoding the document is saved in. The form of the declaration depends on the HTML version:

 

For HTML5, the default character encoding is UTF-8.  This should be declared in the HTML <meta> tag as:

      <meta charset="UTF-8">

 

For HTML 2.0 to HTML 4.01, the default character encoding is ISO-8859-1.  This should be declared in the HTML <meta> tag as:

      <meta http-equiv="Content-type" content="text/html; charset=iso-8859-1">
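
As a rough illustration of how a tool might read either form of declaration, here is a small sketch using Python's standard html.parser module. The names CharsetFinder and declared_charset are hypothetical helpers invented for this example, and the logic is deliberately simpler than the encoding detection a real browser performs.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Hypothetical helper: records the first encoding declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)  # attribute names arrive already lower-cased
        if attrs.get("charset"):
            # HTML5 form: <meta charset="UTF-8">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: content="text/html; charset=iso-8859-1"
            for part in (attrs.get("content") or "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.strip().lower() == "charset" and value.strip():
                    self.charset = value.strip().lower()


def declared_charset(html_text):
    """Return the declared encoding in lower case, or None if none is found."""
    finder = CharsetFinder()
    finder.feed(html_text)
    return finder.charset


print(declared_charset('<meta charset="UTF-8">'))
print(declared_charset(
    '<meta http-equiv="Content-type" content="text/html; charset=iso-8859-1">'
))
```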

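Another common character encoding is Windows-1252 (often loosely called ANSI).  Windows-1252 is almost identical to ISO-8859-1, but it uses the range 0x80 to 0x9F, which ISO-8859-1 reserves for control codes, for additional displayable characters such as curly quotation marks and the euro sign (27 of the 32 positions are assigned).  Because many web developers have mistakenly used Windows-1252 characters in ISO-8859-1 documents, browsers treat a document declared as ISO-8859-1 as Windows-1252, and the HTML5 specification requires this behaviour.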
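
The difference between the two encodings can be seen by decoding the same byte both ways. This is a minimal sketch using Python's built-in codecs; the byte 0x93 is an arbitrary example, and Python's cp1252 codec treats unassigned positions as errors, which other implementations may handle differently.

```python
# Bytes 0x80-0x9F are control codes in ISO-8859-1 but mostly
# displayable characters in Windows-1252.
byte = b"\x93"
as_latin1 = byte.decode("iso-8859-1")  # U+0093, an invisible control code
as_cp1252 = byte.decode("cp1252")      # U+201C, left double quotation mark

# Count how many of the 32 positions Python's cp1252 codec can decode.
assigned = []
for value in range(0x80, 0xA0):
    try:
        assigned.append(bytes([value]).decode("cp1252"))
    except UnicodeDecodeError:
        pass  # position left undefined by this codec

print(repr(as_latin1), repr(as_cp1252), len(assigned))
```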

 

Character references

Character references allow web authors to refer to characters using either:

  • a symbolic name (character entity references) or
  • their number, as specified in the document character set (numeric character references).

 

Character entity references

Character entity references allow you to use a simple, memorable name instead of a number to refer to a character.  

The benefits are: 

  1. People find names easier to remember than numbers. (e.g. "quot" is more memorable than 34, which is the number used to represent a quotation mark in a numeric character reference). 
  2. Browsers handle character entity references more reliably than numeric character references, as character entity references can refer to a character without making assumptions about the character set or encoding. 

The disadvantages are: 

  1. Some of the names are difficult to remember, and it can even be difficult to decipher what they represent from their description (e.g.  "raquo" stands for "right-pointing double angle quotation mark", while the single version is "rsaquo").
  2. Some browsers (e.g. early versions of Netscape) will not understand all the character entity references specified in HTML 4.0.
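
When a name is hard to decipher, it can be looked up programmatically. The sketch below assumes Python's standard html.entities table, which covers the HTML 4 entity names; the function describe is a hypothetical helper written for this example.

```python
import unicodedata
from html.entities import name2codepoint


def describe(entity_name):
    """Return (code point, Unicode character name) for an HTML 4 entity name."""
    code_point = name2codepoint[entity_name]
    return code_point, unicodedata.name(chr(code_point))


for name in ("quot", "raquo", "rsaquo"):
    print(name, describe(name))
```

The code point returned is the same number that a numeric character reference for that character would use.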

 

HTML 2.0 defined a small set of character entity references for markup-significant characters and accented ISO Latin-1 letters, and HTML 3.2 extended the list to cover the full ISO Latin-1 range.  In HTML 4, the list was extended to include symbols, mathematical symbols and Greek letters plus further markup-significant and internationalisation characters.

 

The syntax for a character entity reference is an ampersand (&), followed by the name of the entity, followed by a semi-colon (;), e.g.:

 

<p>&copy; Cloford.com.</p>

is displayed as:

© Cloford.com.
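
Outside a browser, the same substitution can be reproduced with a library. As a minimal sketch, Python's standard html module resolves character entity references with html.unescape and produces them for markup-significant characters with html.escape; the second sample string is invented for illustration.

```python
import html

# Resolve the entity reference, as a browser does when rendering.
source = "<p>&copy; Cloford.com.</p>"
decoded = html.unescape(source)
print(decoded)

# The reverse direction: replace markup-significant characters
# with entity references so they display literally.
escaped = html.escape('Tom & "Jerry" <cartoon>')
print(escaped)
```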

 

Numeric character references

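Numeric character references use a number to refer to a character in the document character set. For HTML 4 and later the document character set is Unicode (ISO 10646), so the number is the character's Unicode code point regardless of the encoding the file is saved in. The number can be written in either decimal or hexadecimal.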

The benefits are: 

  1. Numeric character references will be displayed by all browsers that conform with HTML 2 specifications, unlike many character entity references, which were only introduced with HTML 3.2 or HTML 4. 

The disadvantages are: 

  1. The numbers used for numeric character references are more difficult to remember than the simple symbolic names used in character entity references. 
  2. Numeric character references can cause problems with browsers that are not properly internationalised.

 

The syntax for numeric character references is an ampersand and a hash mark (&#), followed by a number in decimal, or the letter "x" and a number in hexadecimal, followed by a semi-colon (;), e.g.:

 

<p>&#8240; or &#x2030; displays the "per mille sign".</p>

This is displayed on the screen as:

‰ or ‰ displays the "per mille sign".
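
Both forms are derived from the same code point, as the following sketch shows. The helpers to_decimal_ref and to_hex_ref are hypothetical names introduced for this example; html.unescape is part of Python's standard library.

```python
import html


def to_decimal_ref(ch):
    """Build a decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


def to_hex_ref(ch):
    """Build a hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


per_mille = "\u2030"  # the per mille sign used in the example above
decimal_ref = to_decimal_ref(per_mille)
hex_ref = to_hex_ref(per_mille)
print(decimal_ref, hex_ref)

# Both references resolve to the same character.
print(html.unescape(decimal_ref) == html.unescape(hex_ref) == per_mille)
```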