Unicode



Unicode is the international standard that aims to assign a single unique integer, called a code point, to every character needed by every written human language, including many dead languages that are only in scholarly use.


Unicode Character Encoding Schemes (UCS and UTF)

The current standard for identifying individual characters independent of font or operating system is Unicode. In Unicode, a "Unicode scalar value" (aka "ISO/IEC 10646 code point") is a number that stands for an "abstract character", which is something like the Platonic idea of a character. This number must be distinguished from its representation in a specific stream of bits used by a computer. The way in which a sequence of numbers is represented as a sequence of bits to be written to a file, sent over the Internet, passed to another application, etc. is called a "serialization".

The mapping from a character set definition to the actual code units used to represent the data is called a "character encoding form". A character encoding form plus a serialization is called a "character encoding scheme".

The most important character encoding schemes for Unicode are UCS-2, UCS-4, UTF-8, and UTF-16 ("UCS" stands for "Universal Character Set", "UTF" for "UCS transformation format"). UCS-2 covers only a subset of all possible Unicode scalar values; UCS-4 can in principle serialize larger numbers than are used as Unicode scalar values. UTF-8 and UTF-16 cover exactly the range of possible Unicode scalar values.
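As a rough illustration of how the same abstract character ends up as different byte sequences, the following Python sketch encodes Greek upper-case Omega under several schemes. It assumes that Python's "utf-32-be" and "utf-32-le" codecs can stand in for UCS-4 (they produce the same four bytes for this character), and that "utf-16-be" matches UCS-2 for characters in the range UCS-2 covers; the variable names are chosen for this example only.

```python
# Encode one abstract character under several encoding schemes.
omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA, scalar value 937

encodings = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]

# Map each scheme name to the bytes it produces for the same character.
serialized = {name: omega.encode(name) for name in encodings}

for name, data in serialized.items():
    # bytes.hex(" ") separates the bytes with spaces (Python 3.8 or later).
    print(f"{name:10} {len(data)} byte(s): {data.hex(' ')}")
```

The scalar value stays 937 throughout; only the number and order of the bytes change (two bytes, ce a9, in UTF-8, but four bytes in the 32-bit schemes).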

An Example: Serializing Greek Upper-Case Omega in UCS-4

UCS-4 requires that each Unicode scalar value be encoded in 32 bits, i.e. 4 bytes of 8 bits each (the reason for the '4' in the name). Different ways to order these bytes are allowed.

For example, the Unicode scalar value of Greek upper-case Omega is 937. This is 1110101001 in binary notation. "UCS-4 Big Endian" prefixes this with zeros to reach the required 32 bits. Thus, serializing Greek upper-case Omega in UCS-4 Big Endian results in 00000000 00000000 00000011 10101001 written to a data stream. Other versions modify the order of the bytes (groups of 8 bits) to be serialized, e.g. Greek upper-case Omega in UCS-4 Little Endian is 10101001 00000011 00000000 00000000.
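The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: it treats the scalar value as a plain integer and pads it to four bytes by hand, and the helper name as_bits is made up for the illustration.

```python
code_point = 937  # scalar value of Greek upper-case Omega

# Pad the number to 4 bytes (32 bits) in both byte orders.
big_endian = code_point.to_bytes(4, byteorder="big")
little_endian = code_point.to_bytes(4, byteorder="little")


def as_bits(data: bytes) -> str:
    """Render each byte as a group of 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in data)


print(bin(code_point)[2:])     # 1110101001
print(as_bits(big_endian))     # 00000000 00000000 00000011 10101001
print(as_bits(little_endian))  # 10101001 00000011 00000000 00000000
```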

Unicode and XML Character References

XML character references like "&#937;" allow for the representation of Unicode scalar values which are not supported by the character encoding scheme used to serialize an XML document. They are a sort of meta-serialization. The character reference "&#937;" instructs an XML processor to replace this reference with a representation of the Unicode scalar value 937 (which stands for the Greek upper-case Omega abstract character). This is equivalent to the XML character reference "&#x3A9;" (and even "&#x03A9;" or "&#x0000003A9;", as leading zeros may be added ad libitum). The 'x' in this character reference indicates that the number is hexadecimal (also called 'sedecimal' by those who do not like the mixture of Latin and Greek), while in the absence of an 'x' it is a decimal number. Numbers in the hexadecimal system make use of 16 basic symbols (0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F) representing the numbers from 0 to 15. The decimal number '16' is represented by hexadecimal '10', decimal '17' by hexadecimal '11', decimal '255' is hexadecimal 'FF', decimal '256' is hexadecimal '100', and decimal '937' is hexadecimal '3A9'.
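The equivalence of the decimal and hexadecimal forms can be tried out with an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the element name doc and the list of references are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Decimal form, hexadecimal form, and hexadecimal forms with leading zeros.
references = ["&#937;", "&#x3A9;", "&#x03A9;", "&#x0000003A9;"]

# Wrap each reference in a tiny document and let the parser resolve it.
resolved = [ET.fromstring(f"<doc>{ref}</doc>").text for ref in references]

print(resolved)           # four copies of the same one-character string
print(int("3A9", 16))     # 937
print(format(937, "X"))   # 3A9
```

All four references resolve to the same character, because the parser only looks at the number they denote, not at how it is written.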

In practice, XML character references would be superfluous if only UCS-4, UTF-16 or UTF-8 were used for serializing XML documents. The purpose of character references is to represent Unicode characters in an XML document serialized with a character encoding scheme, say US-ASCII, that does not include all Unicode characters. Some people consider it a major design flaw of XML that it provides such a mechanism only for character data, but not for element and attribute names. For example, an XML document may use Greek upper-case Omega for the name of an element, but it is then impossible to serialize this document using US-ASCII.
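A small Python sketch can make both halves of this point concrete. It relies on the standard "xmlcharrefreplace" error handler of str.encode and on xml.etree.ElementTree; the sample strings are invented for the illustration.

```python
import xml.etree.ElementTree as ET

text = "Greek letter: \u03a9"

# Character data: anything outside US-ASCII is replaced by a character reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'Greek letter: &#937;'

# Element names: the same trick does not yield well-formed XML,
# because character references are not allowed inside names.
try:
    ET.fromstring("<&#937;/>")
    name_with_reference_accepted = True
except ET.ParseError:
    name_with_reference_accepted = False

print(name_with_reference_accepted)  # False
```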

An Example: Serializing an XML Character Reference of Greek Upper-Case Omega in UCS-4

The XML character reference "&#937;" serialized in UCS-4 Big Endian is just a concatenation of the individual characters '&' , '#' , '9', '3', '7', ';' (binary 100110, 100011, 111001, 110011, 110111, 111011):

00000000000000000000000000100110 00000000000000000000000000100011 00000000000000000000000000111001 00000000000000000000000000110011 00000000000000000000000000110111 00000000000000000000000000111011

That this represents Greek upper-case Omega is opaque to UCS-4. UCS-4 treats it just as the sequence of individual characters that constitute the character reference. It is a matter of a higher-level protocol to interpret this as Greek upper-case Omega. For XML, this is outlined in the XML 1.0 specification, sec. 4.1:

CharRef  ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
[...]
If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with "&#", the digits up to the terminating ; provide a decimal representation of the character's code point.

(The first line in this quotation is a formal description that describes how to generate a character reference.)
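The two levels described above can be sketched in Python: first the bit stream is decoded as UCS-4 Big Endian, which yields only the six characters of the reference, and then a separate step applies the CharRef rule. The regular expression is a direct transcription of the quoted production; the function name resolve_char_ref is hypothetical, and Python's "utf-32-be" codec is assumed to be an adequate stand-in for UCS-4 Big Endian.

```python
import re

# The six 32-bit groups from the example above.
bits = (
    "00000000000000000000000000100110 00000000000000000000000000100011 "
    "00000000000000000000000000111001 00000000000000000000000000110011 "
    "00000000000000000000000000110111 00000000000000000000000000111011"
)

# Level 1: the encoding scheme only recovers the individual characters.
raw = b"".join(int(group, 2).to_bytes(4, "big") for group in bits.split())
characters = raw.decode("utf-32-be")
print(characters)  # &#937;

# Level 2: the higher-level protocol (XML) interprets the reference.
# CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
CHAR_REF = re.compile(r"&#(?:([0-9]+)|x([0-9a-fA-F]+));")


def resolve_char_ref(reference: str) -> str:
    """Return the character denoted by a decimal or hexadecimal reference."""
    match = CHAR_REF.fullmatch(reference)
    if match is None:
        raise ValueError(f"not a character reference: {reference!r}")
    decimal, hexadecimal = match.groups()
    code_point = int(decimal) if decimal is not None else int(hexadecimal, 16)
    return chr(code_point)


print(resolve_char_ref(characters) == "\u03a9")  # True
```

A real XML processor additionally checks that the resulting code point is a legal XML character; that check is left out here to keep the sketch short.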

Unicode Fonts

"Fonts" are again from a different level of abstraction. A font is a set of so-called "glyph images" used for the visualization of characters. A character may be represented by different glyph images, and the same glyph image might represent different characters.

Note that an individual character can be described by its Unicode scalar value without any information about a specific font. Conversely, all Unicode fonts that contain the glyph will associate it with the same value. In principle, users do not need to know which font was used when creating a given Unicode text. If they have a font that contains glyphs for all characters in the text, they can display it. (Of course, in practice most fonts are nowhere near complete.)

Software Support for Unicode

Most contemporary Web browsers are able to display Unicode. Not all software, however, is as easy-going. Open Office accepts Unicode in UCS-2 and UTF-8, but not UCS-4; the digital library program Greenstone, likewise, prefers UTF-8.

Methods of Converting from UCS-4 to UTF-8

There are several methods of converting from UCS-4 to UTF-8.

Cut and Paste

If you are using a GUI (Graphical User Interface), the simplest method may be to load the file you wish to convert into a program that understands the current encoding (e.g. a contemporary browser), select and copy the text you wish to convert, and then paste the selection into the target program. A potential problem with this method, however, is that it may result in loss of other data, such as markup.[1] [2]

XSLT

A second, quite elegant, method for converting XML files is XSLT. Setting the "encoding" attribute of the xsl:output element to UTF-8 ensures that the output of the stylesheet is in the correct format, because an XSLT processor is required to convert its input to Unicode before any transformations take place.[3]

Perl (and Other) Scripts

Perl and other scripting languages can be used to convert characters from one encoding method to another. You may find useful examples at http://www.mail-archive.com/linux-utf8@nl.linux.org/msg02741.html. [4]
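As an example of the scripting approach, here is a minimal Python sketch (not taken from the linked message). It assumes the input carries a byte order mark, which Python's "utf-32" codec uses to pick the byte order; without one the codec falls back to the machine's native order. The function names and the file names ucs4.txt and utf8.txt are hypothetical.

```python
import tempfile
from pathlib import Path


def convert_ucs4_to_utf8(source: Path, target: Path) -> None:
    """Read a UCS-4 (UTF-32) file and write the same text as UTF-8."""
    text = source.read_bytes().decode("utf-32")
    # Write bytes directly so that line endings are left untouched.
    target.write_bytes(text.encode("utf-8"))


def demo() -> bytes:
    """Round-trip a short sample through temporary files."""
    with tempfile.TemporaryDirectory() as folder:
        source = Path(folder) / "ucs4.txt"
        target = Path(folder) / "utf8.txt"
        # "utf-32" writes a byte order mark followed by 4 bytes per character.
        source.write_bytes("\u03a9 &#937;".encode("utf-32"))
        convert_ucs4_to_utf8(source, target)
        return target.read_bytes()


print(demo())  # b'\xce\xa9 &#937;'
```

Note that such a script changes only the bytes. If the file is an XML document whose XML declaration names the old encoding, that declaration has to be updated separately.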

Utilities and Software

Software also exists to handle the job. One example is Dieter Köhler's Open XML Editor http://www.philo.de/xmledit/.[5]

