
Wednesday 11 February 2015

Unicode for the Half-initiated (Perl biased)

  1. Background

  2. Perl


Background

There are roughly 7 000 languages, about one third of which have a writing system. The challenge is to represent all of these writing systems with a single character encoding.

         And in the beginning was ASCII ...


         ("American Standard Code for Information Interchange", 1963)

ASCII represents 128 characters: 33 non-printable control characters, with the rest used to encode the English alphabet, digits and punctuation. ASCII developed from telegraphic codes. Its characters are encoded as 7-bit binary integers (with the most significant bit set to 0), giving a total of 128 possibilities. Using all 8 bits extends the range to 256 characters. One of the encodings covering this extended range (so-called extended ASCII) is Latin-1 (ISO-8859-1).
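To make the 7-bit point concrete, a minimal Perl sketch showing that ASCII code values fit into seven binary digits:

use strict;
use warnings;

# ASCII code values fit into 7 bits (0..127); the eighth bit is always 0.
for my $char ('A', 'a', '0', '~') {
    printf "%s  ord = %3d  binary = %07b\n", $char, ord($char), ord($char);
}
# A  ord =  65  binary = 1000001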

As computers spread to other countries, further encodings were needed to represent characters not available in ASCII. Western Europe uses Latin-1 (ISO-8859-1), Central Europe Latin-2 (ISO-8859-2), and so on.

These local character sets are limited in the characters they can represent. So the Unicode Consortium was founded in 1991 in an attempt to unify all character representations and provide one standard capable of representing any writing system. A collection of all known characters was started; each character was assigned a unique number, called a code point.

A code point is usually written as a four- to six-digit hex number prefixed with U+ (e.g. U+07FF). Characters also have descriptive names, like WHITE SMILING FACE (☺) or SNOWMAN (☃). Apart from base characters like A, there are accents and decorations (umlaut etc.). A base character followed by an accent, forming one logical character, is called a grapheme. Unicode is an umbrella for different encoding forms: UTF-8, UTF-16 and UTF-32. UTF-8 is the most popular encoding; at the beginning of 2015 it was used on around 82% of World Wide Web pages.
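Perl exposes these names through the charnames module. A small sketch using the core functions charnames::viacode and charnames::string_vianame:

use strict;
use warnings;
use charnames ':full';                  # \N{...}, viacode, string_vianame
binmode STDOUT, ':encoding(UTF-8)';

my $smiley = "\N{WHITE SMILING FACE}";                     # U+263A
printf "%s is %s (U+%04X)\n",
       $smiley, charnames::viacode(ord $smiley), ord $smiley;

my $snowman = charnames::string_vianame('SNOWMAN');        # U+2603
printf "%s is U+%04X\n", $snowman, ord $snowman;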

[Image: how characters map to bytes in different encodings: http://www.w3.org/International/articles/definitions-characters/images/encodings-utf8.png]

Originally it was assumed that 16 bits per character, giving 65 536 (2^16) options, would suffice. However, the ambition soon became to represent all possible writing systems, so more code points were needed. The first 65 536 code points are called the Basic Multilingual Plane (BMP). There are 16 more planes, giving room for over 1 000 000 characters in total. These planes are not contiguously populated, leaving blocks of code points free for future assignment.
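A quick way to see which plane a code point belongs to is to divide it by 0x10000 (the size of one plane); a minimal sketch:

use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Plane number = code point / 0x10000; plane 0 is the BMP.
for my $cp (0x0041, 0x263A, 0x1F600) {
    printf "U+%04X %s  plane %d %s\n",
           $cp, chr($cp), int($cp / 0x10000),
           $cp <= 0xFFFF ? '(BMP)' : '(supplementary)';
}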

[Image: http://rishida.net/docs/unicode-tutorial/images/unicode-charset2.png]

UTF-8 Encoding


Number   First        Last         Byte sequence
of bits  code point   code point
---------------------------------------------------------------------------
 7       U+0000       U+007F       0xxxxxxx
11       U+0080       U+07FF       110xxxxx 10xxxxxx
16       U+0800       U+FFFF       1110xxxx 10xxxxxx 10xxxxxx
21       U+10000      U+1FFFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

UTF-8 is a variable-length encoding: different code point ranges are represented by a single byte or by a sequence of 2, 3 or 4 bytes. The first 128 characters are identical to ASCII and have the high-order bit 0. In a multi-byte sequence, the first byte starts with as many 1 bits as there are bytes in the sequence, followed by a 0, and each continuation byte starts with 10. This is how the system can parse an octet stream unambiguously and decode it back into characters.
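As an illustration with the core Encode module, encoding one character at each range boundary and dumping the resulting octets shows the 1-, 2-, 3- and 4-byte patterns from the table above:

use strict;
use warnings;
use Encode qw(encode);

# How many octets does UTF-8 need for code points at each boundary?
for my $cp (0x0041, 0x07FF, 0x0800, 0x1F600) {
    my $octets = encode('UTF-8', chr $cp);
    printf "U+%04X -> %d byte(s): %s\n",
           $cp, length $octets,
           join ' ', map { sprintf '%08b', ord } split //, $octets;
}
# U+0041 -> 1 byte(s): 01000001
# U+07FF -> 2 byte(s): 11011111 10111111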

Encode:  turn a character string into an octet (byte) string in a given encoding.
Decode:  turn an octet string in a given encoding into a character string.

If the system misinterprets a sequence of octets because it assumes the wrong encoding, garbled output appears (often as the replacement character), and Perl warns about a "wide character" when a character string is printed to a byte-oriented filehandle. The solution is to decode incoming octets into a character string, work with characters internally, and encode into the desired encoding on output.
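A minimal sketch of this decode-then-encode pattern with the core Encode module (the byte string is hard-coded for illustration):

use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from a file, socket or database.
my $octets = "\xE2\x98\x83";                     # UTF-8 bytes for U+2603 SNOWMAN

my $chars = decode('UTF-8', $octets);            # octets -> character string
printf "length in characters: %d\n", length $chars;   # 1, not 3

# Printing $chars to a byte-oriented handle would warn "Wide character in print";
# either encode() it first or put an :encoding layer on the handle.
binmode STDOUT, ':encoding(UTF-8)';
print $chars, "\n";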

BOM and surrogates

 

UTF-16 and UTF-32 use 2- and 4-byte units respectively, and so have to deal with endianness, the byte order used by a particular processor: big endian (most significant byte stored first) versus little endian. The BOM (Byte Order Mark) is the code point U+FEFF placed at the very beginning of a text; read with the wrong byte order it shows up as U+FFFE, so it tells the reader which byte comes first and allows correct decoding. UTF-8 does not suffer from the endian problem.
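A small sketch showing the BOM bytes that the core Encode module produces for each encoding form:

use strict;
use warnings;
use Encode qw(encode);

# The BOM is the code point U+FEFF; its byte order reveals the endianness.
for my $enc ('UTF-16BE', 'UTF-16LE', 'UTF-8') {
    my $bom = encode($enc, "\x{FEFF}");
    printf "%-8s BOM bytes: %s\n", $enc,
           join ' ', map { sprintf '%02X', ord } split //, $bom;
}
# UTF-16BE BOM bytes: FE FF
# UTF-16LE BOM bytes: FF FE
# UTF-8    BOM bytes: EF BB BF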

UTF-16 uses 2 bytes for all character representations, even the first 255 characters. Two byte encoding tackles the BMP, ie 65 536 characters; higher code points correspond to surrogate pairs, two 16 bit units.
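For illustration, the surrogate pair for a code point above the BMP can be computed by hand (a sketch; Encode normally does this for you):

use strict;
use warnings;

# Map a code point above the BMP to its UTF-16 surrogate pair.
my $cp   = 0x1F600;                      # an emoji outside the BMP
my $off  = $cp - 0x10000;                # 20-bit offset into the supplementary planes
my $high = 0xD800 + ($off >> 10);        # high (leading) surrogate
my $low  = 0xDC00 + ($off & 0x3FF);      # low (trailing) surrogate
printf "U+%X -> surrogate pair %04X %04X\n", $cp, $high, $low;
# U+1F600 -> surrogate pair D83D DE00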

 

Perl

use utf8;                        # to be able to use Unicode in variable names and literals
use feature "unicode_strings";   # to use character-based string operations
use open    ":encoding(UTF-8)";  # expect UTF-8 input and provide UTF-8 output
use charnames ":loose";          # to be able to use \N{WHITE SMILING FACE}
 
# setting the encoding layer on an already created filehandle

binmode STDOUT, ':encoding(iso-8859-1)';   # Latin-1 output
binmode $fh,    ':encoding(UTF-8)';        # strict UTF-8 (':utf8' also works but skips validity checks)
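Putting the pragmas together, a minimal sketch of a self-contained script (the sample string is made up for illustration):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # the literal below contains non-ASCII characters
use feature 'unicode_strings';
use open ':std', ':encoding(UTF-8)';   # STDIN/STDOUT/STDERR read and write UTF-8

my $greeting = "Fröhliche Weihnachten \N{SNOWMAN}";
printf "%s\n", $greeting;
printf "length in characters: %d\n", length $greeting;   # characters, not octets
printf "upper-cased: %s\n", uc $greeting;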

# Database access

# DBI

$dbh = DBI->connect($dsn, $user, $password,
                    { RaiseError        => 1,
                      AutoCommit        => 0,
                      mysql_enable_utf8 => 1 });

# DBIx::Class 

$self->{schema} = Schema->connect("dbi:mysql:somedatabase",
                                  'sqluser', 'scret',
                                  { mysql_enable_utf8 => 1 },
                  ) or die "Cannot connect to database";
 
# Catalyst model - DBIx::Class

__PACKAGE__->config(
    schema_class => 'MyApp::Schema',

    connect_info => {
        dsn               => 'dbi:mysql:somedatabase',
        user              => 'mysqluser',
        password          => 'scret',
        mysql_enable_utf8 => 1,
    },
);
