Saturday 7 February 2015

Unicode problem - wide character in print (Perl solution)

PROBLEM

I ran into the 'wide character in print' problem (the affected character appeared as the replacement character, a question mark in a black diamond) while working on a Catalyst application, although the issue itself was not Catalyst-related.

I had, as far as I know, all the important setup in place:

  1. Database:
    1. Database created with: 
      1. CREATE DATABASE IF NOT EXISTS xxxxx DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
    2. On the Catalyst side, connection to the database had the mysql_enable_utf8 flag set up
  2.  Code
    1. use utf8   
    2. use open ':encoding(utf8)'
    3. use feature 'unicode_strings'
    4. Template::Toolkit setup in Catalyst View:
      1.  __PACKAGE__->config(
            TEMPLATE_EXTENSION => '.tt',
            render_die => 1,
            WRAPPER    => 'wrapper.tt',
            ENCODING   => 'utf-8',
        );

Despite all this, I kept getting the infamous 'wide character in print' warning for the word Elégant, though not for Nice™. This was very puzzling because:
  • I did not encounter this problem in a RESTful implementation of the same functionality with JSON output
  • Everything seemed to be set up to handle Unicode correctly and consistently
  • Some Unicode characters displayed correctly, others did not

SOLUTION

The issue was caused by Perl assuming that the text retrieved from the database was encoded in utf8 (as instructed) when, for this particular character, it was in Latin-1. Latin-1 represents é as the single byte 0xE9, which is not a valid UTF-8 sequence on its own, so the system does not know how to handle it. Hence the replacement character.
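
To make the byte-level picture concrete, here is a minimal sketch (not part of the original application). It assumes the database handed back the Latin-1 bytes for Elégant; the variable names are purely illustrative. Decoding those bytes as UTF-8 yields the replacement character U+FFFD, whereas decoding them as Latin-1 and re-encoding as UTF-8 yields the two-byte sequence C3 A9.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption for illustration: the raw bytes as Latin-1 would store them,
# where "é" is the single byte 0xE9.
my $latin1_bytes = "El\xE9gant";

# Reading those bytes as UTF-8 fails for 0xE9. With Encode's default
# (lenient) check, the malformed byte is replaced by U+FFFD.
my $misread = decode('UTF-8', $latin1_bytes);

# Reading them as Latin-1 gives the intended 7-character string.
my $chars = decode('ISO-8859-1', $latin1_bytes);

# Encoding the character string as UTF-8 turns "é" into the bytes C3 A9.
my $utf8_bytes = encode('UTF-8', $chars);

printf "misread contains U+FFFD: %s\n", ($misread =~ /\x{FFFD}/ ? 'yes' : 'no');
printf "UTF-8 bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;
```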
 
The solution was to encode the retrieved text into utf8:

use Encode qw(encode);
...
my $text = encode('utf8', get_from_database());
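
As a rough, self-contained illustration of what this fix changes on output, the sketch below prints the string to an in-memory filehandle with and without encode. The get_from_database stub and the bytes_written helper are hypothetical stand-ins, not code from the application. I use 'UTF-8' here, which Encode treats as the strict variant of 'utf8'; either name produces the same bytes for this word.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for get_from_database(): returns a character
# string in which "é" is the single code point U+00E9.
sub get_from_database { return "El\x{E9}gant" }

# Hypothetical helper: print a string to an in-memory filehandle that has
# no encoding layer, and return the raw bytes that were written.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

# Render a byte string as space-separated hex for inspection.
sub hex_dump { return join ' ', map { sprintf '%02X', ord } split //, shift }

my $without_fix = bytes_written(get_from_database());
my $with_fix    = bytes_written(encode('UTF-8', get_from_database()));

print "without encode: ", hex_dump($without_fix), "\n";
print "with encode:    ", hex_dump($with_fix), "\n";
```

Without encode, the lone byte E9 goes out, which a UTF-8 consumer such as a browser cannot interpret; with encode, the valid UTF-8 pair C3 A9 is written instead.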

