Tuesday 15 December 2015

Golang Tips and Tricks


Installation of Go crypto/ssh module (Ubuntu 15.04)

 

go get -v code.google.com/p/go.crypto/ssh

code.google.com/p/go.crypto (download)
go: missing Mercurial command. See http://golang.org/s/gogetcmd
package code.google.com/p/go.crypto/ssh: exec: "hg": executable file not found in $PATH

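The download fails immediately because Mercurial (hg) is not installed, but the underlying problem is that the import path is obsolete: the crypto/ssh package was moved from code.google.com/p/go.crypto/ssh to golang.org/x/crypto/ssh.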

go get -v  golang.org/x/crypto/ssh

Fetching https://golang.org/x/crypto/ssh?go-get=1
Parsing meta tags from https://golang.org/x/crypto/ssh?go-get=1 (status code 200)
...
...
# golang.org/x/crypto/ssh
/home/tamara/.gvm/pkgsets/go1.3.3/global/src/golang.org/x/crypto/ssh/keys.go:492: undefined: crypto.Signer

crypto.Signer was only added in Go 1.4, so this error is fixed by upgrading Go to at least 1.4.

go get -v  golang.org/x/crypto/ssh

golang.org/x/crypto/curve25519
golang.org/x/crypto/ssh

Setting $GOPATH (golang workspace) when installing Go with gvm

 

Ubuntu comes with an obsolete Go version. We can use gvm, the Go version manager, to install a more recent Go version by following the steps in http://www.hostingadvice.com/how-to/install-golang-on-ubuntu/

Starting to use a particular Go version (gvm use go1.4.2) resets some of the Go environment variables:

go env

GOARCH="amd64"
GOBIN=""
GOCHAR="6"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/tamara/.gvm/pkgsets/go1.4.2/global"
GORACE=""
GOROOT="/home/tamara/.gvm/gos/go1.4.2"
GOTOOLDIR="/home/tamara/.gvm/gos/go1.4.2/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="g++"
CGO_ENABLED="1"

gvm pkgset list

gvm go package sets (go1.4.2)

=>  global

gvm pkgset create tamarakaufler
gvm pkgset use tamarakaufler

=> Now using version go1.4.2@tamarakaufler
gvm pkgset list
gvm go package sets (go1.4.2)

    global
=>  tamarakaufler
 
gvm pkgenv tamarakaufler

opens the default editor for configuration of the project-specific workspace. Edit lines 12 and 16 so that they include the project directory (here the $HOME/programming/go parts):

# line 12
export GOPATH; GOPATH="/home/tamara/.gvm/pkgsets/go1.4.2/tamarakaufler:$HOME/programming/go:$GOPATH"

# line 16
export PATH; PATH="/home/tamara/.gvm/pkgsets/go1.4.2/tamarakaufler/bin:${GVM_OVERLAY_PREFIX}/bin:$HOME/programming/go/bin:${PATH}"

Sunday 1 November 2015

Literals in Go

BACKGROUND

  • Go source code is written in Unicode characters, encoded in UTF-8
  • Literals represent fixed, ie constant, values
  • There are two types of literals in Go related to textual content: rune literals and string literals.

RUNE literals

A rune is an integer (int32, a 4-byte binary number) representing a Unicode code point, a unique number identifying a character in the Unicode character set. In UTF-8, the most common Unicode encoding, a code point is encoded as a sequence of one to four bytes. ASCII (covering the English alphabet, digits, punctuation and a group of unprintable control characters) has 128 code points; extended ASCII, covering most Western European languages, has 256.

A rune literal is expressed as one or more characters enclosed in single quotes, for example 'x' or '\n'. Any character may appear within the quotes except a newline and an unescaped single quote.
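A minimal sketch of the different spellings of a rune literal. The helper name runeInfo is made up for this illustration; the formatting verbs come from the standard fmt package.

```go
package main

import "fmt"

// runeInfo (a hypothetical helper) reports the character, its code point
// and the Go type used to store it.
func runeInfo(r rune) string {
	return fmt.Sprintf("%q is %U, stored as %T", r, r, r)
}

func main() {
	// A plain character, an escaped newline, a hex escape,
	// a Unicode escape and a non-ASCII character: all are rune literals.
	for _, r := range []rune{'a', '\n', '\x41', '\u263A', 'ž'} {
		fmt.Println(runeInfo(r))
	}
}
```

Whatever the spelling, the value is a plain integer, so runes can be compared and used in arithmetic like any other int32.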

STRING literals

A string literal represents a sequence of characters. There are two types: interpreted string literals and raw string literals.
Interpreted string literals are enclosed in double quotes. Any character is allowed within the quotes, except for a newline and an unescaped double quote.

Raw string literals are enclosed in back quotes. The character sequence can contain newlines and backslashes have no special escaping effect. Any carriage returns (\r) within the literal are stripped from the raw string.

Example


package main

import (
 "log"
 "net/http"
)

func main() {

 http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
  w.Write([]byte('<html><head>
    <title>Chatting away</title>
    </head><body>Go and chat!</body></html>
    '))
 })

 if err := http.ListenAndServe(":3333", nil); err != nil {
  log.Fatal("Starting server : ", err)
 }

}

The above code fails to compile, with errors on lines 11 and 14 (the lines holding the opening and the closing single quote): rune literal not terminated.

Keeping the single quotes and putting the whole string value on one line results in: illegal rune literal. Single quotes are reserved for single characters (runes), so no surprise there.

Changing the single quotes to double quotes still results in an error, this time: string literal not terminated. You may find it puzzling, just as I did, until you realize this is an interpreted string literal (because it is enclosed in double quotes) and recall that these cannot contain newlines.

There are two options here:
  1. Put the whole string value on one line:
            "<html><head><title>Chatting away</title></head><body>Go and chat!</body></html>"
  2. Use back quotes, ie use a raw string literal rather than an interpreted one. This allows the string value to be spread over several lines, because a raw string literal may contain newlines (only carriage returns are removed from it):
             `<html>
              <head><title>Chatting away</title></head>
              <body>Go and chat!</body>
              </html>`
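A sketch of the handler with the page held in a raw string literal. To keep the example runnable without binding a port, the handler is exercised in-process through net/http/httptest; in a real program it would be registered with http.HandleFunc and served with http.ListenAndServe, as in the listing earlier in this post. The names page, chatPage and servedBody are assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// page is a raw string literal: back quotes allow it to span several lines.
const page = `<html>
<head><title>Chatting away</title></head>
<body>Go and chat!</body>
</html>`

// chatPage writes the page to the response.
func chatPage(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte(page))
}

// servedBody calls the handler in-process and returns what it wrote.
func servedBody() string {
	rec := httptest.NewRecorder()
	chatPage(rec, httptest.NewRequest("GET", "/", nil))
	return rec.Body.String()
}

func main() {
	fmt.Println(servedBody())
}
```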

Thursday 26 March 2015

Arabic to Roman numerals conversion - Perl implementation

The implementation splits the integer into an array, whose elements are then processed using the same logic, with specifics of each decimal range of units, tens, hundreds and thousands, described in a configuration. The implementation works for any number. Numbers from 4000 upwards reuse configuration for the first 4 decimal ranges and the resulting Roman numeral(s) are decorated with * denoting multiplication by 1000 and | denoting multiplication by 1_000_000.

#!/usr/bin/perl 
#===============================================================================
#
#         FILE: arabic2roman.pl
#
#        USAGE: perl arabic2roman.pl  
#
#  DESCRIPTION: converts an arabic number into roman numerals
#               works for any number
#       AUTHOR: Tamara Kaufler (), 
#      CREATED: 25/03/15
#===============================================================================

use strict;
use warnings;
use utf8;
use v5.018;

use List::MoreUtils qw(any);
#use Data::Dumper    qw(Dumper);

# ------------- INITIAL SETUP ----------------

my %setup   = (
                leader      => 5,
                oddballs    => [4, 9],
);

## 'leader' and 'number' ... different for different decimal positions
## 'number' is appended/prepended to the 'leader'
##      first key  ... associated with the position of digits in a number to convert
##      second key ... corresponds to the above oddballs array indexes
##                     0: for digits up to 5
##                     1: for digits after 5
my %romans  = (
                0 => {  
                        number => 'I',
                        0  =>   {
                                   leader => 'V',
                                },
                },
                1 => {  
                        number => 'X',
                        0  =>   {
                                   leader => 'L',
                                },
                },
                2 => {  
                        number => 'C',
                        0  =>   {
                                   leader => 'D',
                                },
                },
                3 => {  
                        number => 'M',
                },
);

$romans{0}{1}{leader} = $romans{1}{number};
$romans{1}{1}{leader} = $romans{2}{number};
$romans{2}{1}{leader} = $romans{3}{number};
## for numbers >= 4000 
$romans{3}{0}{leader} = $romans{0}{0}{leader};
$romans{3}{1}{leader} = $romans{0}{0}{leader};

# ------------- INPUT ----------------

say "Give me a number to convert to Roman numerals, please:";
my $number = <STDIN>;
chomp $number;

say transform2roman($number);

# ---------- SUBROUTINES --------------

sub transform2roman {
    my ($number) = @_;

    ## reversing so that the array index conveniently matches the setup info
    my @digits = reverse split '', $number;

    ## roman numerals are pushed onto the arrays, then (reversed) concatenated
    my @numeral_parts = ();

    my @oddballs = @{$setup{oddballs}};
    my $leader   = $setup{leader};

    my $i=0;

    ## process each digit
    ##      push result onto an array
    ##      join to produce the result
    for my $digit (@digits) {

        ## skipping zero, nothing to do
        ## will be handled one level up
        do { $i++; next; } unless $digit;

        my $is_beyond = '';
        my $j = $i;

        ## for numbers >= 4000
        ## is_beyond ... mimics the bar shown above a high roman numeral when the number is >= 4000
        ##               represents multiplication by 1000
        if ($i >= 3 && $digit >= 4) {
            my $plunge   = int($i/3);
            $j           = $i - $plunge * 3;
            $is_beyond   = ($plunge % 2) ? '*' : '|';
            $is_beyond   = $is_beyond x $plunge;
        }

        ## the digit is 5
        if ($digit == $leader ) {
            push @numeral_parts,
                 ($is_beyond, 
                  $romans{$j}{0}{leader},
                  $is_beyond);
        ## the digit is 4 or 9
        } elsif (any { $_ == $digit } @oddballs) {
            my $idx = ($digit == $oddballs[0]) ? 0 : 1;
            push @numeral_parts, 
                ($is_beyond, 
                 $romans{$j}{$idx}{leader}, $romans{$j}{number}, 
                 $is_beyond);
        } else {
            ## the digit is greater than 5
            if ($digit > $leader) {
                push @numeral_parts, $is_beyond;
                map { push @numeral_parts, $romans{$j}{number} } 1 .. $digit-$leader;
                push @numeral_parts, $romans{$j}{0}{leader};
                push @numeral_parts, $is_beyond;
            ## the digit up to 5
            } else {
                push @numeral_parts, $is_beyond;
                map { push @numeral_parts, $romans{$j}{number} } 1 .. $digit;
                push @numeral_parts, $is_beyond;
            }
        }
        $i++; 
    }

    join '', reverse @numeral_parts;
}
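For comparison, the digit-by-digit idea can be sketched in Go. This is a simplified illustration, not a port of the Perl script: it handles 1 to 3999 only and leaves out the * and | decoration for larger numbers. The names symbols and toRoman are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// symbols[i] holds the "one", "five" and "ten" numerals of decimal position i
// (units, tens, hundreds, thousands).
var symbols = [][3]string{
	{"I", "V", "X"},
	{"X", "L", "C"},
	{"C", "D", "M"},
	{"M", "", ""},
}

// toRoman converts 1..3999 by processing the decimal digits from the right.
func toRoman(n int) (string, error) {
	if n < 1 || n > 3999 {
		return "", fmt.Errorf("%d is outside 1..3999", n)
	}
	var parts []string
	for pos := 0; n > 0; pos, n = pos+1, n/10 {
		one, five, ten := symbols[pos][0], symbols[pos][1], symbols[pos][2]
		var s string
		switch d := n % 10; {
		case d == 9: // oddball: one before ten
			s = one + ten
		case d >= 5: // five followed by 0 to 3 ones
			s = five + strings.Repeat(one, d-5)
		case d == 4: // oddball: one before five
			s = one + five
		default: // 0 to 3 ones
			s = strings.Repeat(one, d)
		}
		parts = append([]string{s}, parts...) // most significant part first
	}
	return strings.Join(parts, ""), nil
}

func main() {
	for _, n := range []int{4, 9, 2015, 1994} {
		r, err := toRoman(n)
		fmt.Println(n, r, err)
	}
}
```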

Sunday 8 March 2015

Parallelization examples in Perl

Background

If a process (a running instance of a script/application) has just one path of execution, one main thread, it may be possible to speed up the execution time by performing some tasks in parallel (tasks processed at the same time on multicore systems) or concurrently (tasks making progress by CPU context switching on single core machines). Parallelization (meaning here NOT processing tasks in sequence) implemented within one process can be achieved through:

  • child processes
  • threads
  • asynchronous processing

While child processes are independent of the parent process, threads are parallel paths of execution within a process. This affects system resources used, shared data and communication. Processes need to use Inter-process Communication. Threads share the memory address space and other resources (file handles, network sockets, locks etc) within the process. Therefore they can read from and write to the same variables and data structures and communicate directly.

Asynchronous processing is used in single-threaded processes to achieve parallelization through a non-blocking approach. Task A is started, but the flow is not blocked by waiting for the task to finish before moving on. Instead, the processing carries on, being able to tackle other tasks (ie, these tasks can run in parallel with task A), before the flow returns to task A when the task is finished, possibly returning data etc. Asynchronous processing is achieved through various implementations of an event loop.

Code Examples

Child processes


The example below contains two implementations of forking child processes. The first one uses the CPAN module Parallel::ForkManager, the second one uses plain fork. There are also two approaches to reaping of dead children: one through a CHLD signal handler, one using waitpid. In long running processes, finished children may stay around as zombies, putting a strain on system resources. Therefore it is important to ensure their removal/reaping.

#!/usr/bin/perl 

=head2 parallel_worker.pl

two implementations of forking child processes

=cut

use strict;
use warnings;
use utf8;
use v5.018;

use Parallel::ForkManager;
use POSIX ":sys_wait_h";

use Time::HiRes qw(time);
use Data::Printer;

my $pm = Parallel::ForkManager->new(4);

=head3 Process all files in parallel

loops through all the files to be processed
creates/forks child processes
reaps dead child processes:
    reaping of dead child processes/zombies. Zombies are processes
    that have finished execution but remain in the process table,
    in case the parent process needs to inquire about the child process
    exit status. If, for some reason, the zombies are not removed
    from the process table (reaped, by reading the child status
    through the wait system call), this can lead to resource leaks.
2 implementations:
    a) with Parallel::ForkManager
    b) with fork:
        1) uses CHLD signal handler
        2) uses waitpid

=cut

my @files = qw(a b c d e f g);
my %child = ();

# creating child processes: implementation 1
# ==========================================

DATA_LOOP:
foreach my $data (@files) {
    
    # forks a new child process
    my $pid;
    $pid = $pm->start and say "... child $pid" and next DATA_LOOP;

    # what will be done in the child process
    # until ->finish is encountered
    sleep 3;

    # end the child process
    $pm->finish;
}

$pm->wait_all_children;
say ">>> DONE 1";

# creating child processes: implementation 2
# ==========================================

# child handler to reap dead children
# -----------------------------------
$SIG{CHLD} = sub {
    while ( (my $pid = waitpid(-1, WNOHANG)) > 0 ) {
        if (exists $child{$pid}) {
            delete $child{$pid};
            say "!!! deleted $pid";
        }
        return unless keys %child;
    }
};

foreach my $data (@files) {

    # create a child process
    # both the parent and the child continue
    # the flow execution from this point
    my $pid = fork;

    # failure to fork (fork returns undef)
    unless (defined $pid) {
        say "* failed to fork a process: $!";
        next;
    }

    # child process (fork returns 0) ---
    if ($pid == 0) {
        say "* in the child process $$";
        sleep 3;
        say "* still processing in child process $$";

        # the child process needs to exit
        # otherwise the flow execution will 
        # continue after the foreach loop
        # producing multiple 'DONE 2' statements
        # instead of just one
        exit 0;
    }
    # ----------------------------------

    # parent process (fork returns the pid of the child)
    $child{$pid} = undef;
    say "* forked child process $pid";
}

# wait until the CHLD signal handler has reaped all children
# (remove this line when using the waitpid loop below instead)
sleep 1 while keys %child;

### reaping dead child processes without child signal handler
###     to use: comment out the CHLD signal handler
###     and the 'sleep 1 while ...' line above,
###     and uncomment lines below
## ---------------------------------------------------
##while (keys %child) {
##    for my $key (keys %child) {
##        my $pid = waitpid($key, WNOHANG);
##
##        if ($pid == -1) {
##            say "\t>>> child $key does not exist";
##            delete $child{$key}; 
##
##            say "\t\t deleted key $key";
##        }
##
##        if ($pid == $key) {
##            delete $child{$key}; 
##
##            say "\t\t *** child $key reaped";
##            say "\t\t *** deleted key $key";
##        }
##        say ">>>--------------------------";
##    }
##}
##

p %child;
say ">>> DONE 2";

Threads

The task is to process files in a directory, that contain one number per line, and output the total sum of all numbers in all files.

The following script contains two implementations for comparison. Both use a job queue, a Thread::Queue object, holding jobs (the information needed to do a task). One job is to process one file, ie to calculate the sum of all numbers in that file. The implementations use

a) a single thread
b) a pool of threads

There is a limit on the number of created threads. The threads work in parallel and take jobs off the job queue until there is no more work to be done. One job in this example is the calculation of the sum of all numbers in one file.

Taking a job off the job queue with ->dequeue_nb is non-blocking: it returns undef as soon as the queue is empty. This allows the flow to continue even when there are no more jobs in the queue. Blocking dequeuing (->dequeue) would require a mechanism that deals with this scenario and allows the program to continue.

After a thread processes a file, it returns the file sum. After all threads are created, we join them/wait for them to finish and retrieve the partial sums, which we then process further.

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use v5.018;

use threads;
use Thread::Queue;

my ($t0_a, $t1_a, $t0_b, $t1_b, $td_a, $td_b);

use List::Util qw(sum);
use Data::Printer;
use Benchmark qw(timediff timestr);

$| = 1;    # autoflush the output
my $MAX_THREADS  = 5;
my $data_dir     = './test';

my %work_queue   = ();
my @results      = ();
my $files_count  = 0;

opendir(my $dh, $data_dir) or die "can't opendir $data_dir: $!";

# files with the .txt extension
my @files        = grep { /\.txt\z/ } readdir $dh;
$files_count     = scalar @files;

closedir $dh;

say "\n*************************************************";
say "*** Jobs: one job == processed file ***";
say "*************************************************\n";
p @files;

# Job queue
my $q = Thread::Queue->new();

# Add jobs to the job queue
$q->enqueue(@files);

say "\n*************************************************";
say "*** One thread takes jobs off a job queue ***";
say "*************************************************\n";

say "Pending jobs:";
p $q->pending();

=head2 One thread

Each thread will take work off the work queue
while work is available

=cut

$t0_a = Benchmark->new;

my $thr = threads->create(
    sub {
        my $sum = 0;

        # Thread will loop until no more work
        #   using ->dequeue will block the execution
        #   when there are no jobs to be done, unless 
        #   another mechanism takes care of that
        #   and handles the empty job queue
        while (defined (my $file = $q->dequeue_nb())) {
            my $incr_sum = _get_file_sum("$data_dir/$file");   
            $sum += $incr_sum; 
        }
        return $sum;
    }
);

{
    my @thr_results = map { $_->join() } threads->list();

    $t1_a = Benchmark->new;
    $td_a = timestr(timediff($t1_a, $t0_a));

    p @thr_results;
    say "Done: sum is " . sum @thr_results;
    say "Run time = $td_a";
}

=head2 Thread pool 


To avoid the cost of creating new threads, we shall create
a thread pool and reuse threads that are available to do
more work 


=cut

say "\n*************************************************";
say "*** A pool of threads: each thread takes jobs off\nthe job queue while jobs are available ***";
say "*************************************************\n";

say "Pending jobs after the previous processing:";
p $q->pending();

# Send work to the thread
$q->enqueue(@files);

# Signal that there is no more work to be sent
$q->end();

say "Pending jobs:";
p $q->pending();

$t0_b = Benchmark->new;

# Lower the number of created threads if the number of jobs is lower than the
# allowed thread limit
$MAX_THREADS = ($MAX_THREADS > $files_count) ? $files_count : $MAX_THREADS;

say "\nCreating a pool of $MAX_THREADS threads\n";

for (1 .. $MAX_THREADS) {
    my $thr = threads->create(
        sub {
            my $sum = 0;

            # Thread will loop until no more work
            #   using ->dequeue will block the execution
            #   when there are no jobs to be done, unless 
            #   another mechanism takes care of that
            #   and handles the empty job queue
            while (defined (my $file = $q->dequeue_nb())) {
                my $incr_sum = _get_file_sum("$data_dir/$file");   
                $sum += $incr_sum; 
            }
            return $sum;
        }
    );
}

=head3 Wait for all threads to finish and collect all results

=cut

{
    my @thr_results = map { $_->join() } threads->list();

    say "Pending jobs:";
    p $q->pending();

    say "Collected results:";
    p @thr_results;

    $t1_b = Benchmark->new;
    $td_b = timestr(timediff($t1_b, $t0_b));

    say "Done: sum is " . sum @thr_results;

    say "Run time when 1 thread => $td_a";
    say "Run time when $MAX_THREADS threads => $td_b";
}

exit(0);

=head2 PRIVATE METHODS

=head3 _get_file_sum

=cut

sub _get_file_sum {
    my ($file) = @_;

    open my $fh, '<', $file or die "$!";

    # For benchmarking purposes
    sleep 1;

    my $work;
    while (my $line = <$fh>) {
        chomp $line;
        $work += $line;
    }

    say "\t\tFile $file: sum = $work" if defined $work;

    return $work;
}

Code on github
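The job queue and thread pool idea can also be sketched in Go, where a channel plays the role of the job queue and goroutines the role of the threads. This is an illustration under assumptions, not a translation of the script: in-memory slices of numbers stand in for the files, and the name sumAll is made up.

```go
package main

import (
	"fmt"
	"sync"
)

// sumAll distributes jobs (one slice of numbers per "file") over a pool
// of workers and adds up the partial sums they return.
func sumAll(jobs [][]int, workers int) int {
	if workers < 1 {
		workers = 1
	}
	queue := make(chan []int) // job queue
	partial := make(chan int) // one partial sum per worker

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sum := 0
			for nums := range queue { // ends when the queue is closed
				for _, n := range nums {
					sum += n
				}
			}
			partial <- sum
		}()
	}

	go func() {
		for _, j := range jobs {
			queue <- j
		}
		close(queue) // signal that no more work will be sent
		wg.Wait()    // all workers have delivered their partial sum
		close(partial)
	}()

	total := 0
	for s := range partial {
		total += s
	}
	return total
}

func main() {
	files := [][]int{{1, 2, 3}, {4, 5}, {6}}
	fmt.Println("sum is", sumAll(files, 2))
}
```

Closing the channel corresponds to calling ->end on the Thread::Queue: workers blocked on an empty queue wake up and finish instead of waiting forever.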

Sunday 22 February 2015

Perl one liners and other tools for fetching/posting internet content (WIP)


 

Fetching content

 
curl -X GET http://tttt.co.uk/api/asset/id/1
 
curl -H "Accept: application/json" -H "Content-Type: application/json" \
     -X GET  http://localhost:3030/download5/test.json 

wget http://tttt.co.uk/api/asset/id/1 -O test1.json

perl -MLWP::Simple -e 'print get("http://tttt.co.uk/api/asset/id/1")'

perl -MLWP::Simple -e 'getstore("http://tttt.co.uk/api/asset/id/1", "url_content.json")'

perl -MHTTP::Tiny -e 'print HTTP::Tiny->new->get("http://tttt.co.uk/api/asset/id/1")->{content}' 

 

Posting content

 
curl -X POST -d "name=aaaa&category=3" http://localhost:3010/api/asset
              (posts data using the Content-Type application/x-www-form-urlencoded)  
curl -H "Accept: application/json" -H "Content-Type: application/json" \
     -X POST --data '{"name":"aaaa","category":"3"}' http://localhost:3030/api/asset
 
curl -X POST -F 'file=@data.csv'  http://localhost:3010/api/category
             (posts data using the Content-Type multipart/form-data) 
 
curl -X POST -T 'asset.csv'  http://localhost:3010/api/asset
             (upload of a file) 
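For comparison with the curl JSON POST, here is a sketch of the same request in Go using only the standard library. The function names postJSON and echoRoundTrip are made up, and the throwaway server created with net/http/httptest simply echoes the request body so that the example runs without the API from the commands above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// postJSON sends payload as a JSON body and returns the status and response body.
func postJSON(url string, payload interface{}) (int, string, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return 0, "", err
	}
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return 0, "", err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Accept", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(data), err
}

// echoRoundTrip posts to a local test server that echoes the request body.
func echoRoundTrip() (int, string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(w, r.Body)
	}))
	defer srv.Close()
	return postJSON(srv.URL+"/api/asset", map[string]string{"name": "aaaa", "category": "3"})
}

func main() {
	status, body, err := echoRoundTrip()
	fmt.Println(status, body, err)
}
```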
 

Saturday 21 February 2015

Chat server implemented in Perl, based on AnyEvent

Event-driven implementation of a chat server, with one main processing thread.

Uses the tcp_server function from AnyEvent::Socket for easy creation of a non-blocking TCP server. Inside the connection callback, the connecting client is informed about the already connected clients, and the client information (host:port identifier and the client socket handle) is stored in a hash. The client file/socket handle, available in the tcp_server callback after a client connects to the server, is wrapped in an AnyEvent::Handle object to allow event-driven access and manipulation. The on_read handler of the client socket handle deals with the client message, sending it to all other connected clients. The client can send a message either directly, or first send OK, followed by the message itself.

#!/usr/bin/perl
 
=head2 chat_server.pl

Perl chat server based on AnyEvent

Server:     perl $0
Clients:    telnet 127.0.0.1 8888 (run in several terminals)
            clients communicate by:
                                    a) sending message terminated with carriage return
                                    b) sending OK, followed by carriage return
                                       sending  message terminated with carriage return  

=cut

use strict;
use warnings;
use utf8;
use v5.018;

use AnyEvent;                           # creates event driven loop
use AnyEvent::Socket qw(tcp_server);    # provides high level function to create tcp server
use AnyEvent::Handle;                   # creates non-blocking (socket) handle

use Data::Dumper qw(Dumper);

sub _inform_clients;

=head2 Store connected clients in a hash structure

key:    $host:$port ..... uniquely identifies a connected client
value:  socket handle ... so we can continue communication with individual clients

=cut

my %client = ();

=head2 Create TCP server

allow connection from everywhere, on a specified port

=cut

tcp_server undef, 8888, sub {
    my ($fh, $host, $port) = @_;

    say "[$host:$port] connected";

=head3 On connection, tell the client how many are already connected

=cut

    syswrite $fh, "Hello friend. There are currently " . scalar(keys %client) . 
                  " connected friends.\015\012";

    _inform_clients(\%client, "Friend [$host:$port] joined us!");

=head3 Create nonblocking socket handle for the client

=cut

    my $hdl = AnyEvent::Handle->new(
        fh => $fh,
    );

=head3 Store client information

=cut

    my $client_key = "$host:$port";
    $client{$client_key} = $hdl;

=head3 On error, clear the read buffer

=cut

    $hdl->on_error (sub {
        my $data = delete $_[0]{rbuf};
    });

=head3 On receiving a message from a client

We expect:

    sending a regular message
        either "OK\n", then a message
        or      directly a message
    disconnecting
        send quit/QUIT followed by carriage return

=cut

    my $writer; 
    $writer = sub {
        my ($hdl, $line) = @_;
        say "Reading from client: [$line]";

        my @clients = keys %client;
        say Dumper(\@clients);

        # The client cannot disconnect until we release its handle
        if ($line =~ /\A(?:quit|bye|exit)\z/i) {

            my $client_count = (scalar keys %client) - 1;       # exclude the leaving client
            say "REMAINING (apart from this): $client_count";

            # Send message to each client
            for my $key (@clients) {

                if ($key eq $client_key) {
                    $hdl->push_write("Bye\015\012");
                }
                else {
                    my $message = ($client_count > 1) ? "only $client_count of us left\015\012" : 
                                                         "You are the only one left :(. Send quit/QUIT to disconnect\015\012";
                    $client{$key}->push_write("Friend $client_key is leaving us, $message");
                }

            }

            $hdl->push_shutdown;
            delete $client{$client_key};
            
        }
        # if we got an "OK", we have to _prepend_ another line read,
        # so that the line following the "OK" (the actual message) is read
        # by this callback before any read request already in the queue
        elsif ($line eq "OK") {
            $_[0]->unshift_read (line => sub {
                my $response = $_[1];
                for my $key (grep {$_ ne $client_key} @clients) {
                    $client{$key}->push_write("$response from $client_key\015\012");
                }
            });
        }
        elsif ($line) {
            for my $key (grep {$_ ne $client_key} @clients) {
                my $response = $line;
                $client{$key}->push_write("$response from $client_key\015\012");
            }
        }
    };

=head3  Enter the request handling loop

=cut

    $hdl->on_read (sub {
        my ($hdl) = @_;

        # Read what was sent, when request/message received
        # (then distribute the message)
        $hdl->push_read (line => $writer);
    });

};

=head3 Start the event loop

=cut

AnyEvent->condvar->recv; 

=head2 SUBROUTINES

_inform_clients

=cut

=head2 _inform_clients

sends a message to all known/stored clients

=cut

sub _inform_clients {
    my ($client_href, $message) = @_;

    for my $key (keys %$client_href) {
        $client_href->{$key}->push_write("$message\015\012");
    }
}

Source code on github 

Numeral systems and bit shifting quick overview

Numeral Systems


Numeral system   Radix/root   Digits                             Example    In decimal system
------------------------------------------------------------------------------------------------
Binary           2            0,1                                0,1
Byte             2            groupings of 8 binary digits       01010001   (0 × 2^7) + (1 × 2^6) + (0 × 2^5) + (1 × 2^4) + (0 × 2^3) + (0 × 2^2) + (0 × 2^1) + (1 × 2^0)
                              (representation of a byte or
                              integer (16/32/64 bits))
Decimal          10           0-9                                124        (1 × 10^2) + (2 × 10^1) + (4 × 10^0)
Octal            8            0-7                                02732      (2 × 8^3) + (7 × 8^2) + (3 × 8^1) + (2 × 8^0)
Hexadecimal      16           0-9,A-F (corresponding to 10-15)   0x2AF3     (2 × 16^3) + (A × 16^2) + (F × 16^1) + (3 × 16^0)
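The positional expansions can be checked with a few lines of Go. This is a small sketch using strconv.ParseInt from the standard library; the helper name toDecimal is made up for the illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// toDecimal parses digits written in the given radix.
func toDecimal(digits string, radix int) (int64, error) {
	return strconv.ParseInt(digits, radix, 64)
}

func main() {
	examples := []struct {
		digits string
		radix  int
	}{
		{"01010001", 2},
		{"124", 10},
		{"2732", 8},
		{"2AF3", 16},
	}
	for _, ex := range examples {
		n, err := toDecimal(ex.digits, ex.radix)
		fmt.Println(ex.digits, "in base", ex.radix, "=", n, err)
	}

	// In Go source code the radix is carried by the literal's prefix.
	fmt.Println(0b01010001, 0o2732, 0x2AF3)
}
```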

Bit Shifting

 

Value      128   64   32   16    8    4    2    1
Power      2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
Bit          0    1    0    1    0    0    0    1   => 1*64 + 1*16 + 1*1 = 81

 

 

Arithmetic bit shifting to the right with >>

 


Makes bits fall off the right and adds zero padding to the left. This is equivalent to integer division by 2 to the power of the number on the right of the operator.

01010001 = 1*64 + 1*16 + 1*1 = 81

$y =  0b01010001 >> 1
>> 1 … shifting by one position to the right:
01010001 → 00101000 = 1*32 + 1*8 = 40   (ie int(81/2))

Arithmetic bit shifting to the left with <<


Makes bits fall off the left and adds zero padding to the right. This is equivalent to multiplication by 2 to the power of the number on the right of the operator.

01010001 = 1*64 + 1*16 + 1*1 = 81

$y =  0b01010001 << 1

<< 1 … shifting by one position to the left, ie multiplying by 2 to the power of 1:
01010001 → 10100010 = 1*128 + 1*32 + 1*2 = 162 (ie 81 * 2)


NOTE

 

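The result of the shift needs to fit within the allowed range of the original value type, otherwise the bits shifted out are lost:

Wrong (for an 8-bit value):

<< 2:
01010001 (81) → 01000100 = 1*64 + 1*4 = 68 (instead of 81 * 4 = 324, which does not fit into 8 bits)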
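The same shifts can be tried in Go, where the uint8 type gives exactly the 8-bit behaviour discussed above. A minimal sketch; the function names shiftRight and shiftLeft are made up for the example.

```go
package main

import "fmt"

// shiftRight and shiftLeft work on 8-bit values, so bits shifted past
// either end of the byte are lost.
func shiftRight(x uint8, n uint) uint8 { return x >> n }
func shiftLeft(x uint8, n uint) uint8  { return x << n }

func main() {
	var x uint8 = 0b01010001 // 81

	fmt.Printf("%08b >> 1 = %08b (%d)\n", x, shiftRight(x, 1), shiftRight(x, 1))
	fmt.Printf("%08b << 1 = %08b (%d)\n", x, shiftLeft(x, 1), shiftLeft(x, 1))

	// 81 * 4 = 324 does not fit into 8 bits, so the top bit is lost.
	fmt.Printf("%08b << 2 = %08b (%d)\n", x, shiftLeft(x, 2), shiftLeft(x, 2))
}
```

Using a wider type, for example uint16, keeps the bit that falls off here and gives 324.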

Wednesday 11 February 2015

Unicode for the Half-initiated (Perl biased)

  1. Background

  2. Perl


Background

There are under 7 000 languages, about one third of which have a writing system. The challenge is to be able to represent all writing systems using one encoding set.

         And at the beginning was ASCII ...


         ("American Standard Code for Information Interchange", 1963)

ASCII represents 128 characters: 33 non-printable control characters, with the rest used for encoding the English alphabet, digits and punctuation. ASCII developed from telegraphic codes. Its characters are encoded as 7-bit binary integers (with the most significant bit of a byte being 0), giving a total of 128 possibilities. Using 8 bits extends the range to 256 characters. One of the encodings covering this range (called extended ASCII) is Latin-1/ISO-8859-1.

After computers spread to other countries, other encodings were needed to represent characters in other languages, not available in ASCII. Western Europe uses Latin-1 (ISO-8859-1), Central Europe Latin-2 (ISO-8859-2) etc.

These local character sets are limited in their ability to provide character representations. So the Unicode Consortium was created in 1991 in an attempt to unify all character representations and provide one encoding able to represent any writing system. A collection of all known characters was started. Each character was assigned a unique number, called a code point.

The code point is usually written as a four or six digit hex number (eg U+07FF). Some characters have a user-friendly name, like WHITE SMILING FACE (☺) or SNOWMAN (☃).  Apart from base characters like A etc, there are accents and decorations (umlaut etc). A character followed by an accent, forming a logical character, is called a grapheme. Unicode is an umbrella for  different encoding forms: UTF-8, UTF-16 and UTF-32. UTF-8 is the most popular encoding, at the beginning of 2015 used on around 82% of World Wide Web pages.

                                      Picture of how characters map to bytes.

http://www.w3.org/International/articles/definitions-characters/images/encodings-utf8.png

Originally it was assumed that 16 bits per character, giving 65 536 (2^16) options, would suffice. However, soon the ambition was to be able to represent all possible writing systems, so more bits were needed for character representations. The first 65 536 code points are called the Basic Multilingual Plane (BMP). There are 16 more planes, designed to hold over 1,000,000 characters. These planes are not contiguously populated, leaving blocks of code points for future assignment.

 http://rishida.net/docs/unicode-tutorial/images/unicode-charset2.png


 

UTF-8 Encoding


Number of bits   First code point   Last code point   Bytes
-----------------------------------------------------------------------------------------------------
 7               U+0000             U+007F            0xxxxxxx
11               U+0080             U+07FF            110xxxxx 10xxxxxx
16               U+0800             U+FFFF            1110xxxx 10xxxxxx 10xxxxxx
21               U+10000            U+10FFFF          11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

(21 bits could address code points up to U+1FFFFF, but Unicode ends at U+10FFFF.)

UTF-8, unlike UTF-32, is a variable length encoding, where different code point ranges are represented by 1 byte or a sequence of 2, 3 or 4 bytes. The first 128 characters are equivalent to ASCII. These have the highest order bit 0. For code points represented by more bytes, the first byte starts with as many 1s as there are bytes in the sequence, followed by a 0; each of the remaining bytes starts with 10. This is how a system can understand the octet stream and decode it into characters.
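The byte patterns can be made visible with a short Go sketch based on the standard unicode/utf8 package. The helper name utf8Bits is made up, and the four sample characters are chosen here to cover the 1, 2, 3 and 4 byte cases.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf8Bits renders the UTF-8 encoding of r as groups of eight binary digits.
func utf8Bits(r rune) string {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	out := ""
	for i, b := range buf[:n] {
		if i > 0 {
			out += " "
		}
		out += fmt.Sprintf("%08b", b)
	}
	return out
}

func main() {
	for _, r := range []rune{'A', 'é', '☃', '😀'} {
		fmt.Printf("%U needs %d byte(s): %s\n", r, utf8.RuneLen(r), utf8Bits(r))
	}
}
```

In the output the leading byte of each multi-byte sequence starts with 110, 1110 or 11110, and every following byte starts with 10.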

Encode:     character string into binary string (octets)
Decode:     binary string (octets) into character string

If the system cannot interpret a sequence of octets, because it assumes a wrong encoding, a replacement character is used. In Perl, printing a character string that has not been encoded produces a warning about a wide character. The solution is to decode incoming octets into a character string, and to encode the character string into the desired encoding on output.

BOM and surrogates

 

UTF-16 and UTF-32 use 2-byte and 4-byte units respectively for character representation and need to deal with endianness, the byte order associated with the particular processor. Big endian order stores the most significant byte first, little endian order the least significant byte first. The BOM (Byte Order Mark) is the code point U+FEFF placed at the beginning of the text; it is seen as the byte sequence FE FF in big endian order and as FF FE in little endian order, which clarifies the byte order (which byte is the first one?) and allows correct decoding. UTF-8 does not suffer from the endian problem.

UTF-16 uses at least 2 bytes for all character representations, even the first 256 characters. The two-byte encoding covers the BMP, ie 65 536 code points; higher code points are represented by surrogate pairs, two 16-bit units.
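Surrogate pairs and the two byte orders can be illustrated with the standard unicode/utf16 package in Go. A small sketch; the helper names utf16Units, bigEndian and littleEndian are made up for the example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16Units returns the 16-bit code units representing r in UTF-16.
func utf16Units(r rune) []uint16 {
	return utf16.Encode([]rune{r})
}

// bigEndian and littleEndian lay one 16-bit code unit out as two bytes.
func bigEndian(u uint16) [2]byte    { return [2]byte{byte(u >> 8), byte(u)} }
func littleEndian(u uint16) [2]byte { return [2]byte{byte(u), byte(u >> 8)} }

func main() {
	fmt.Printf("%04X\n", utf16Units('☃'))  // inside the BMP: one code unit
	fmt.Printf("%04X\n", utf16Units('😀')) // outside the BMP: a surrogate pair

	const bom = 0xFEFF
	fmt.Printf("BOM big endian: % X, little endian: % X\n",
		bigEndian(bom), littleEndian(bom))
}
```

A decoder that reads FF FE at the start of a stream therefore knows the text is little endian, because U+FFFE is not a valid character.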

 

Perl

use utf8;                        # to be able to use unicode in variable names and literals
use feature "unicode_strings";   # to use character based  string operations
use open    ":encoding(UTF-8)";  # expect UTF-8 input and provide UTF-8 output
use charnames ":loose";          # to be able to use \N{WHITE SMILING FACE}
 
# already created filehandle 

binmode STDOUT, ':encoding(iso-8859-1)';
binmode $fh,    ':encoding(UTF-8)';

# Database access

# DBI

$dbh = DBI->connect($dsn, $user, $password,
                    { RaiseError => 1, AutoCommit => 0,  

                      mysql_enable_utf8 => 1 }); 

# DBIx::Class 

$self->{ schema } = Schema->connect("dbi:mysql:somedatabase", 
                                        'sqluser', 'scret',
                                       { mysql_enable_utf8 => 1 },
                        ) or die "Cannot connect to database";
 
# Catalyst model - DBIx::Class

 __PACKAGE__->config(
    schema_class => 'MyApp::Schema',
    
    connect_info => {
        dsn => 'dbi:mysql:somedatabase',
        user => 'mysqluser',
        password => 'scret',
        mysql_enable_utf8 => 1,
    }   
);