Wednesday 31 July 2013

Transliterating Serbian Cyrillic to Serbian Latin on Linux with PHP

Mozilla has been shipping Firefox in Serbian for many years, and we ship it in Cyrillic script. That means that our software, our sites, and our documentation are all in Cyrillic for Serbian.

You may not know it (especially if you are not European), but Serbian can be written in both Cyrillic and Latin scripts. People live with the two writing systems, a phenomenon called synchronic digraphia.

I was wondering if it would be easy to create a version of Firefox or Firefox OS in Latin script. Since our l10n community server just got an upgrade and now has PHP 5.4, I played a bit with the Transliterator class introduced in that version, which uses the ICU library.

Basically, it works, and it works well, with one caveat: the ICU library shipped with Linux distros is old and has a bug in Serbian transliteration that was fixed in more recent ICU releases.

How does it work? Here is a code example:

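// Requires the intl extension (PHP 5.4+), which wraps the ICU library.
$source = 'Завирите у будућност';
// Transliterator::create() returns null when the ID cannot be resolved.
$t = Transliterator::create('Serbian-Latin/BGN') or die(intl_get_error_message());
print "Serbian (Cyrillic): $source <br>";
print "Serbian (Latin): {$t->transliterate($source)}";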

And here is the output:

Serbian (Cyrillic): Завирите у будућност
Serbian (Latin): Zavirite u budućnost

The bug I mentioned earlier is that the Cyrillic letter ј (je) is systematically converted to an uppercase J, even when the letter is inside a word and should be lowercase.

Example: This string: Најгледанији сајтови
Should be transliterated to: Najgledaniji sajtovi
But my script transliterated it to: NaJgledaniJi saJtovi
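To find out whether a given server is affected, the example above can be turned into a small self-check. This is only a sketch: the function name transliterationMatches is made up for illustration, and the expected string is simply the correct transliteration quoted above.

```php
<?php
// Hypothetical helper (not part of PHP or ICU): returns true when the
// transliteration matches the expected Latin text, false when it does not,
// and null when the transliterator could not be created at all.
function transliterationMatches($source, $expected)
{
    $t = Transliterator::create('Serbian-Latin/BGN');
    if ($t === null) {
        return null;
    }
    return $t->transliterate($source) === $expected;
}

// The test string and its correct transliteration come from the example above.
$ok = transliterationMatches('Најгледанији сајтови', 'Najgledaniji sajtovi');

if ($ok === null) {
    print "Could not create the transliterator: " . intl_get_error_message() . "\n";
} elseif ($ok) {
    print "Transliteration looks correct on this system.\n";
} else {
    print "This system shows the uppercase J bug.\n";
}
```

Running it from the command line and from the web server is worthwhile, since the two may not load the same php.ini or the same ICU library.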

I filed a bug in the PHP ticket system and got an immediate response that my test script actually works on Windows. After some investigation by the PHP developer, it turned out that there is no bug on the PHP side. The bug is in the ICU library that ships with the OS, which happens to be version 48.x on Linux distros, while Windows enjoys a more recent version 50 and the ICU project itself is at version 51.2.
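A quick way to see which ICU version PHP is using is the INTL_ICU_VERSION constant defined by the intl extension. The sketch below compares it against a minimum version. The helper name icuIsOlderThan is made up, and the threshold of 50 is an assumption based only on the report that version 50 works on Windows; I don't know in which exact release the fix landed.

```php
<?php
// Hypothetical helper: true when $version is lower than $minimum.
// ICU 48 reports itself as "4.8.x", which version_compare() also
// treats as lower than "50".
function icuIsOlderThan($version, $minimum = '50')
{
    return version_compare($version, $minimum, '<');
}

if (!extension_loaded('intl')) {
    print "The intl extension is not loaded.\n";
} else {
    printf("PHP %s is using ICU %s\n", PHP_VERSION, INTL_ICU_VERSION);
    if (icuIsOlderThan(INTL_ICU_VERSION)) {
        // Assumption: anything older than 50 may still have the bug.
        print "This ICU may still have the Serbian transliteration bug.\n";
    }
}
```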

Unfortunately, I couldn't find any .deb package or PPA for Ubuntu that offers a more recent ICU library version. Chris Coulson from Canonical pointed me to this ticket in Launchpad: [request] upgrade to icu 50, but it was unassigned.

As a consequence, I had to compile the newer ICU library myself to make it work. Fortunately, I could follow almost all the steps indicated in this post for a CentOS distro. I only had to adjust the php.ini locations (and also update the php.ini file for the development server) and restart Apache :)

So now I can easily transliterate a full repository from Cyrillic to Latin. I put a gist online with the full script that converts a repo, if you want to use it.
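For readers who just want the general idea, converting a repository could look roughly like the sketch below. This is not the script from the gist: the function name transliterateTree, the list of file extensions, and the example path are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: transliterates every file under $root whose
// extension is listed in $extensions, rewriting it in place.
// Returns the number of files that were actually changed.
function transliterateTree($root, Transliterator $t, array $extensions)
{
    $converted = 0;
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($files as $file) {
        if (!$file->isFile() || !in_array($file->getExtension(), $extensions, true)) {
            continue;
        }
        $original = file_get_contents($file->getPathname());
        $latin    = $t->transliterate($original);
        // Only write when the transliteration succeeded and changed something.
        if ($latin !== false && $latin !== $original) {
            file_put_contents($file->getPathname(), $latin);
            $converted++;
        }
    }

    return $converted;
}

// Example usage (path and extensions are assumptions, adjust to your repo):
// $t = Transliterator::create('Serbian-Latin/BGN');
// print transliterateTree('/path/to/sr-repo', $t, array('dtd', 'properties', 'ini'));
```

The files are overwritten in place, so it is safer to run this on a copy or a clean checkout. Filtering by extension also keeps the script away from most version control metadata, although it does not explicitly skip those directories.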