Transliterating Serbian Cyrillic to Serbian Latin on Linux with PHP

Post

Mozilla has been shipping Firefox in Serbian for many years, and we ship it in Cyrillic script. That means that our software, our sites, and our documentation are all in Cyrillic for Serbian.

You may not know it (especially if you are not European), but Serbian can be written in both Cyrillic and Latin scripts. People live with the two writing systems, a phenomenon called synchronic digraphia.

I was wondering if it would be easy to create a version of Firefox or Firefox OS in Latin script. Since our l10n community server just got an upgrade and now has PHP 5.4, I played a bit with the new Transliterator class in that version, which uses the ICU library.

Basically, it works, and it works well, with one caveat: I found out that the ICU library shipped with my Linux distro is old and exposes a bug in Serbian transliteration that was fixed in more recent ICU releases.

How does it work? Here is a code example:

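// Requires the intl extension (the Transliterator class exists since PHP 5.4).
$source = 'Завирите у будућност';
// 'Serbian-Latin/BGN' is the ICU transform ID for Serbian Cyrillic to Latin.
$t = Transliterator::create('Serbian-Latin/BGN');
print "Serbian (Cyrillic): $source <br>";
print "Serbian (Latin): {$t->transliterate($source)}";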

And here is the output:

Serbian (Cyrillic): Завирите у будућност
Serbian (Latin): Zavirite u budućnost
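The example above assumes that the intl extension is loaded and that ICU knows the transform ID. As a minimal sketch of a more defensive variant, the hypothetical helper below (to_serbian_latin() is my own name, not something from the original script) checks each step and reports failures. It is written for PHP 7 or later because of the type declarations.

```php
<?php
// Hypothetical helper for illustration; not part of the original script.
function to_serbian_latin(string $text): string
{
    // The Transliterator class only exists when the intl extension is loaded.
    if (!class_exists('Transliterator')) {
        throw new RuntimeException('The intl extension is not loaded.');
    }

    // Transliterator::create() returns null when the ID cannot be resolved.
    $t = Transliterator::create('Serbian-Latin/BGN');
    if ($t === null) {
        throw new RuntimeException(
            'Cannot create transliterator: ' . intl_get_error_message()
        );
    }

    // transliterate() returns false on failure.
    $result = $t->transliterate($text);
    if ($result === false) {
        throw new RuntimeException(
            'Transliteration failed: ' . $t->getErrorMessage()
        );
    }

    return $result;
}

echo to_serbian_latin('Завирите у будућност'), PHP_EOL;
```

Throwing exceptions is only one possible design; returning null and letting the caller decide would work just as well for a one-off script.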

The bug I mentioned earlier is that the Cyrillic letter ј is systematically converted to an uppercase Latin J, even when the letter is inside a word and should be lowercase.

Example: this string: Најгледанији сајтови
Should be transliterated to: Najgledaniji sajtovi
But my script transliterated it to: NaJgledaniJi saJtovi

I filed a bug in the PHP ticket system and got an immediate response that my test script actually works on Windows. After some investigation by the PHP developer, it turned out that there is no bug on the PHP side. The bug is in the ICU library that ships with the OS, which happens to be version 48.x on Linux distros, while Windows enjoys a more recent version 50 and the ICU project itself is at version 51.2.
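To see which ICU version a given PHP build is linked against, and whether it shows the lowercase ј problem, something like the following sketch can help. The intl extension exposes the INTL_ICU_VERSION constant. The two helper functions are hypothetical names of mine, and the threshold of 50 is an assumption based only on the observation above that version 50 on Windows behaves correctly; I do not know in which exact release the fix landed.

```php
<?php
// Hypothetical helpers for illustration.

// True when the installed ICU version is at least the given minimum.
function icu_is_at_least(string $installed, string $minimum): bool
{
    return version_compare($installed, $minimum, '>=');
}

// True when the given transliteration callable mangles a lowercase ј.
// 'сајт' should come out as 'sajt'; the buggy ICU produces an uppercase J.
function has_lowercase_je_bug(callable $transliterate): bool
{
    return $transliterate('сајт') !== 'sajt';
}

if (defined('INTL_ICU_VERSION')) {
    printf("ICU version: %s\n", INTL_ICU_VERSION);

    // Assumption: 50 is taken from the post, not from the ICU changelog.
    if (!icu_is_at_least(INTL_ICU_VERSION, '50')) {
        echo "ICU is older than 50, the Serbian bug may be present.\n";
    }

    $t = Transliterator::create('Serbian-Latin/BGN');
    if ($t !== null) {
        echo has_lowercase_je_bug([$t, 'transliterate'])
            ? "Lowercase ј is transliterated incorrectly.\n"
            : "Lowercase ј is transliterated correctly.\n";
    }
} else {
    echo "The intl extension is not loaded.\n";
}
```

Testing the actual behaviour is more reliable than comparing version numbers, since distributions sometimes backport fixes.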

Unfortunately, I couldn't find any .deb package or PPA for Ubuntu that would provide a more recent ICU library version. Chris Coulson from Canonical pointed me to this ticket in Launchpad: [request] upgrade to icu 50, but it was unassigned.

As a consequence, I had to compile the newer ICU library myself to make it work. Fortunately, I could follow almost all the steps indicated in this post for a CentOS distro. I only had to adjust the php.ini locations (and also update the php.ini file for the development server) and restart Apache :)

So now I can easily transliterate a full repository from Cyrillic to Latin. I put a gist online with the full script that converts a repo, if you want to use it.
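To give an idea of what converting a whole repository involves, here is a minimal sketch. It is not the gist mentioned above. convert_tree() and the paths in the usage comment are hypothetical, and the sketch assumes Linux-style paths and UTF-8 text files. It copies every file from a source tree into a target tree and passes the contents through a callable, skipping version-control metadata.

```php
<?php
// Hypothetical sketch for illustration; not the script from the gist.
// Returns the number of files written to $targetDir.
function convert_tree(string $sourceDir, string $targetDir, callable $convert): int
{
    $sourceDir = rtrim($sourceDir, '/');
    $targetDir = rtrim($targetDir, '/');
    $count = 0;

    // Walk the source tree recursively; only files are yielded by default.
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($sourceDir, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($files as $file) {
        $relative = substr($file->getPathname(), strlen($sourceDir) + 1);

        // Leave version-control metadata (.hg, .git) alone.
        if (preg_match('#(^|/)\.(hg|git)(/|$)#', $relative)) {
            continue;
        }

        $converted = $convert(file_get_contents($file->getPathname()));
        if (!is_string($converted)) {
            throw new RuntimeException("Conversion failed for $relative");
        }

        // Recreate the directory structure in the target tree.
        $destination = $targetDir . '/' . $relative;
        if (!is_dir(dirname($destination))) {
            mkdir(dirname($destination), 0777, true);
        }
        file_put_contents($destination, $converted);
        $count++;
    }

    return $count;
}

// Usage (hypothetical paths), with the intl Transliterator as the converter:
// $t = Transliterator::create('Serbian-Latin/BGN');
// convert_tree('/path/to/sr-cyrl', '/path/to/sr-latn', [$t, 'transliterate']);
```

Writing into a separate target directory keeps the Cyrillic originals untouched, which makes it easy to rerun the conversion after an ICU upgrade.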

Comments

1. Wednesday 31 July 2013, 21:14, by tom jones

there is really no need to use a heavy-hitter like ICU for such a simple task (afaik, it needs several tens of MB of memory).

transliteration from serbian cyrillic to latin can be done with the built-in str_replace() function:

function transliterate($source) {

   static $cyr = ['а', 'б', 'в', 'г', 'д', 'ђ', 'е', ..., 'А', 'Б', 'В', ...];
   static $lat = ['a', 'b', 'v', 'g', 'd', 'đ', 'e', ..., 'A', 'B', 'V', ...];
   return str_replace($cyr, $lat, $source);

}

this is possible because cyrillic->latin transliteration is a direct, one-to-one function, without *any* exceptions. the other way around, latin->cyrillic, isn't as simple (the letters Љ, Њ and Џ need two characters in the latin alphabet: Lj, Nj and Dž).

and as str_replace() is implemented in C, this should have similar performance (while using less memory).
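For reference, the table-based approach described in this comment could be sketched as follows, with the full 30-letter alphabet spelled out. This is my own illustration rather than the commenter's code: the function name is hypothetical, and it uses strtr() with an associative array instead of str_replace(), which does the replacement in a single pass. It is written for PHP 7 or later.

```php
<?php
// Hypothetical illustration of the lookup-table approach.
function serbian_cyrillic_to_latin(string $text): string
{
    // Each Cyrillic letter maps to one fixed Latin string.
    // Љ, Њ and Џ become the digraphs Lj, Nj and Dž.
    static $map = [
        'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd',
        'ђ' => 'đ', 'е' => 'e', 'ж' => 'ž', 'з' => 'z', 'и' => 'i',
        'ј' => 'j', 'к' => 'k', 'л' => 'l', 'љ' => 'lj', 'м' => 'm',
        'н' => 'n', 'њ' => 'nj', 'о' => 'o', 'п' => 'p', 'р' => 'r',
        'с' => 's', 'т' => 't', 'ћ' => 'ć', 'у' => 'u', 'ф' => 'f',
        'х' => 'h', 'ц' => 'c', 'ч' => 'č', 'џ' => 'dž', 'ш' => 'š',
        'А' => 'A', 'Б' => 'B', 'В' => 'V', 'Г' => 'G', 'Д' => 'D',
        'Ђ' => 'Đ', 'Е' => 'E', 'Ж' => 'Ž', 'З' => 'Z', 'И' => 'I',
        'Ј' => 'J', 'К' => 'K', 'Л' => 'L', 'Љ' => 'Lj', 'М' => 'M',
        'Н' => 'N', 'Њ' => 'Nj', 'О' => 'O', 'П' => 'P', 'Р' => 'R',
        'С' => 'S', 'Т' => 'T', 'Ћ' => 'Ć', 'У' => 'U', 'Ф' => 'F',
        'Х' => 'H', 'Ц' => 'C', 'Ч' => 'Č', 'Џ' => 'Dž', 'Ш' => 'Š',
    ];

    // strtr() with an array replaces every key in one pass over the string.
    return strtr($text, $map);
}

echo serbian_cyrillic_to_latin('Најгледанији сајтови'), PHP_EOL;
```

One limitation of this sketch: in a word written entirely in capitals, Љ, Њ and Џ would come out as Lj, Nj and Dž rather than LJ, NJ and DŽ, so all-caps text would need extra handling.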

2. Wednesday 31 July 2013, 22:41, by Pascal Chevrel

@Tom I know that a simple table can solve the case of Serbian transliteration, but now please explain to me how you would solve the case of Traditional Chinese to Simplified Chinese transliteration, or any other transliteration. The Transliterator class *is part of the PHP language*, and the ICU library is also a native C library already installed on servers; if it is not, it should be, because you need it to use all of the php-intl functions (http://php.net/manual/fr/book.intl.php). Seriously, for a one-off task of converting string repositories from Serbian Cyrillic to Serbian Latin, one could not care less about potentially consuming a few extra megabytes, and I seriously doubt that str_replace() is going to be faster than the native Transliterator C class.

3. Wednesday 31 July 2013, 23:43, by tom jones

hey, what's with the defensive tone? i was just pointing out an easy alternative for this simple task.

anyway, for your "one off task", you had to file 2 bugs (one for the Operating System no less), plus compile and configure a library..

but whatever..

4. Thursday 1 August 2013, 08:45, by Pascal Chevrel

Sorry for the defensive tone, Tom. In hindsight I realize I read your comment after a very bad day and it shows. I didn't want to sound aggressive, sorry about that!

5. Wednesday 14 August 2013, 17:51, by Ted Mielczarek

I wonder if we'll be able to do this in JS soon? We recently imported ICU into the Mozilla tree to enable work on the ECMAScript i18n API.