This text is based on an article rewritten by former core team member David Gal for a German Linux publication.
UTF-8 is a variable-length character encoding that uses one to four bytes per character, depending on the Unicode symbol. Four bytes may seem like a lot for one character, but this is required only for characters outside the Basic Multilingual Plane, which are generally very rare. The first 128 code points (positions 0-127) are encoded as single bytes identical to ASCII, which gives UTF-8 full backward compatibility with ASCII.
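As a rough illustration of the variable-length scheme, the following Python sketch encodes a few sample characters and reports how many bytes each one needs. Python is used here purely for demonstration and is not part of Joomla!; the sample characters and the helper name `utf8_length` are chosen for this example only.

```python
# Sample characters chosen for illustration (written as escapes):
# "A" (ASCII), a-umlaut, the euro sign, and the musical G clef,
# which lies outside the Basic Multilingual Plane.
samples = ["A", "\u00e4", "\u20ac", "\U0001d11e"]


def utf8_length(char: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(char.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s)")

# ASCII text is byte-for-byte identical whether encoded as ASCII or UTF-8.
ascii_unchanged = "Joomla!".encode("ascii") == "Joomla!".encode("utf-8")
print(ascii_unchanged)
```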
UTF-8 is becoming the standard, internationally accepted multilingual encoding and is the preferred way to communicate non-ASCII characters over the Internet. Among the Unicode encodings, UTF-8 has the special benefit of using only one byte to store or transmit each ASCII character. As the bulk of Internet transmissions use 7-bit ASCII characters, UTF-8 encoding saves storage volume and bandwidth.
It also provides a single encoding for all other characters, which were previously implemented using 8-bit character codes hand-in-hand with a specific encoding table (e.g. ISO-8859-2) that determined how each character code was to be represented. Until now this basically limited Web page display to ASCII Latin characters plus one other language or set of diacritic Latin characters (accents and umlauts, for example). UTF-8 now provides one code page for all languages.
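The dependence on an encoding table can be seen in a small Python sketch (Python is used only for illustration). It decodes the same single byte with two different ISO-8859 tables and then shows that UTF-8 gives each resulting character its own byte sequence. The byte value 0xB1 is an arbitrary choice for this example.

```python
raw = bytes([0xB1])  # one byte, as a single-byte system would store it

# The same byte maps to a different character under each encoding table.
as_latin1 = raw.decode("iso-8859-1")  # Western European table: plus-minus sign
as_latin2 = raw.decode("iso-8859-2")  # Central European table: a with ogonek

print(f"ISO-8859-1: U+{ord(as_latin1):04X}")
print(f"ISO-8859-2: U+{ord(as_latin2):04X}")

# In UTF-8 both characters can live in the same text, because each one
# has its own unambiguous byte sequence and no extra table is needed.
utf8_bytes = (as_latin1 + as_latin2).encode("utf-8")
print(utf8_bytes.hex(" "))
```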
Migration to UTF-8 promises to be simple for existing ASCII texts, as the UTF-8 encoding of ASCII characters is unchanged.
Until now, changing from one encoding to another only required changing the _ISO definition in the language file, which resulted in the ‘charset=myNewEncoding’ statement in the HTML meta tag. This was simple because all encodings were single-byte character encodings. For the entire Joomla! system, a character equalled a byte, and Joomla! didn’t really care what the character representation of a particular byte was.
Now, in Joomla! 1.5, we are starting to use multi-byte characters and not only that – some are one byte long and some are 3 bytes long. How does this affect Joomla! and what is required to truly be able to state that Joomla! supports UTF-8?
The first major challenge was the fact that many hosts are still running MySQL version 4.0.x or older. These versions do not have UTF-8 support. It is possible to store UTF-8 data in non-UTF-8 tables: as far as the database is concerned, it is storing bytes and returning them to the application when needed.
However, as already mentioned, there is a possibility that a user will want to fill a 20-character field with UTF-8 characters that lie outside the regular ASCII range. If these are committed to a varchar(20) database field, the data will be truncated. This is a problem not only for languages that do not use “Latin characters” (which normally lie in the multi-byte area) but also for all European languages, with the possible exception of English. Every one of these languages has some special Latin characters (accents and umlauts, for example) that are now multi-byte characters. The word ‘käse’ is now 5 bytes long!
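The truncation risk can be sketched in Python (used only for illustration). The function `store_in_byte_column` is a hypothetical stand-in for a column that counts bytes rather than characters; it simply keeps the first `max_bytes` bytes and is not a model of how any particular MySQL version actually behaves.

```python
def store_in_byte_column(text: str, max_bytes: int) -> bytes:
    """Simulate a byte-counting column: keep only the first max_bytes bytes."""
    return text.encode("utf-8")[:max_bytes]


word = "k\u00e4se"  # the example word from the text
print(len(word), "characters,", len(word.encode("utf-8")), "bytes")

# Twenty a-umlauts are 20 characters but 40 bytes, so a 20-byte field
# can only hold the first half of them.
value = "\u00e4" * 20
stored = store_in_byte_column(value, 20)
recovered = stored.decode("utf-8", errors="replace")
print(len(recovered), "characters survived")

# A cut can also land in the middle of a character; decoding then yields
# a replacement character (U+FFFD) in place of the damaged one.
damaged = store_in_byte_column(word, 2).decode("utf-8", errors="replace")
print(repr(damaged))
```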
The core team rightly decided that Joomla! 1.5 should also be able to work on older databases and, moreover, that the backward compatibility should be transparent to the user. The installer now checks the MySQL version: if it is 4.1.2 or higher, UTF-8 tables are created and the user can choose the desired collation. If the database does not support UTF-8, the installer instead runs a separate script that creates a database structure with extra storage space for potentially longer strings. This is expected to eliminate the danger of data truncation by the database.
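The version check described above might look roughly like the following Python sketch. It is not the Joomla! installer code: the function names and the return labels "utf8" and "compat" are hypothetical, and only the 4.1.2 threshold comes from the text.

```python
import re

MIN_UTF8_VERSION = (4, 1, 2)  # threshold named in the text


def parse_version(version_string: str) -> tuple:
    """Extract the leading major.minor.patch numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return tuple(int(part) for part in match.groups())


def choose_schema(version_string: str) -> str:
    """Pick the UTF-8 schema when supported, else the wider fallback schema."""
    if parse_version(version_string) >= MIN_UTF8_VERSION:
        return "utf8"    # hypothetical label: create UTF-8 tables
    return "compat"      # hypothetical label: run the separate script


for version in ("4.0.27-standard", "4.1.2", "5.0.45-log"):
    print(version, "->", choose_schema(version))
```

Comparing the parsed numbers as a tuple avoids the pitfalls of comparing version strings as text, where "4.10.0" would sort before "4.2.0".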
The second major challenge relates to the lack of UTF-8 support in PHP. All standard string functions in PHP are only able to work with single-byte characters. Using these functions on UTF-8 encoded data can result in logical failures and also in data corruption.
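The kind of failure meant here can be imitated in Python by working on raw bytes, which behave much like strings in a byte-oriented string library. This is an illustrative sketch only; it does not call PHP and the variable names are made up for the example.

```python
data = "K\u00e4se".encode("utf-8")  # the word "Kaese" with a-umlaut, as raw bytes

# Logical failure: a byte-oriented length counts bytes, not characters.
byte_length = len(data)
char_length = len(data.decode("utf-8"))

# Data corruption: cutting after two bytes splits the two-byte a-umlaut
# and leaves a sequence that is no longer valid UTF-8.
first_two = data[:2]
try:
    first_two.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

# Byte-wise upper-casing only touches ASCII letters, so the a-umlaut
# is silently left in lower case.
shouted = data.upper().decode("utf-8")

print(byte_length, char_length, cut_is_valid, shouted)
```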
The problem lies in the fact that, until PHP 6 is released, there is no comprehensive native UTF-8 support in PHP. There is a multi-byte extension named ‘mbstring’, which has existed since version 4.1 but is not loaded by default. It also serves other multi-byte encodings, such as those used for some Far Eastern languages, which means that it may be present but not configured correctly for UTF-8. An additional extension named ‘iconv’, which has some parallel capability, is present in PHP 5 but is optional and missing some functions in PHP 4.
Here again, the core team opted for full backward compatibility and for a solution that is transparent to the user. The solution combines two approaches: PHP-provided functions are used if they are present, and a special library of UTF-8-aware string functions is used if no native PHP functions are available. This provides the best performance (when PHP functions are available) together with complete backward compatibility. A Joomla String Class provides this functionality and it will be included in the API for third party developers.
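The "native if available, library otherwise" approach can be sketched in Python as follows. This is not the Joomla String Class: all function names are hypothetical, the `native_available` flag stands in for a real capability check, and the fallback assumes its input is valid UTF-8.

```python
def utf8_strlen_native(data: bytes) -> int:
    """Fast path: let the runtime's own decoder count the characters."""
    return len(data.decode("utf-8"))


def utf8_strlen_fallback(data: bytes) -> int:
    """Portable path: count every byte that is not a continuation byte.

    Continuation bytes have the bit pattern 10xxxxxx, so each character
    contributes exactly one byte that does not match that pattern.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def utf8_strlen(data: bytes, native_available: bool = True) -> int:
    """Dispatch to the native function when present, else to the fallback."""
    if native_available:
        return utf8_strlen_native(data)
    return utf8_strlen_fallback(data)


sample = "K\u00e4se \u20ac".encode("utf-8")
print(utf8_strlen(sample, native_available=True))
print(utf8_strlen(sample, native_available=False))
```

Callers only ever see `utf8_strlen`, so the choice between the two implementations stays invisible to them, which is the transparency the text describes.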
There is no user configuration or setup required regarding PHP UTF-8 support. There is one small exception to this rule, which could theoretically occur if one or two of the mbstring settings on the host (those that cannot be changed from within code) are set to a value that is adverse to UTF-8. The installer will identify this and advise on how to change the setting locally using .htaccess.
Considering that data will, in most cases, need to be converted to UTF-8, the recommendation will be to migrate existing data to a freshly installed Joomla! 1.5 site rather than to perform upgrades of existing Joomla! 1.0.x sites. Specific migration guidance will be provided with the release.
Joomla! 1.5 will take a huge jump ahead of the rest of the CMS pack with its internationalisation features. UTF-8 is undoubtedly the big differentiator: all languages with one encoding. In addition, RTL support and the language packs for the back-end, installer and help system make Joomla! 1.5 a complete package for use in any language or combination of languages. JoomFish will be the icing on the cake.
Providing decent localisation support for Joomla!, the kind of support that will carry smoothly across to the work of all extension providers (from language packs to components, modules and plugins), requires a certain amount of attention to some nitty-gritty details. Some questions that pop up are: “How do we identify a language?”, followed by “How do we provide a consistent naming convention?” and “How will everyone know about this?”
Simple - we need a convention - preferably public - hopefully without ambiguity - and it should be kept current.
A little bit of digging unearthed RFC 3066, and a decision was made to use it as the convention for language identification in Joomla! as of version 1.5.
This results in the following conventions for the language names: