PHP and Unicode

Andi Gutmans from Zend, Andrei Zmievski from Yahoo!

– Definitions
. Character Encoding Form: representation of a character set using a number of integer codes (code values)

– Multi-i18n-what?
. Dealing with multiple encodings is a pain
. Different algorithms, conversion, detection, validation, processing… understanding
. Dealing with multiple languages is a pain too
. But cannot be avoided in this day and age

– Challenges
. Need to implement applications for multiple languages and cultures
. Perform language- and encoding-appropriate searching, sorting, word breaking, etc.
. Support date, time, number formats, etc.
….

– Can’t PHP do it now?
. PHP is a binary processor
. The string type is byte-oriented
. encoding? what encoding?
. BUT isn’t it sweet that string vars can contain images?
. Not if you are trying to do real work!
. iconv and mbstring aren’t enough
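The byte-oriented problem is easy to reproduce in any language that exposes raw bytes. A minimal sketch follows, written in Python rather than PHP and assuming UTF-8 text; the variable names are illustrative only and are not part of anything shown in the talk.

```python
# Illustration only: Python bytes stand in for a byte-oriented string type.
text = "na\u00efve"                 # "naive" with a diaeresis on the i: 5 characters
raw = text.encode("utf-8")          # U+00EF takes 2 bytes in UTF-8

char_count = len(text)              # counts code points
byte_count = len(raw)               # counts bytes

# Cutting after 3 bytes splits the 2-byte character in the middle.
truncated = raw[:3]
try:
    truncated.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(char_count, byte_count, cut_is_valid)
```

A byte-level length or substring function gives the wrong answer here, which is the gap a native Unicode string type is meant to close.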

– Unicode overview
. Developed by the Unicode Consortium
. covers all major living scripts
. version 4.0 has 96k+ characters
. capacity for 1M+ characters
. Unicode character set = ISO 10646

– Unicode is generative
. Composition can create “new” characters
. Base + non-spacing (combining) character(s)
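Composition can be seen with a base letter plus a combining accent. The sketch below uses Python's standard unicodedata module (not ICU, and not the PHP API from the talk) to show that the composed and decomposed spellings are different code point sequences until they are normalized.

```python
import unicodedata

# Base letter 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))              # number of code points before composition
print(len(composed))                # number of code points after composition
print(decomposed == composed)       # raw comparison of the two spellings

# NFD goes the other way and restores the base + combining sequence.
round_trip = unicodedata.normalize("NFD", composed)
print(round_trip == decomposed)
```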

– Unicode is cool
. Multilingual
. Rich and reliable set of character properties
. standard encodings: UTF-8, UTF-16, UTF-32
. Algorithm specifications provide interoperability
. But unicode != i18n
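The three standard encodings differ in how many bytes they spend per character. A small Python sketch makes this concrete; the sample characters and the helper name encoded_sizes are chosen for illustration only.

```python
# Sample code points: ASCII, Latin-1 range, rest of the BMP, supplementary plane.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    # The "-le" codec variants are used so that no byte order mark is counted.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-8 and UTF-16 are variable-width, while UTF-32 always uses four bytes per code point.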

– Goals
. Native unicode string type
. Distinct binary and native encoding string types
. unicode string literals
. updated language semantics
. upgrade existing functions, rather than create new ones
. backwards compatibility
. making simple things easy and complex things possible
. focus on functionality
. parity with Java’s unicode and i18n support

– ICU
. International Components for Unicode
. Why not our own solution?
. Lots of know-how is required
. Reinventing the wheel
. In the spirit of PHP: borrow when possible, …

– Why ICU?
. it exists, full-featured, robust, fast, proven, portable, extensible, open source, supported and maintained

– ICU Features:
. Unicode character properties
. unicode string class & text processing
. text transformations (normalization, upper/lowercase, etc)
…
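As a rough idea of why text transformations need Unicode data rather than a byte table, here is a case-mapping sketch. It uses Python's built-in string methods as a stand-in, not ICU itself, and the sample word is only an illustration.

```python
# U+00DF LATIN SMALL LETTER SHARP S has no single-character uppercase form.
word = "stra\u00dfe"

upper = word.upper()                # full case mapping expands one character to two
folded = word.casefold()            # case folding, intended for caseless matching

# Caseless comparison: fold both sides before comparing.
matches = folded == "STRASSE".casefold()

print(upper, folded, matches)
print(len(word), len(upper))        # the string grows by one character
```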

– Major Milestones
. Retrofitting the engine to support Unicode
. Making existing extensions Unicode-aware
. Exposing ICU API

– Let there be unicode
. a control switch called unicode_semantics
. per-request configuration setting
. no changes to program behavior unless enabled
. does not imply no Unicode at all when disabled

– String types
. existing string type: only one, overloaded and used for everything
. new string types:
. unicode: textual data (UTF-16 internally)
. binary: binary data and strings meant to be processed on the byte level
. native: for backward compatibility, …

– String literals
. With unicode_semantics=off, string literals are old-fashioned 8bit strings.
. 1 character = 1 byte

– Unicode string literals
. with unicode_semantics=on, string literals are of unicode type
. 1 char may be > 1 byte
. to obtain length in bytes one would use a separate function

– binary string literals
. binary string literals require new syntax
…

– Escape sequences
. Inside Unicode strings \uXXXX and \UXXXXXX escape sequences may be used to specify unicode code points explicitly
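Python has a similar pair of escapes, which gives a feel for how such literals behave. Note the difference: Python's long form takes eight hex digits, whereas the notes above describe six for PHP. The sketch is an analogy, not the PHP syntax.

```python
# Short form: exactly 4 hex digits, enough for the Basic Multilingual Plane.
bmp = "\u00e9"

# Long form: Python requires 8 hex digits (the talk describes \UXXXXXX for PHP).
supplementary = "\U0001d11e"

print(hex(ord(bmp)), hex(ord(supplementary)))

# Each escape produces exactly one code point, whatever its encoded size.
print(len(bmp), len(supplementary))
```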

– Script encoding
. encoding can be also changed with a pragma
. pragma does not propagate to included files

– Output encoding
. specifies the encoding for the stdout stream (echo, print)
. the script output is transcoded on the fly
. does not affect binary strings
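The same idea, text transcoded on the way out while binary data passes through untouched, can be sketched with Python's io module. The in-memory sink stands in for stdout, and ISO-8859-1 is an arbitrary choice of output encoding for this illustration.

```python
import io

sink = io.BytesIO()                                 # stands in for the stdout stream
out = io.TextIOWrapper(sink, encoding="iso-8859-1")

# Text written through the wrapper is transcoded to the output encoding.
out.write("caf\u00e9")
out.flush()

# Binary data written to the underlying buffer bypasses the transcoding layer.
out.buffer.write(b"\x89PNG")

result = sink.getvalue()
print(result)
```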

– Conversion Issues
. Not all characters can be converted between unicode and legacy encodings
. php will always attempt to convert as much of the data as possible
. the severity of the error issued by PHP depends on the type of the encountered ..
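Lossy conversion is easy to demonstrate. The Python sketch below converts text containing Japanese characters to ISO-8859-1, once in best-effort mode and once strictly. The error policies are Python's own and only approximate the behavior described above.

```python
# "cafe" with an acute accent, a space, then two kanji that Latin-1 cannot hold.
text = "caf\u00e9 \u65e5\u672c"

# Best effort: convert what can be converted, substitute '?' for the rest.
best_effort = text.encode("iso-8859-1", errors="replace")

# Strict: the first unconvertible character raises an error.
try:
    text.encode("iso-8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(best_effort, strict_ok)
```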

– Operator support
. string offset operator works on code points, not bytes
. no need to change existing code if you work only with single-byte encodings, like ASCII or ISO-8859-1
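The difference between code point offsets and byte offsets, and why single-byte encodings are unaffected, can be sketched in Python, where str indexes by code point and bytes indexes by byte. This is an analogy to the proposed operator, not PHP code.

```python
text = "\u00e9t\u00e9"              # e-acute, t, e-acute: 3 code points
utf8 = text.encode("utf-8")         # 5 bytes: the accented letters take 2 each
latin1 = text.encode("iso-8859-1")  # 3 bytes: one per character

by_code_point = text[1]             # second character
by_byte_utf8 = utf8[1:2]            # second byte, the tail of the first character
by_byte_latin1 = latin1[1:2]        # second byte, which is also the second character

print(by_code_point, by_byte_utf8, by_byte_latin1)
```

With a single-byte encoding the two kinds of offset coincide, so existing code keeps working.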

– Functions
. Default distribution of PHP has a few thousand functions
. Most of them use a parameter parsing API that accepts typed parameters
. the upgrade process can be alleviated by adjusting this API to perform automatic conversions
. the upgrade will be a continuous process that will require involvement from extension authors
. all functions should be analyzed to determine their semantics as applied to unicode strings
. a set of guidelines is essential

– Stream IO
. Streams will be in binary mode by default
. applications can manage unicode conversion explicitly with unicode_decode()
. or apply a conversion filter to the stream
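Both approaches, decoding explicitly and attaching a conversion filter, have close analogues in Python, sketched here with an in-memory stream. The calls are Python's, not the PHP functions named above, and UTF-8 is assumed as the stream encoding.

```python
import codecs
import io

# A binary stream holding UTF-8 encoded text.
stream = io.BytesIO("h\u00e9llo\n".encode("utf-8"))

# Option 1: read raw bytes and decode them explicitly.
data = stream.read()
explicit = data.decode("utf-8")

# Option 2: wrap the stream in a decoding reader, then read text directly.
stream.seek(0)
reader = codecs.getreader("utf-8")(stream)
filtered = reader.read()

print(type(data).__name__, explicit == filtered)
```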

– Unicode identifiers
. PHP will allow unicode characters in identifiers

– When can I have it?
. Now, if you want: it is in the CVS tree
. Most of the described functionality is available
