UTF-8: To Dream the Impossible Dream
I've spent the better part of a week trying to figure out how to do UTF-8 in PHP with data stored in MSSQL, MySQL, and PostgreSQL, and I'm here to report that it's just about impossible. First off, you have PHP, which is pretty much a mess when it comes to UTF-8, or really any character encoding. Some functions take charset params and some don't. There are two different but similar charset libraries (mbstring and iconv), both of which have issues, and neither is installed by default until PHP 5, and then only iconv, except on FreeBSD (or so I read). It's enough to drive you insane.
If you're building a custom or in-house application, there's enough there that you can make it work. But if you're trying to write a distributable application, it's very close to impossible, especially when you factor in the database.
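To make the mbstring/iconv situation concrete, here is a minimal sketch of the kind of wrapper a distributable app ends up needing. The function name `utf8_length` is made up for illustration, and the last fallback assumes the input is already well-formed UTF-8.

```php
<?php
// Hypothetical helper: count characters (not bytes) in a UTF-8 string,
// using whichever charset extension happens to be installed.
function utf8_length($str)
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($str, 'UTF-8');
    }
    if (function_exists('iconv_strlen')) {
        return iconv_strlen($str, 'UTF-8');
    }
    // Neither extension is available. Assuming well-formed UTF-8, every
    // character has exactly one byte that is not a continuation byte
    // (0x80-0xBF), so strip the continuation bytes and count what is left.
    return strlen(preg_replace('/[\x80-\xBF]/', '', $str));
}

$word = "h\xC3\xA9llo"; // "héllo" with é encoded as two UTF-8 bytes
echo strlen($word), "\n";      // 6 (bytes)
echo utf8_length($word), "\n"; // 5 (characters)
```

Every string function that cares about character boundaries (substr, strtolower, and so on) would need the same treatment, which is a big part of why this gets painful fast.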
See, each database handles things totally differently. First you have SQL Server, which doesn't store UTF-8 at all. Instead it stores UCS-2, so right there you have to convert your nice UTF-8 to UCS-2 before inserting, and convert it back when you do selects. Then you have MySQL, which has no support in the 3 series and no real support in 4 until 4.1, which is a pretty big limitation in terms of requirements. Finally you have PostgreSQL, which I honestly barely got to look into. It seems that it stores UTF-8 as long as you compile it with support for it. I'm not sure if that's the "standard" way to compile it or not.
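For the SQL Server case, the round trip would look roughly like the sketch below. The helper names are hypothetical, and the little-endian `UCS-2LE` byte order is an assumption; what your driver actually expects may differ. It also assumes at least one of mbstring or iconv is installed.

```php
<?php
// Hypothetical helpers for converting on the way into and out of the
// database. 'UCS-2LE' is an assumed byte order, not a confirmed requirement.
function utf8_to_ucs2($str)
{
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($str, 'UCS-2LE', 'UTF-8');
    }
    // Falls through to iconv; this fails if iconv is missing too.
    return iconv('UTF-8', 'UCS-2LE', $str);
}

function ucs2_to_utf8($str)
{
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($str, 'UTF-8', 'UCS-2LE');
    }
    return iconv('UCS-2LE', 'UTF-8', $str);
}

$utf8 = "h\xC3\xA9";           // "hé": 1 byte for h, 2 bytes for é
$ucs2 = utf8_to_ucs2($utf8);   // 2 bytes per character in UCS-2

echo strlen($utf8), "\n";                // 3
echo strlen($ucs2), "\n";                // 4
var_dump(ucs2_to_utf8($ucs2) === $utf8); // bool(true)
```

Doing this on every insert and every select, for every text column, is exactly the overhead that makes a single code path across all three databases so hard.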
Hence, I've fallen back to fixing up some issues HelpSpot currently has with ISO-8859-X encodings and making sure things work well on that front. Hopefully at some point in the future these things will start to come into line. The word on the street is that PHP 6 will make Unicode the native format, and by that time perhaps Microsoft will have a better way to handle it in SQL Server and the install base of MySQL 4.1+ will be big enough to make the switch.
In case anyone else is looking to make the UTF journey with PHP, here are a few links to some of the better resources I found:
- Great discussion of PHP issues with UTF-8
- Dokuwiki PHP UTF-8 library
- Some sample UTF-8 characters
- Textpattern's UTF page - they appear to have it mostly working, though they note that sorting and indexing may not always work correctly. Probably not a huge deal in an open source CMS, but that's not acceptable for HelpSpot, where sorting and filtering are perhaps the most important functions the system performs.