peteg's blog - AYAD - Project - 2007 12 06 HOPE Unicode

Lest I forget, Haskell and Unicode.

One reason I ran away from all of the CMSs implemented in PHP is PHP's (historically) crappy support for Unicode [*]. Standard Haskell, on the other hand, has required the Char type to be able to represent a Unicode codepoint for quite a while now. Unfortunately there are a few libraries that are not Unicode friendly, such as just about every library interfacing with C.


  • HSQL needed some work to get it to talk UTF-8 to PostgreSQL.
  • Most but not all of the CGI library is Unicode friendly. I don't know enough about the various RFCs to know what's encoded as what, so I don't know how to do this right. For example, how are Unicode filenames handled?
  • The regexp libs are a bit of a minefield (the user-interface is quite complex, and those C libraries are unknown quantities), so I have avoided using them.
  • HOPE itself is almost entirely encoding-agnostic, apart from the top-level (where it builds a CGI header for the webserver's consumption), and HaskellDB just punts around the strings fairly blindly, doing a minimal amount of escaping. Good job, Björn.
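
To make the "talk UTF-8" point concrete: a Haskell String is a list of codepoints, while PostgreSQL (or any C library) wants bytes, so something has to encode at the boundary. What follows is only a minimal sketch of that step using nothing but base. The names encodeChar and encodeUtf8 are made up for illustration; this is not the change that went into HSQL.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode a single codepoint as UTF-8 bytes (hypothetical helper).
-- Caveat: a Haskell Char can also hold a surrogate (U+D800..U+DFFF);
-- this sketch does not reject those, so its output is only valid
-- UTF-8 when the input contains no surrogates.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]                           -- 1 byte: ASCII
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                    -- 2 bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]           -- 3 bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]  -- 4 bytes
  where
    n = ord c
    -- Leading payload bits, shifted down into the lead byte.
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)
    -- A continuation byte: 10xxxxxx carrying six bits of the codepoint.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Encode a whole String, codepoint by codepoint.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar
```

With this, encodeUtf8 "Björn" gives the six bytes 42 6A C3 B6 72 6E: five characters, but the ö takes two bytes. A library that hands the String to C one Char per byte would send a single F6 there instead.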

I really, really wish Haskell had a decent story about character encoding at the I/O level. Back in 2002 people seemed to get really excited about doing something about it, but that mailing list is dead now. I guess the hope is that once ByteStrings and all that are bedded down, the I/O layer can be rebuilt on efficient foundations, with fusion taking care of performance issues with codec layers and so forth.

Update: ConradP has surveyed some Haskell character munging libraries.

[*] Perl has good Unicode support, if one is happy to play the guessing game as to what format each string is in. I feel that strong typing (clearly separating characters from strings of bytes) is just what is needed here.
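
As a rough illustration of the typing idea in the footnote, here is a small sketch in which raw bytes get their own type and the only way to turn them into a String is to name a codec. Everything here is hypothetical (the Octets type, decodeLatin1, encodeLatin1), and Latin-1 is used purely because its codec fits in a few lines, not because it is a good choice of encoding.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Raw bytes as they come off a socket, a file or a C library.
-- Wrapping them in a newtype means they cannot be passed where a
-- String is expected without an explicit decoding step.
newtype Octets = Octets [Word8]
  deriving (Eq, Show)

-- | Decoding Latin-1 cannot fail: every byte is a codepoint below 256.
decodeLatin1 :: Octets -> String
decodeLatin1 (Octets bs) = map (chr . fromIntegral) bs

-- | Encoding can fail: codepoints of 256 and above have no Latin-1
-- byte, so the result is Nothing if any character does not fit.
encodeLatin1 :: String -> Maybe Octets
encodeLatin1 = fmap Octets . mapM toByte
  where
    toByte c
      | ord c < 256 = Just (fromIntegral (ord c))
      | otherwise   = Nothing
```

The point is less the codec than the signatures: nothing of type Octets can be concatenated with a String by accident, and the possibility of failure when encoding shows up in the Maybe rather than as mangled output later on.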