One reason I ran away from all of the CMS systems implemented in
PHP is that language's (historically) crappy support for Unicode [*]. Standard
Haskell, on the other hand, has required the Char
type to be able to represent a Unicode codepoint for quite a while
now. Unfortunately there are a few libraries that are not Unicode
friendly, such as just about every library interfacing with C.
Concretely:
- HSQL needed some work to get it to talk UTF-8 to PostgreSQL; a sketch of the sort of shim involved follows this list.
- Most but not all of the CGI library is Unicode friendly. I don't know enough about the various RFCs to know what's encoded as what, so I don't know how to do this right. For example, how are Unicode filenames handled?
- The regexp libs are a bit of a minefield (the user-interface is quite complex, and those C libraries are unknown quantities), so I have avoided using them.
- HOPE itself is almost entirely encoding-agnostic, apart from the top-level (where it builds a CGI header for the webserver's consumption), and HaskellDB just punts around the strings fairly blindly, doing a minimal amount of escaping. Good job, Björn.
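To give an idea of what the HSQL fix amounts to, here is a minimal sketch (not the actual patch, and all names are mine): encode each Char to its UTF-8 bytes and smuggle them through an 8-bit-clean String, one Char per byte, which is what a C binding will generally pass along unmolested.

```haskell
-- Sketch only: UTF-8 encode, carrying each output byte as a Char < 256.
import Data.Bits ((.&.), (.|.), shiftR)
import Data.Char (chr, ord)

encodeUTF8 :: String -> String
encodeUTF8 = concatMap enc
  where
    enc c
      | n < 0x80    = [chr n]
      | n < 0x800   = [chr (0xC0 .|. shiftR n 6), cont n]
      | n < 0x10000 = [chr (0xE0 .|. shiftR n 12), cont (shiftR n 6), cont n]
      | otherwise   = [chr (0xF0 .|. shiftR n 18), cont (shiftR n 12),
                       cont (shiftR n 6), cont n]
      where n = ord c
    -- continuation byte: 10xxxxxx
    cont n = chr (0x80 .|. (n .&. 0x3F))
```

The idea is simply to call encodeUTF8 on every string before it crosses the FFI boundary (and a mirror-image decoder on the way back); the rest is plumbing.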
I really, really wish Haskell had a decent story about character
encoding at the I/O level. Back in 2002 people seemed to
get really excited about doing something about it, but that
mailing list is dead now. I guess the hope is that once
ByteStrings and all that are bedded down, the I/O layer can be rebuilt on
efficient foundations, and fusion will take care of performance issues with
codec layers and so forth.
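To make the complaint concrete, here is roughly the shape of thing I want the I/O layer to hand me for free, sketched today as a binary-mode read followed by a hand-rolled UTF-8 decode. The function names are mine, not any real library's, and the decoder skips validation it really ought to do.

```haskell
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr, ord)
import System.IO (IOMode (ReadMode), hGetContents, openBinaryFile)

-- Read a file in binary mode (one byte per Char) and decode it as UTF-8.
readFileUTF8 :: FilePath -> IO String
readFileUTF8 path = do
  h <- openBinaryFile path ReadMode
  fmap decodeUTF8 (hGetContents h)

-- Decode byte-per-Char UTF-8 input. Stray or truncated sequences become
-- U+FFFD; continuation bytes are not otherwise validated in this sketch.
decodeUTF8 :: String -> String
decodeUTF8 [] = []
decodeUTF8 (c : cs)
  | b < 0x80  = chr b : decodeUTF8 cs
  | b < 0xC0  = '\xFFFD' : decodeUTF8 cs              -- stray continuation
  | b < 0xE0  = multi 1 (b .&. 0x1F) cs
  | b < 0xF0  = multi 2 (b .&. 0x0F) cs
  | otherwise = multi 3 (b .&. 0x07) cs
  where
    b = ord c
    multi n acc rest = case splitAt n rest of
      (conts, rest') | length conts == n ->
        chr (foldl (\a x -> shiftL a 6 .|. (ord x .&. 0x3F)) acc conts)
          : decodeUTF8 rest'
      _ -> '\xFFFD' : decodeUTF8 rest                 -- truncated sequence
```

Ideally the codec would live behind the Handle and be chosen per handle, with fusion making the byte-shuffling cheap; the point of the sketch is just how much of this one currently has to do by hand.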
Update: ConradP has surveyed some Haskell character munging libraries.
[*] perl has good Unicode support, if one is happy to play the guessing game as to what format each string is in. I feel that strong typing — clearly separating characters from strings of bytes — is just what is needed here.
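To spell out that footnote, here is the sort of type discipline I have in mind, with illustrative names of my own; Latin-1 stands in as the world's shortest codec.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Raw octets and decoded text get distinct types...
newtype Bytes = Bytes [Word8]
newtype Text  = Text  String

-- ...so the only way across the boundary is an explicit codec. Decoding
-- Latin-1 cannot fail; encoding to it can, and the types say so.
decodeLatin1 :: Bytes -> Text
decodeLatin1 (Bytes ws) = Text (map (chr . fromIntegral) ws)

encodeLatin1 :: Text -> Maybe Bytes
encodeLatin1 (Text cs)
  | all ((< 0x100) . ord) cs = Just (Bytes (map (fromIntegral . ord) cs))
  | otherwise                = Nothing
```

With something like this in place, handing raw bytes to code that wants text is a type error rather than mojibake, which is all I'm really asking for.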