One of the fun bits about this project is the text munging that comes with it. The regexp libraries for Haskell have super-sophisticated do-what-I-think-you-mean interfaces but too few simple use cases in the docs. Couple that with my concerns about Unicode support and I'm stuck doing it the very old-fashioned way.
OK, enough editorialising; I've written a mostly-{RFC 4180, Haskell 98}-compliant lazy CSV parser that appears to work OK on reasonable-sized inputs. Existing solutions use Parsec, whose return type seems to guarantee that more-or-less the entire output must reside in memory at some point. This might be OK for small files, but the 6 MB of Unicode data I need to import consumes a ridiculous amount of memory, even with GHC's optimiser going full-bore.
You can find it here. The licence is BSD. Couple it with the appropriate utf8-string for your GHC and it works well on UTF-8-encoded files.
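To give a flavour of the approach (this is a minimal sketch, not the actual library code): a hand-rolled recursive parser over `String` stays lazy for free, because each `let`-bound `(row, rest)` pair is only forced as far as the consumer demands. Quoted fields, embedded commas, and doubled-quote escapes are handled per RFC 4180; the names below are my own invention.

```haskell
type Row = [String]

-- Lazily split a CSV document into rows of fields. Consuming only the
-- head of the result forces only the first row of input.
parseCsv :: String -> [Row]
parseCsv [] = []
parseCsv s  = let (row, rest) = parseRow s in row : parseCsv rest

-- Parse one record up to (and consuming) its line terminator.
parseRow :: String -> (Row, String)
parseRow s =
  let (field, rest) = parseField s
  in case rest of
       ',' : r         -> let (fs, r') = parseRow r in (field : fs, r')
       '\r' : '\n' : r -> ([field], r)
       '\n' : r        -> ([field], r)
       []              -> ([field], [])
       _               -> ([field], rest)  -- shouldn't happen on well-formed input

-- A field is either quoted or runs until the next delimiter.
parseField :: String -> (String, String)
parseField ('"' : s) = parseQuoted s
parseField s         = break (`elem` ",\r\n") s

-- Inside quotes: "" is an escaped quote, a lone " closes the field.
parseQuoted :: String -> (String, String)
parseQuoted ('"' : '"' : s) = let (f, r) = parseQuoted s in ('"' : f, r)
parseQuoted ('"' : s)       = ([], s)
parseQuoted (c : s)         = let (f, r) = parseQuoted s in (c : f, r)
parseQuoted []              = ([], [])
```

For example, `parseCsv "a,b\n\"c,d\",e\n"` yields `[["a","b"],["c,d","e"]]`, and `head (parseCsv hugeFile)` touches only the first line of `hugeFile` — which is exactly the property Parsec's all-at-once result type gives up.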
Now, to track down a nasty memory leak somewhere in the database code... the profiler tells me SYSTEM is hanging onto some stuff, but not what SYSTEM actually is. Err, what did Fergus say again?