Question

readFile "file.html"
"start of the file... *** Exception: file.html: hGetContents: invalid argument (invalid code page byte sequence)

This is a UTF-8 file created with Notepad++... How can I read the file in Haskell?

Answer 1

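By default, files are read using the system locale's encoding, so if your file uses a different encoding, you need to set the encoding on the file handle yourself.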

import System.IO

foo = do
    handle <- openFile "file.html" ReadMode
    hSetEncoding handle utf8_bom
    contents <- hGetContents handle
    doSomethingWith contents  -- consume contents before closing the handle
    hClose handle

That should get you started. Note that it has no error handling, so a better approach is

import Control.Exception (bracket)
import System.IO

foo = bracket
        (openFile "file.html" ReadMode >>= \h -> hSetEncoding h utf8_bom >> return h)
        hClose
        (\h -> hGetContents h >>= doSomethingWith)

or

foo = withFile "file.html" ReadMode $
        \h -> do hSetEncoding h utf8_bom
                 contents <- hGetContents h
                 doSomethingWith contents
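The withFile variant above can be turned into a small complete program. This is only a sketch: `readUtf8File` and the file name `sample.html` are hypothetical names chosen for illustration, and the sample file is written first so that the example does not depend on an existing file. Because hGetContents is lazy, the sketch forces the whole string before withFile closes the handle.

```haskell
import System.IO

-- Hypothetical helper: read a whole file as UTF-8, skipping a leading BOM.
-- The contents are forced before the handle is closed, because
-- hGetContents reads lazily.
readUtf8File :: FilePath -> IO String
readUtf8File path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h utf8_bom
    contents <- hGetContents h
    length contents `seq` return contents

main :: IO ()
main = do
  -- Write a small sample file (with a BOM) so the example is self-contained.
  withFile "sample.html" WriteMode $ \h -> do
    hSetEncoding h utf8_bom
    hPutStr h "<p>\269</p>\n"   -- \269 is U+010D, c with caron
  text <- readUtf8File "sample.html"
  print text
```

With utf8_bom the BOM is skipped on input and prepended on output, so the string returned by `readUtf8File` starts directly with the file's text.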

Answer 2

According to this site, your 6 bytes decode as follows:

EF BB BF -> ZERO WIDTH NO-BREAK SPACE (i.e. the BOM, although it's not needed in UTF-8)
C4 8D    -> LATIN SMALL LETTER C WITH CARON (what you said)
0D       -> CARRIAGE RETURN (CR)

So it is a valid UTF-8 sequence.
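The byte table above can be checked mechanically. The following is a minimal sketch, assuming the bytestring and text packages are available (they normally ship with GHC); the names `sampleBytes` and `checkBytes` are made up for this example.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The six bytes from the question: BOM, U+010D, carriage return.
sampleBytes :: B.ByteString
sampleBytes = B.pack [0xEF, 0xBB, 0xBF, 0xC4, 0x8D, 0x0D]

-- Try to decode the bytes as UTF-8. On success, return the code points;
-- on failure, return the decoder's error message.
checkBytes :: Either String [Int]
checkBytes =
  case TE.decodeUtf8' sampleBytes of
    Left err -> Left (show err)
    Right t  -> Right (map fromEnum (T.unpack t))

main :: IO ()
main = print checkBytes
```

This prints Right [65279,269,13], i.e. U+FEFF (the BOM), U+010D and CR. Note that decodeUtf8' does not strip the BOM; it simply decodes it as an ordinary character.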

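However, the standard Prelude functions originally only handled ASCII. I don't know what they do now, but see the question "How does GHC/Haskell decide what character encoding it's going to decode/encode from/to?" for more ideas. Then use http://hackage.haskell.org/package/utf8-string instead of the Prelude functions.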
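To see which encoding GHC currently picks by default, the GHC.IO.Encoding module in base can be queried directly. The sketch below is not the utf8-string approach suggested above; it is a base-only alternative, and `useUtf8ByDefault` is a hypothetical helper name. It assumes that changing the default only affects handles opened afterwards.

```haskell
import GHC.IO.Encoding (getLocaleEncoding, setLocaleEncoding, utf8)

-- Hypothetical helper: switch the default text encoding to UTF-8 and
-- report the encoding names before and after the change.
useUtf8ByDefault :: IO (String, String)
useUtf8ByDefault = do
  before <- getLocaleEncoding
  setLocaleEncoding utf8
  after <- getLocaleEncoding
  return (show before, show after)

main :: IO ()
main = do
  (before, after) <- useUtf8ByDefault
  putStrLn ("Default encoding was: " ++ before)
  putStrLn ("Default encoding now: " ++ after)
```

Unlike utf8_bom, plain utf8 does not skip a leading BOM, so a file saved with a BOM would still begin with U+FEFF when read this way.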
