Question

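When I tried to read a UTF-8 text file into Text, I used Data.Text.IO.readFile.
However, it does not work when the locale of the system environment is not *.UTF8 (especially C).
It fails with:

hGetContents: invalid argument (invalid byte sequence)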

Yes, I have read the locale support section in the documentation of Data.Text.IO. It says the behaviour depends on the settings of the system environment.

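So I tried using System.IO.hSetEncoding h System.IO.utf8_bom together with Data.Text.IO.hGetContents. The same call works when I use it with System.IO.hGetContents.

With Data.Text.IO.hGetContents, however, I get:

text: <stdout>: commitAndReleaseBuffer: invalid argument (invalid character)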

Is there no way to handle the encoding with Data.Text.IO.hGetContents or Data.Text.IO.readFile without changing system environment variables such as LANG? A solution that only edits the Haskell code is preferred.

Here is my original code:

import qualified Data.Text as T
import qualified Data.Text.IO as T

main = do
  text <- T.readFile "./data.md"
  T.putStrLn text

Here is my trial code:

import qualified Data.Text as T
import qualified Data.Text.IO as T
import System.IO

main = do
  h <- System.IO.openFile "./data.md" System.IO.ReadMode
  System.IO.hSetEncoding h System.IO.utf8_bom
  text <- T.hGetContents h -- `System.IO.hGetContents h` works!
  T.putStrLn text

Both work when the system locale is *.UTF8 and fail in other environments.

Tested environment:

Linux (Ubuntu 14.04)
GHC 7.10.3
text 1.2.2.0

Answer 1

I get a different error:

<stdout>: hPutChar: invalid argument (invalid character)

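I get the same error even with System.IO.hGetContents. I am not sure why the behaviour differs for you. (I am using ghc-7.10.2 and text-1.2.1.3.)

To answer the question: you are trying to send a UTF-8-encoded string to stdout, which is configured for ASCII. I am not sure what it is supposed to output in that case.

If your terminal actually accepts UTF-8, you can configure stdout to ignore the current locale and accept UTF-8: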

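import qualified Data.Text.IO as T
import System.IO

main = do
  h <- System.IO.openFile "./data.md" System.IO.ReadMode
  -- On input, utf8_bom skips a leading byte-order mark if the file has one.
  System.IO.hSetEncoding h System.IO.utf8_bom
  text <- T.hGetContents h
  -- On output, utf8_bom prepends a BOM; use System.IO.utf8 if that is unwanted.
  System.IO.hSetEncoding stdout System.IO.utf8_bom
  T.hPutStrLn stdout text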
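
If you would rather keep calling T.readFile than open every handle by hand, one option is to override the process-wide default encoding at startup. The sketch below is not taken from the answer above. It assumes setLocaleEncoding from GHC.IO.Encoding (part of base), which changes the encoding given to handles opened afterwards; stdout is set explicitly as well in case it was already initialised. Note that plain utf8 does not strip a leading BOM on input.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as T
import GHC.IO.Encoding (setLocaleEncoding)
import System.IO (hSetEncoding, stdout, utf8)

main :: IO ()
main = do
  -- Replace the locale-derived default; handles opened after this use UTF-8.
  setLocaleEncoding utf8
  -- Set stdout explicitly too, in case it was created before the call above.
  hSetEncoding stdout utf8
  text <- T.readFile "./data.md"
  T.putStrLn text
```

This only helps if the terminal really does accept UTF-8; otherwise the bytes are written correctly but displayed wrongly.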

Answer 2

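The proper way to do this is to read the file with bytestring and use text-icu for the {en,de}coding. (The text documentation mentions this: "To use an extended and very rich family of functions for working with Unicode text ..., see the text-icu package".) For example, the following Haskell file reads my test file correctly with both LANG=en_US.utf8 and LANG=C: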

import qualified Data.ByteString as BS
import qualified Data.Text.ICU.Convert as ICU

import System.IO

main = do
    -- dunno what the Nothing argument is for, read the docs!
    conv <- ICU.open "utf-8" Nothing
    h    <- openFile "test.txt" System.IO.ReadMode
    bs   <- BS.hGetContents h
    print (ICU.toUnicode conv bs)

N.B. I call print rather than T.putStrLn, because output to my terminal would depend on the locale!
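
If the file is known to be UTF-8, the same bytes-first approach can be sketched without text-icu, using decodeUtf8' from Data.Text.Encoding in the text package itself. This is an alternative to the ICU converter above, not what the answer uses; the helper name decodeUtf8Bytes is made up for illustration. Reading through ByteString never consults the locale, and decodeUtf8' returns a Left on invalid input instead of throwing.

```haskell
import qualified Data.ByteString as BS
import Data.Maybe (fromMaybe)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- | Decode raw bytes as UTF-8 and drop a leading BOM (U+FEFF) if present.
--   (Hypothetical helper, named here for illustration only.)
decodeUtf8Bytes :: BS.ByteString -> Either UnicodeException T.Text
decodeUtf8Bytes bs = stripBom <$> TE.decodeUtf8' bs
  where
    stripBom t = fromMaybe t (T.stripPrefix (T.singleton '\xFEFF') t)

main :: IO ()
main = do
  bs <- BS.readFile "test.txt"
  case decodeUtf8Bytes bs of
    Left err   -> putStrLn ("not valid UTF-8: " ++ show err)
    -- print escapes non-ASCII characters, so this does not depend on the locale.
    Right text -> print text
```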

Is there any way to handle encoding with Data.Text.IO.hGetContents?
