Turtle: handling non-UTF-8 input

Time: 2017-06-15 11:21:40

Tags: haskell character-encoding haskell-pipes haskell-turtle

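While learning Pipes, I ran into problems when dealing with non-UTF-8 files. That is why I took a detour into the Turtle library, to try to understand how to solve the problem there, at a higher level of abstraction.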

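The exercise I want to do is quite simple: find the total number of lines in all the regular files reachable from a given directory. This is readily implemented by the following shell command: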

find $FPATH -type f -print | xargs cat | wc -l

I came up with the following solution:

import qualified Control.Foldl as F
import qualified Turtle        as T

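-- | Returns true iff the file path is not a symlink.
-- Uses `lstat` rather than `stat`: `stat` follows symbolic links, so
-- `isSymbolicLink` would never be true for its result.
noSymLink :: T.FilePath -> IO Bool
noSymLink fPath = (not . T.isSymbolicLink) <$> T.lstat fPath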

-- | Shell that outputs the regular files in the given directory.
regularFilesIn :: T.FilePath -> T.Shell T.FilePath
regularFilesIn fPath = do
  fInFPath <- T.lsif noSymLink fPath
  st <- T.stat fInFPath
  if T.isRegularFile st
    then return fInFPath
    else T.empty

-- | Read lines of `Text` from all the regular files under the given directory
-- path.
inputDir :: T.FilePath -> T.Shell T.Line
inputDir fPath = do
  file <- regularFilesIn fPath
  T.input file

-- | Print the number of lines in all the files in a directory.
printLinesCountIn :: T.FilePath -> IO ()
printLinesCountIn fPath = do
  count <- T.fold (inputDir fPath) F.length
  print count

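This solution gives the correct result as long as there are no non-UTF-8 files in the directory. Otherwise, the program raises an exception like the following: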

*** Exception: test/resources/php_ext_syslog.h: hGetLine: invalid argument (invalid byte sequence)

This is to be expected, since:

$ file -I test/resources/php_ext_syslog.h
test/resources/php_ext_syslog.h: text/x-c; charset=iso-8859-1
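To illustrate why an ISO-8859-1 file trips a UTF-8 decoder, here is a minimal sketch using only the `text` and `bytestring` packages. The sample bytes are hypothetical (they are not taken from `php_ext_syslog.h`): in ISO-8859-1 the single byte 0xE9 encodes 'é', whereas in UTF-8 it announces a three-byte sequence, so a strict decoder rejects it.

```haskell
import qualified Data.ByteString    as BS
import           Data.Either        (isLeft)
import           Data.Text          (Text)
import qualified Data.Text          as Text
import           Data.Text.Encoding (decodeLatin1, decodeUtf8')

-- | Hypothetical sample: "café\n" encoded as ISO-8859-1.
-- 0xE9 is 'é' in Latin-1, but here it is not part of a valid UTF-8 sequence.
latin1Sample :: BS.ByteString
latin1Sample = BS.pack [0x63, 0x61, 0x66, 0xE9, 0x0A]

-- | Strict UTF-8 decoding reports an error for these bytes.
rejectedAsUtf8 :: Bool
rejectedAsUtf8 = isLeft (decodeUtf8' latin1Sample)

-- | Latin-1 decoding cannot fail: every byte maps to one code point.
decodedAsLatin1 :: Text
decodedAsLatin1 = decodeLatin1 latin1Sample

-- | Number of characters after decoding (one per input byte).
decodedLength :: Int
decodedLength = Text.length decodedAsLatin1
```

The same mismatch is presumably what `hGetLine` hits when the handle's encoding is UTF-8 and the file is not.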

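I would like to know how to read files in different encodings into Text so that the program can cope with this. For the problem at hand I suppose I could avoid the conversion to Text altogether, but I would rather know how to do it, since you can imagine a situation in which, for instance, I want to build a set of all the words under a certain directory.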
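As a rough sketch of the word-set scenario, one option is to try strict UTF-8 first and fall back to ISO-8859-1 when that fails. The names `decodeGuess` and `wordSet` are hypothetical, and the fallback is only a heuristic: Latin-1 decoding never fails, so a file in some other legacy encoding would be decoded silently but incorrectly. Each `ByteString` is assumed to hold the complete contents of one file (for example from `Data.ByteString.readFile`), not a chunk of it.

```haskell
import qualified Data.ByteString    as BS
import           Data.Set           (Set)
import qualified Data.Set           as Set
import           Data.Text          (Text)
import qualified Data.Text          as Text
import           Data.Text.Encoding (decodeLatin1, decodeUtf8')

-- | Decode as UTF-8 when the bytes are valid UTF-8; otherwise assume
-- ISO-8859-1. This is a heuristic, not real encoding detection.
decodeGuess :: BS.ByteString -> Text
decodeGuess bs = either (const (decodeLatin1 bs)) id (decodeUtf8' bs)

-- | Set of all whitespace-separated words in the given file contents.
-- Assumption: each ByteString is one whole file, not a partial chunk.
wordSet :: [BS.ByteString] -> Set Text
wordSet = Set.fromList . concatMap (Text.words . decodeGuess)
```

Decoding whole files rather than chunks avoids splitting a multi-byte character or a word across a chunk boundary, at the cost of holding each file in memory.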

Edit

So far, the only solution I have been able to come up with is the following:

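-- Additional imports needed by this version (they belong with the imports above).
import           Data.ByteString          (ByteString)
import qualified Data.ByteString          as BS
import qualified Data.List.NonEmpty       as NE
import           Data.Text.Encoding       (Decoding (Some), streamDecodeUtf8With)
import           Data.Text.Encoding.Error (lenientDecode)
import qualified Turtle.Bytes             as TB

mDecodeByteString :: T.Shell ByteString -> T.Shell T.Text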
mDecodeByteString = gMDecodeByteString (streamDecodeUtf8With lenientDecode)
  where gMDecodeByteString :: (ByteString -> Decoding)
                             -> T.Shell ByteString
                             -> T.Shell T.Text
        gMDecodeByteString f bss = do
          bs <- bss
          let Some res bs' g = f bs
          if BS.null bs'
            then return res
            else gMDecodeByteString g bss

inputDir' :: T.FilePath -> T.Shell T.Line
inputDir' fPath = do
  file <- regularFilesIn fPath
  text <- mDecodeByteString (TB.input file)
  T.select (NE.toList $ T.textToLines text)

-- | Print the number of lines in all the files in a directory. Using a more
-- robust version of `inputDir`.
printLinesCountIn' :: T.FilePath -> IO ()
printLinesCountIn' fPath = do
  count <- T.fold (inputDir' fPath) T.countLines
  print count

The problem is that this counts one extra line per file, but at least it can decode non-UTF-8 ByteStrings.
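One plausible source of the extra line, assuming `T.textToLines` splits on "\n" the way `Data.Text.splitOn` does, is that text ending in a newline yields a trailing empty piece. The sketch below shows the difference on a hypothetical two-line sample, together with a byte-level count that needs no decoding at all: `wc -l` counts newline bytes, and the byte 0x0A means newline in both UTF-8 and ISO-8859-1.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text       as Text

-- | Hypothetical file contents: two lines, each terminated by a newline.
sampleText :: Text.Text
sampleText = Text.pack "a\nb\n"

-- | Splitting on "\n" yields a trailing empty piece: ["a", "b", ""].
splitCount :: Int
splitCount = length (Text.splitOn (Text.pack "\n") sampleText)

-- | `Text.lines` does not produce the trailing piece: ["a", "b"].
linesCount :: Int
linesCount = length (Text.lines sampleText)

-- | Count newline bytes directly, the way `wc -l` does; no decoding needed.
newlineCount :: BS.ByteString -> Int
newlineCount = BS.count 0x0A
```

Because `newlineCount` is additive over chunks, summing it over the chunks of a file should also sidestep lines that straddle a chunk boundary.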

0 answers:

No answers yet