Haskell attoparsec fails to recognize newline characters when parsing from a text file

Asked: 2018-05-08 08:27:02

Tags: haskell attoparsec

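I have been trying to parse a .txt file containing English text. My code is meant to return the number of paragraphs in that file. For some reason, attoparsec does not seem to recognize line breaks or other whitespace characters such as \n, \r and \t. My code is below. I also tried many1 (satisfy (inClass "\n\r\t")), but still had no luck. What do you think the underlying problem is? Here is also the link to the sample text file I have been testing with.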

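import Data.Attoparsec.Text
import qualified Data.Text as T
import qualified Data.Text.IO as Txt
-- (<|>) is not exported by the Prelude, so it has to be imported explicitly
import Control.Applicative ((<|>))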

newtype Prose = Prose {
  word :: [Char]
}

instance Show Prose where
  show a = word a

optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

specialChars = ['-', '_', '…', '“', '”', '\"', '\'', '’', '@', '#', '$',
                '%', '^', '&', '*', '(', ')', '+', '=', '~', '`', '{', '}',
                '[', ']', '/', ':', ';', ',']

inputPara :: Parser Prose
inputPara = Prose <$> many1' (letter <|> digit <|> space <|> satisfy (inClass specialChars) <|> satisfy (inClass "――.?!") )

paraSeparator :: Parser ()
paraSeparator = many1 (satisfy (isEndOfLine) <|> satisfy (isHorizontalSpace)) >> pure ()

paraParser :: String -> [Prose]
paraParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
    where
      wp = optional paraSeparator *> inputPara `sepBy1` paraSeparator

main :: IO()
main = do
  input <- readFile "test.txt"
  let para = paraParser input
  print para
  print $ length para

1 answer:

Answer 0 (score: 0)

The problem is the space parser in the following line:

inputPara = Prose <$> many1' (letter <|> digit <|> space <|> satisfy (inClass specialChars) <|> satisfy (inClass "――.?!") )

It matches characters such as \n, \r and \t as well (any character for which isSpace returns True).

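That is why inputPara consumes the entire text as a single paragraph: the line breaks are swallowed by inputPara itself, so paraSeparator never gets a chance to match.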
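The behaviour can be checked in isolation. The following is a minimal sketch, not part of the original post: it compares space with satisfy isHorizontalSpace on a short hand-written input. The sample strings are made up for illustration.

```haskell
import Data.Attoparsec.Text (parseOnly, many1, space, satisfy, isHorizontalSpace)
import Data.Char (isSpace)
import qualified Data.Text as T

main :: IO ()
main = do
  -- isSpace (the predicate behind `space`) is True for line breaks and tabs,
  -- not only for the ' ' character
  print (map isSpace "\n\r\t ")
  -- so `many1 space` consumes the whole run, line breaks included
  print (parseOnly (many1 space) (T.pack "\n\r\t "))
  -- isHorizontalSpace accepts only ' ' and '\t', so this parser stops
  -- in front of the first line break
  print (parseOnly (many1 (satisfy isHorizontalSpace)) (T.pack " \t\nrest"))
```

Since isHorizontalSpace rejects line breaks, using satisfy isHorizontalSpace inside the paragraph parser is another way to keep tabs and spaces in a paragraph without consuming the separators.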

One possible solution is to remove the space parser from inputPara and add the ' ' character to specialChars instead.

For example, the following code should work, but feel free to choose whichever option suits you best:

import Data.Attoparsec.Text
import qualified Data.Text as T
import qualified Data.Text.IO as Txt
import Control.Applicative ((<|>))

newtype Prose = Prose {
  word :: [Char]
}

instance Show Prose where
  show a = word a

optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

specialChars = ['-', '_', '…', '“', '”', '\"', '\'', '’', '@', '#', '$',
                '%', '^', '&', '*', '(', ')', '+', '=', '~', '`', '{', '}',
                '[', ']', '/', ':', ';', ',', ' ']

inputPara :: Parser Prose
inputPara = Prose <$> many1' (letter <|> digit <|> satisfy (inClass specialChars) <|> satisfy (inClass "――.?!") )

paraSeparator :: Parser [Char]
paraSeparator = many1 space

paraParser :: String -> [Prose]
paraParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
    where
      wp = optional paraSeparator *> inputPara `sepBy1` paraSeparator

main :: IO()
main = do
  input <- readFile "test.txt"
  let para = paraParser input
  print para
  print $ length para
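
As one of the other options, here is a rough sketch that avoids the character whitelist altogether. It rests on assumptions that are not in the original post: as in the code above, every non-blank line counts as one paragraph, and the names paragraph, separator, paragraphs and countParagraphs are hypothetical helpers introduced only for this illustration.

```haskell
import Data.Attoparsec.Text
  ( Parser, parseOnly, takeWhile1, satisfy, skipWhile, skipMany1
  , skipSpace, sepBy, endOfInput, isEndOfLine, isHorizontalSpace )
import qualified Data.Text as T

-- A paragraph is a non-empty run of characters that are not line breaks,
-- so tabs and any punctuation are accepted without listing them.
paragraph :: Parser T.Text
paragraph = takeWhile1 (not . isEndOfLine)

-- A separator is one or more line breaks; spaces and tabs after each break
-- are skipped, so whitespace-only lines do not count as paragraphs.
separator :: Parser ()
separator = skipMany1 (satisfy isEndOfLine *> skipWhile isHorizontalSpace)

-- Leading and trailing whitespace is ignored; endOfInput makes sure that
-- nothing is silently left unparsed.
paragraphs :: Parser [T.Text]
paragraphs = skipSpace *> (paragraph `sepBy` separator) <* skipSpace <* endOfInput

-- Returns the parse error instead of calling `error`.
countParagraphs :: T.Text -> Either String Int
countParagraphs = fmap length . parseOnly paragraphs

main :: IO ()
main = do
  -- Hand-written sample input, used here instead of test.txt
  let sample = T.pack "First paragraph.\n\nSecond one, with a tab:\tdone.\r\n  \nThird.\n"
  print (countParagraphs sample)
```

To run this against the file from the question, the sample could be replaced with the result of Txt.readFile "test.txt" from Data.Text.IO, which yields Text directly and makes the T.pack step unnecessary.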