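I have been trying to parse a .txt file containing English text. My code is supposed to return the number of paragraphs in the file. For some reason, attoparsec does not seem to recognize newlines or other characters such as \n, \r, and \t. My code is below. I also tried many1 (satisfy (inClass "\n\r\t")), but still had no luck. What do you think the underlying problem is? Here is also a link to the sample text file I have been testing with.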
import Data.Attoparsec.Text
import qualified Data.Text as T
import qualified Data.Text.IO as Txt
import Control.Applicative ((<|>))

newtype Prose = Prose {
  word :: [Char]
}

instance Show Prose where
  show a = word a

-- Run a parser if it matches; discard its result either way.
optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

specialChars = ['-', '_', '…', '“', '”', '\"', '\'', '’', '@', '#', '$',
                '%', '^', '&', '*', '(', ')', '+', '=', '~', '`', '{', '}',
                '[', ']', '/', ':', ';', ',']

-- One paragraph: letters, digits, whitespace, and punctuation.
inputPara :: Parser Prose
inputPara = Prose <$> many1' (letter <|> digit <|> space <|> satisfy (inClass specialChars) <|> satisfy (inClass "――.?!") )

-- Line endings and horizontal whitespace between paragraphs.
paraSeparator :: Parser ()
paraSeparator = many1 (satisfy (isEndOfLine) <|> satisfy (isHorizontalSpace)) >> pure ()

paraParser :: String -> [Prose]
paraParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
  where
    wp = optional paraSeparator *> inputPara `sepBy1` paraSeparator

main :: IO ()
main = do
  input <- readFile "test.txt"
  let para = paraParser input
  print para
  print $ length para
Answer 0 (score: 0)
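The problem is the space parser in the following line:
inputPara = Prose <$> many1' (letter <|> digit <|> space <|> satisfy (inClass specialChars) <|> satisfy (inClass "――.?!") )
space matches characters such as \n, \r, and \t (each of them satisfies isSpace). Because inputPara consumes the line endings itself, paraSeparator never gets a chance to match, which is why inputPara matches the whole text as a single paragraph.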
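To see this in isolation, here is a minimal sketch that checks the characters from the question against isSpace and then runs space over them. The sample string is made up for illustration only.

```haskell
import Data.Attoparsec.Text (many1, parseOnly, space)
import Data.Char (isSpace)
import qualified Data.Text as T

main :: IO ()
main = do
  -- Newline, carriage return, tab, and blank all satisfy isSpace ...
  print (map isSpace "\n\r\t ")
  -- ... so `space` accepts every one of them and stops only at 'r'.
  print (parseOnly (many1 space) (T.pack "\n\r\t rest"))
```

This should print [True,True,True,True] followed by Right "\n\r\t ", which shows that any parser built on space will also swallow line endings.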
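One possible fix is to remove space from the inputPara parser and add the ' ' character to specialChars, so that a paragraph may contain plain blanks but not line endings.
For example, the following code should work, but feel free to choose whichever option suits you best: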
import Data.Attoparsec.Text
import qualified Data.Text as T
import qualified Data.Text.IO as Txt
import Control.Applicative ((<|>))

newtype Prose = Prose {
  word :: [Char]
}

instance Show Prose where
  show a = word a

-- Run a parser if it matches; discard its result either way.
optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

-- Note the ' ' at the end: a plain blank is now an ordinary paragraph character.
specialChars = ['-', '_', '…', '“', '”', '\"', '\'', '’', '@', '#', '$',
                '%', '^', '&', '*', '(', ')', '+', '=', '~', '`', '{', '}',
                '[', ']', '/', ':', ';', ',', ' ']

-- One paragraph: no `space` here, so line endings are left for paraSeparator.
inputPara :: Parser Prose
inputPara = Prose <$> many1' (letter <|> digit <|> satisfy (inClass specialChars) <|> satisfy (inClass "――.?!") )

-- Any run of whitespace (including \n, \r, \t) separates paragraphs.
paraSeparator :: Parser [Char]
paraSeparator = many1 space

paraParser :: String -> [Prose]
paraParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
  where
    wp = optional paraSeparator *> inputPara `sepBy1` paraSeparator

main :: IO ()
main = do
  input <- readFile "test.txt"
  let para = paraParser input
  print para
  print $ length para
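As one alternative, if the list of allowed characters is not important, a paragraph could be defined simply as a non-empty run of characters that are not line endings. The sketch below assumes, like the code above, that each non-empty line is one paragraph; the names paragraph, separator, and paragraphs, as well as the sample text, are made up for illustration.

```haskell
import Data.Attoparsec.Text
import qualified Data.Text as T

-- A paragraph: one or more characters up to the next '\n' or '\r'.
paragraph :: Parser T.Text
paragraph = takeWhile1 (not . isEndOfLine)

-- A separator: one or more line-ending characters.
separator :: Parser T.Text
separator = takeWhile1 isEndOfLine

-- Skip leading and trailing line endings and require the whole input
-- to be consumed.
paragraphs :: Parser [T.Text]
paragraphs =
  skipWhile isEndOfLine
    *> (paragraph `sepBy` separator)
    <* skipWhile isEndOfLine
    <* endOfInput

main :: IO ()
main = do
  let sample = T.pack "First paragraph.\n\nSecond one, with a comma.\r\nThird.\n"
  case parseOnly paragraphs sample of
    Left err -> putStrLn ("parse error: " ++ err)
    Right ps -> print (length ps)
```

With this sample the count should be 3. Note that under this definition a line containing only blanks or tabs would still count as a paragraph, so it may need tightening for real input.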