Question

我在Haskell中使用ReadP为文件中的数字列表做了一个非常简单的解析器。它有效，但它很慢......这种解析器的这种正常行为还是我做错了什么？

import Text.ParserCombinators.ReadP
import qualified Data.IntSet as IntSet
import Data.Char

setsReader :: ReadP [ IntSet.IntSet ]
setsReader = 
    setReader `sepBy` ( char '\n' )

innocentWhitespace :: ReadP ()
innocentWhitespace = 
    skipMany $ (char ' ') <++ (char '\t' )

setReader :: ReadP IntSet.IntSet
setReader =  do 
    innocentWhitespace
    int_list <- integerReader `sepBy1`  innocentWhitespace
    innocentWhitespace 
    return $ IntSet.fromList int_list

integerReader :: ReadP Int
integerReader = do
    digits <- many1 $ satisfy isDigit 
    return $ read digits

readClusters:: String -> IO [ IntSet.IntSet ]
readClusters filename = do 
    whole_file <- readFile filename 
    return $ ( fst . last ) $ readP_to_S setsReader whole_file

Answer 1

setReader具有指数行为，因为它允许数字之间的空格可选。对于这一行：

12 34 56

看到这些解析：

[1,2,3,4,5,6]
[12,3,4,5,6]
[1,2,34,5,6]
[12,34,5,6]
[1,2,3,4,56]
[12,3,4,56]
[1,2,34,56]
[12,34,56]

你可以看到这对于长线来说可能会失控。 ReadP以递增的长度顺序返回所有有效解析，因此要到达最后一个解析，您必须遍历所有这些中间解析。变化：

int_list <- integerReader `sepBy1` innocentWhitespace

要：

int_list <- integerReader `sepBy1` mandatoryWhitespace

用mandatoryWhitespace的合适定义来压缩这种指数行为。 parsec使用的解析策略更能抵抗这种错误，因为它是贪婪的 - 一旦它消耗给定分支中的输入，它就会被提交到该分支并且永远不会返回（除非您明确要求它）。因此，一旦正确解析12，它就永远不会回到解析1 2。当然，这意味着你说出你的选择的顺序很重要，我总是觉得有点难以思考。

我也会用：

head [ x | (x,"") <- readP_to_S setsReader whole_file ]

要提取有效的整个文件解析，以防它非常快速地消耗所有输入但是有数百种方法来解释该输入。如果你不关心歧义，你可能宁愿它返回第一个而不是最后一个，因为第一个会更快到达。

纠正Haskell中的ReadP用法

1 个答案: