Recursive tokenizer in Haskell

Time: 2017-11-11 05:08:06

Tags: haskell

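I'm working in Haskell, preparing for a test. The current task asks me to tokenize a string according to the following rule: running "tokenize str separate remove" should output a list of strings. Every character of "str" that appears in the string "separate" should become a one-character string. Every character of "str" that appears in the string "remove" should be removed. Characters that appear in neither separate nor remove should be bundled together.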

The example shows that

tokenize "a + b* 12-def"   "+-*"   " "

should output

["a", "+", "b", "*", "12", "-", "def"]

My current code:

tokenize :: String -> String -> String -> [String]
tokenize [] _ _  = []
tokenize [x] _ _ = [[x]]
tokenize (x:xs) a b     | x `elem` a = [x] : tokenize xs a b
                        | x `elem` b = tokenize xs a b
                        | otherwise = (x:head rest) : tail rest
                                where
                                        rest = tokenize xs a b

It works to a degree; the problem is that the operators in the example get bundled with the characters before them, like this:

["a+","b*","12-","def"]

even though the operators are in the separate string.

1 answer:

Answer 0 (score: 1)

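First, tokenize [x] _ _ is probably not what you want, because tokenize "a" "" "a" ends up being ["a"] when it should be []. Second, don't call the separator and removal lists String. They are just [Char]. There is no difference underneath, because type String = [Char], but the point of a synonym is to make the semantics clearer, and you aren't really using those arguments as strings, so they don't deserve the name. Additionally, you should reorder the arguments to tokenize seps rems str, because that makes currying easier. Finally, you probably want Data.Set instead of [Char], but I won't use it here, to stay closer to the question.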

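The problem itself is | otherwise = (x:head rest) : tail rest, which sticks any non-special character onto the next token, even if that token is supposed to be a separator. In your case, one example is when head rest = "+" and x = 'a': you join them and get "a+". You need further guards.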

(Aside: your indentation is off. A where clause binds to the whole equation, so it is visible in every guard, and it should be indented so that this is clear.)

tokenize :: [Char] -> [Char] -> String -> [String]
tokenize _ _ "" = []
tokenize seps rems (x:xs)
  | x `elem` rems                      = rest
  | x `elem` seps                      = [x]:rest
  -- Pattern guard: if rest has a single-char token on top and that token is a sep...
  | ([sep]:_) <- rest, sep `elem` seps = [x]:rest
  -- Otherwise, if rest has a token on top (which isn't a sep), grow it
  | (growing:rest') <- rest            = (x:growing):rest'
  -- Or else make a new token (when rest = [])
  | otherwise                          = [x]:rest
  where rest = tokenize seps rems xs
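
To make the difference concrete, here is a minimal sketch that places the question's function next to the corrected one so both can be loaded into GHCi together. The names tokenizeOld and tokenizeNew are hypothetical and exist only to avoid a name clash; the logic is copied from the two versions above. The old version is also re-indented, with where at the level of the guards, to show the layout the aside describes.

```haskell
-- The question's version, renamed tokenizeOld (hypothetical name).
-- Argument order: input, separators, characters to remove.
-- The where clause sits at guard level because it belongs to the whole
-- equation and is visible in every guard.
tokenizeOld :: String -> String -> String -> [String]
tokenizeOld [] _ _  = []
tokenizeOld [x] _ _ = [[x]]   -- keeps the last character without consulting either list
tokenizeOld (x:xs) a b
  | x `elem` a = [x] : rest
  | x `elem` b = rest
  | otherwise  = (x : head rest) : tail rest   -- glues x onto whatever token comes next
  where
    rest = tokenizeOld xs a b

-- The corrected version, renamed tokenizeNew (hypothetical name).
-- Argument order: separators, characters to remove, input.
tokenizeNew :: [Char] -> [Char] -> String -> [String]
tokenizeNew _ _ "" = []
tokenizeNew seps rems (x:xs)
  | x `elem` rems                      = rest
  | x `elem` seps                      = [x] : rest
  -- The next token is a separator: start a new token instead of growing it.
  | ([sep]:_) <- rest, sep `elem` seps = [x] : rest
  -- The next token is ordinary text: prepend x to it.
  | (growing:rest') <- rest            = (x : growing) : rest'
  -- Nothing follows: x is a token on its own.
  | otherwise                          = [x] : rest
  where
    rest = tokenizeNew seps rems xs
```

In GHCi, tokenizeOld "a + b* 12-def" "+-*" " " evaluates to ["a+","b*","12-","def"], while tokenizeNew "+-*" " " "a + b* 12-def" evaluates to ["a","+","b","*","12","-","def"]. For the edge case from the first point, tokenizeOld "a" "" "a" gives ["a"] and tokenizeNew "" "a" "a" gives [].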

You can also use filter:

tokenize seps rems = tokenize' . filter (not . flip elem rems)
  where tokenize' "" = []
        tokenize' (x:xs)
          | x `elem` seps                      = [x]:rest
          | ([sep]:_) <- rest, sep `elem` seps = [x]:rest
          | (growing:rest') <- rest            = (x:growing):rest'
          | otherwise                          = [x]:rest
          where rest = tokenize' xs
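
The answer mentions Data.Set and easier currying without showing either. The following is one possible sketch, not the answer's own code: it assumes the containers package (which ships with GHC) is available, and the names tokenizeSet and tokenizeExpr are hypothetical. It reuses the filter-based structure above and only swaps the list membership tests for set membership tests.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

-- tokenizeSet (hypothetical name): same algorithm as the filter-based
-- version, with separators and removals passed as sets of characters.
tokenizeSet :: Set Char -> Set Char -> String -> [String]
tokenizeSet seps rems = go . filter (`Set.notMember` rems)
  where
    go "" = []
    go (x:xs)
      | x `Set.member` seps                      = [x] : rest
      | ([sep]:_) <- rest, sep `Set.member` seps = [x] : rest
      | (growing:rest') <- rest                  = (x : growing) : rest'
      | otherwise                                = [x] : rest
      where rest = go xs

-- With the input string as the last argument, partial application fixes
-- the separators and removals once. tokenizeExpr is a hypothetical name.
tokenizeExpr :: String -> [String]
tokenizeExpr = tokenizeSet (Set.fromList "+-*") (Set.fromList " ")
```

Here tokenizeExpr "a + b* 12-def" gives ["a","+","b","*","12","-","def"]. For lists as short as the ones in the question, elem on a list is unlikely to be slower in practice; a set mainly pays off when the separator or removal alphabet is large.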