Question

NB: I am using this Alex template from Simon Marlow.

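I would like to create a lexer for C-style comments. My current approach is to create separate tokens for the start of a comment, its end, its middle, and single-line comments: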

%wrapper "monad"

tokens :-
  <0> $white+ ;
  <0> "/*"               { mkL LCommentStart `andBegin` comment }
  <comment> .            { mkL LComment }
  <comment> "*/"         { mkL LCommentEnd `andBegin` 0 }
  <0> "//" .*$           { mkL LSingleLineComment }

data LexemeClass
  = LEOF
  | LCommentStart
  | LComment
  | LCommentEnd
  | LSingleLineComment

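How can I reduce the number of intermediate tokens? For the input /*blabla*/ I get 8 tokens instead of one!
How can I strip the // part from the single-line comment token?
Is it possible to lex comments without the monad wrapper?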

Answer 1

Have a look at this:

http://lpaste.net/107377

Test it with something like:

echo "This /* is a */ test" | ./c_comment

which should print:

Right [W "This",CommentStart,CommentBody " is a ",CommentEnd,W "test"]

The key Alex routines you need to use are:

alexGetInput -- gets the current input state
alexSetInput -- sets the current input state
alexGetByte  -- returns the next byte and input state
andBegin     -- return a token and set the current start code

Each of the routines commentBegin, commentEnd and commentBody has the following signature:

AlexInput -> Int -> Alex Lexeme

where Lexeme stands for your token type. For the monad wrapper, the AlexInput parameter has the form:

(AlexPosn, Char, [Byte], String)

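The Int parameter is the length of the match, and the matched text itself sits at the front of the String field. Most token handlers will therefore have the form: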

handler :: AlexInput -> Int -> Alex Lexeme
handler (pos,_,_,inp) len = ... do something with (take len inp) and pos ...

In general, it seems that a handler can ignore the Char and [Byte] fields.
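As an illustration of that handler shape, here is a minimal sketch of a handler for the "//" rule that drops the leading "//" from the matched text. The AlexPosn and AlexInput definitions are stand-ins for what Alex generates with the monad wrapper, Lexeme and singleLineComment are hypothetical names, and the result type is generalised to any monad so the snippet compiles on its own; in a real lexer it would be Alex Lexeme.

```haskell
import Data.Word (Word8)

-- Stand-ins for the types Alex generates with the monad wrapper.
-- In a real lexer these come from the generated .hs file.
data AlexPosn = AlexPn !Int !Int !Int deriving (Eq, Show)
type AlexInput = (AlexPosn, Char, [Word8], String)

-- Hypothetical token type for this example.
data Lexeme = SingleLineComment AlexPosn String deriving (Eq, Show)

-- The matched text is the first 'len' characters of the String field.
-- Dropping two characters removes the leading "//".
-- Assumption: generalised over any monad so the sketch stands alone;
-- with Alex the signature would be AlexInput -> Int -> Alex Lexeme.
singleLineComment :: Monad m => AlexInput -> Int -> m Lexeme
singleLineComment (pos, _, _, inp) len =
  return (SingleLineComment pos (drop 2 (take len inp)))
```

Used in a rule such as <0> "//" .*$ { singleLineComment }, this would yield the comment text without the "//" prefix.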

The handlers commentBegin and commentEnd can ignore both the AlexInput and the Int parameters, because they only match fixed-length strings.

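The commentBody handler accumulates the body of the comment by calling alexGetByte until "*/" is found. As far as I know, C comments may not be nested, so the comment ends at the first occurrence of "*/".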

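Note that the first character of the comment body is in the match0 variable. In fact, my code has a bug, because it does not match "/**/" correctly. It should look at match0 to decide whether to start at loop or at loopStar.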
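The two-state scan can be sketched in pure Haskell, independent of Alex. This is not the code from the paste: scanComment is a hypothetical function that works on a plain String instead of calling alexGetByte, and it assumes scanning starts immediately after "/*", so no character has been consumed beforehand and "/**/" needs no special case.

```haskell
-- Input is the text right after "/*". Returns the comment body and the
-- remaining input after "*/", or Nothing if the comment is never closed.
scanComment :: String -> Maybe (String, String)
scanComment = loop []
  where
    -- Normal state. 'acc' holds the body in reverse order.
    loop _   []         = Nothing
    loop acc ('*' : cs) = loopStar acc cs
    loop acc (c   : cs) = loop (c : acc) cs

    -- A '*' has just been seen and is not yet part of the body.
    loopStar _   []         = Nothing
    loopStar acc ('/' : cs) = Just (reverse acc, cs)
    loopStar acc ('*' : cs) = loopStar ('*' : acc) cs
    loopStar acc (c   : cs) = loop (c : '*' : acc) cs
```

In an Alex handler the same state machine would read from alexGetByte and finish by calling alexSetInput with the input state that follows "*/".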

You can use the same technique to parse "//" style comments, or any token that requires a non-greedy match.

Another key point is that patterns like $white+ are qualified with a start code:

<0>$white+

This is done so that they are not active while a comment is being processed.

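You can use another wrapper, but note that the structure of the AlexInput type may be different. For the basic wrapper, for example, it is just a 3-tuple: (Char,[Byte],String). Just look at the definition of AlexInput in the generated .hs file.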

A final note: accumulating characters with ++ is, of course, rather inefficient. You will probably want to use Text (or ByteString) for the accumulator.
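One possible shape for a Text-based accumulator is sketched below, assuming the text package is available. It uses Data.Text.Lazy.Builder so that each character is appended cheaply and the result is converted to strict Text once at the end. The name collect is hypothetical, and a real handler would feed it characters obtained from alexGetByte rather than a ready-made String.

```haskell
import Data.List (foldl')
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as B

-- Append characters one at a time to a Builder, then build the Text once.
collect :: String -> T.Text
collect = TL.toStrict . B.toLazyText . foldl' step mempty
  where
    step acc c = acc <> B.singleton c
```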
