Question

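I am trying to create an LPeg pattern that matches any Unicode punctuation character in UTF-8 encoded input. I came up with the following marriage of Selene Unicode and LPeg: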

local unicode     = require("unicode")
local lpeg        = require("lpeg")
local any         = lpeg.P(1) -- any single byte
local punctuation = lpeg.Cmt(lpeg.Cs(any * any^-3), function(s,i,a)
  local match = unicode.utf8.match(a, "^%p")
  if match == nil then
    return false
  else
    return i+#match
  end
end)

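This appears to work, but I have three concerns. It will miss punctuation characters that are a combination of several Unicode code points (if such characters exist), because I only read 4 bytes ahead. It probably hurts the performance of the parser. And it is undefined what the library's match function will do when I feed it a string that contains a runt (truncated) UTF-8 character, although it appears to work for now.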

I would like to know whether this is a correct approach, or whether there is a better way to achieve what I am trying to do.

Answer 1

The example on the LPeg homepage shows the correct way to match a UTF-8 character. The first byte of a UTF-8 character determines how many more bytes belong to it:

local cont = lpeg.R("\128\191") -- continuation byte

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\223") * cont
           + lpeg.R("\224\239") * cont * cont
           + lpeg.R("\240\244") * cont * cont * cont
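To make the lead-byte rule concrete, here is a minimal pure-Lua sketch that mirrors the byte ranges of the pattern above. The helper names `utf8_seq_len` and `utf8_chars` are hypothetical, and unlike the LPeg pattern this sketch does not check that the trailing bytes are valid continuation bytes:

```lua
-- Hypothetical helper: number of bytes in a UTF-8 sequence, judged from its
-- lead byte alone. The ranges are the same ones used in the LPeg pattern.
local function utf8_seq_len(lead)
  if lead <= 127 then return 1
  elseif lead >= 194 and lead <= 223 then return 2
  elseif lead >= 224 and lead <= 239 then return 3
  elseif lead >= 240 and lead <= 244 then return 4
  end
  return nil -- continuation byte (128-191) or a byte that never starts a sequence
end

-- Hypothetical helper: split a string into UTF-8 characters by stepping from
-- lead byte to lead byte. Returns nil plus the offending position on bad input.
local function utf8_chars(s)
  local out, i = {}, 1
  while i <= #s do
    local n = utf8_seq_len(s:byte(i))
    if not n or i + n - 1 > #s then return nil, i end
    out[#out + 1] = s:sub(i, i + n - 1)
    i = i + n
  end
  return out
end

-- "a", U+00BF (2 bytes), U+3002 (3 bytes)
local chars = utf8_chars("a\194\191\227\128\130")
print(#chars) -- 3
```

A truncated (runt) sequence such as "\227\128" makes `utf8_chars` return nil, which is the same kind of input the LPeg pattern refuses to match.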

Building on this utf8 pattern, we can use lpeg.Cmt and the Selene Unicode match function, much as you proposed:

local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
    if unicode.utf8.match(c, "%p") then
        return i
    end
end)

Note that we return i, which is in line with what Cmt expects:

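"The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture values produced by patt. The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the returned number becomes the new current position."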

This means we should return the same number the function receives, that is, the position just after the UTF-8 character.
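As a rough illustration of how the returned position behaves, the sketch below uses the same Cmt structure but swaps in a hypothetical stand-in classifier, `is_punct`, so that it runs with LPeg alone. The stand-in only knows ASCII punctuation plus one hand-listed character; a real version would call unicode.utf8.match as shown above. The names `utf8char`, `extra`, and `positions` are likewise illustrative:

```lua
local lpeg = require("lpeg")

-- One UTF-8 encoded character (same byte ranges as the pattern above).
local cont = lpeg.R("\128\191")
local utf8char = lpeg.R("\0\127")
               + lpeg.R("\194\223") * cont
               + lpeg.R("\224\239") * cont * cont
               + lpeg.R("\240\244") * cont * cont * cont

-- Assumption: a stand-in for unicode.utf8.match(c, "%p").
local extra = { ["\227\128\130"] = true } -- U+3002 IDEOGRAPHIC FULL STOP
local function is_punct(c)
  return extra[c] or c:match("^%p$") ~= nil
end

-- i is already the position after the character, so it is returned unchanged.
local punctuation = lpeg.Cmt(lpeg.C(utf8char), function(s, i, c)
  if is_punct(c) then return i end
  -- returning nothing makes the match fail
end)

print(lpeg.match(punctuation, "!a"))            -- 2
print(lpeg.match(punctuation, "a!"))            -- nil
print(lpeg.match(punctuation, "\227\128\130x")) -- 4

-- Collect the start position of every punctuation character in a string.
local positions = lpeg.Ct((lpeg.Cp() * punctuation + utf8char)^0)
local found = lpeg.match(positions, "a,b!")
print(found[1], found[2]) -- 2   4
```

Because the character is consumed by utf8char before the function runs, the function never sees a partial sequence, and the cost per character is one pattern match plus one classifier call.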

Matching Unicode punctuation using LPeg
