Question

I am writing a Perl program that needs to parse a table written in a wiki markup language. The table syntax uses the pipe character '|' to separate columns.

| row 1 cell 1    |row 1 cell 2  | row 1 cell 3|
| row 2 cell 1    | row 2 cell 2 |row 2 cell 3|

A cell may contain zero or more hyperlinks, whose syntax looks like this:

[[wiki:path:to:page|Page Title]]   or
[[wiki:path:to:page]]

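Note that a hyperlink may contain a pipe character. Here, however, it is "quoted" by the [[..]] brackets.

Hyperlinks may not be nested.

To match and capture the first cell in each of these table rows,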

| Potatoes [[path:to:potatoes]]           | Daisies           |
| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|

I tried:

qr{\|                      # match literal pipe
    (.*?                   # non-greedy zero or more chars
        (?:\[\[.*?\]\])    # a hyperlink 
     .*?)                  # non-greedy zero or more chars
   \|}x                    # match terminating pipe

This works, and $1 contains the cell contents.

Then, to match

| Potatoes            | Daisies           |

I tried making the hyperlink optional:

qr{\|                      # match literal pipe
    (.*?                   # non-greedy zero or more chars
        (?:\[\[.*?\]\])?   # <-- OPTIONAL hyperlink 
     .*?)                  # non-greedy zero or more chars
   \|}x                    # match terminating pipe

This works, but when parsing

| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|

I get only

 Kiki fruit [[path:to:kiwi

Evidently, given the option, the regex ignores the hyperlink pattern and treats the embedded pipe as a column delimiter.
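For reference, a minimal script that reproduces this behaviour (the variable names are only illustrative):

```perl
use strict;
use warnings;

# Same pattern as above, with the hyperlink group made optional.
my $cell_re = qr{\|                 # match literal pipe
    (.*?                            # non-greedy zero or more chars
        (?:\[\[.*?\]\])?            # optional hyperlink
     .*?)                           # non-greedy zero or more chars
   \|}x;                            # match terminating pipe

my $row = '| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|';

# The optional group is allowed to match nothing, so the lazy .*? simply
# runs up to the first pipe it finds, which is the one inside the link.
my ($first) = $row =~ $cell_re;
print "[$first]\n";
```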

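I'm stuck here. I also still haven't dealt with the possibility of a cell containing more than one hyperlink, or with the trailing pipe of one cell serving as the leading pipe on the next iteration.

Using a regexp with Perl's split function is not a requirement - I can write the splitting loop myself if that is easier. I have seen many similar questions asked, but none seems close enough to this problem.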

Answer 1

$ perl -MRegexp::Common -E '$_=shift; while (
  /\| # beginning pipe, and consume it
  (   # capture 1
    (?:  # inside the pipe we will do one of these:
      $RE{balanced}{-begin=>"[["}{-end=>"]]"} # something with balanced [[..]]
      |[^|] # or a character that is not a pipe
    )* # as many of those as necessary
  ) # end capture one
  (?=\|) # needs to go to the next pipe, but do not consume it so g works
  /xg
) { say $1 }' '| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|'
 Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  
             Lemons

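This seems to extract the cells you are looking for. However, I suspect you would be better off with a proper parser for this language. I'd be surprised if there were nothing on CPAN, but even if there isn't, writing a parser for this would probably still be better, especially once you start getting stranger things in tables that you need to handle.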
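If pulling in Regexp::Common is not an option, the same idea can probably be sketched with core Perl only. Because the question says links are never nested, a plain non-greedy `\[\[.*?\]\]` can stand in for the balanced-delimiter pattern. This is a rough sketch rather than a tested parser: `split_row` is a made-up helper name, and it assumes every row ends with a pipe and every `[[` has a matching `]]` inside the same cell.

```perl
use strict;
use warnings;

# Split one table row into its cells. A cell is a run of complete
# [[...]] links or single non-pipe characters, so a pipe inside a
# link never ends the cell.
sub split_row {
    my ($row) = @_;
    my @cells;
    while ($row =~ /\|                  # leading pipe, consumed
                    ((?:\[\[.*?\]\]     # a whole (non-nested) link
                       |[^|]            # or any character but a pipe
                     )*)
                    (?=\|)              # next pipe, left for the next match
                   /xg) {
        push @cells, $1;
    }
    return @cells;
}

my @cells = split_row('| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|');
s/^\s+|\s+$//g for @cells;              # trim surrounding whitespace
print "[$_]\n" for @cells;
```

The link alternative is listed first so that it is tried before the single-character fallback. A lone `[` that does not start a complete link falls through to `[^|]`.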
