Question

I want to be able to extract tag names and query values.

Given the following query:

title:(Harry Potter) abc def author:'John' rating:5 jhi cost:"2.20" lmnop qrs

I would like to extract the following information:

title => Harry Potter
author => John
rating => 5
cost => 2.20
rest => abc def jhi lmnop qrs

Note that tag values can be enclosed in '..', "..." or (...). This is important.

I solved this using the following approach:

$query = "..."; // User input

while (preg_match(
    '@(?P<key>title|author|rating|cost):(?P<value>[^\'"(\s]+)@',
    $query,
    $matches
)) {
    echo $matches['key'] . " => " . $matches['value'];
    $query = trim(str_replace($matches[0], '', $query));
}

while (preg_match(
    '@(?P<key>title|author|rating|cost):[\'"(](?P<value>[^\'")]+)[\'")]@',
    $query,
    $matches
)) {
    echo $matches['key'] . " => " . $matches['value'];
    $query = trim(str_replace($matches[0], '', $query));
}

This works fine for many cases. However, there are a lot of edge cases:

1) For example, consider:

title:(John's) abc

This should give:

title => John's
rest => abc

but instead gives:

title => (John'
rest => s) abc

2) Also consider:

title: (foo (: bar)

This should give:

title => foo (: bar

but instead gives:

rest => (foo (bar)

What should I do? Is a regex the best way to go? How can I solve this?

Update: fixed an error in one of the expected outputs.

Answer 1

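It is not possible to parse everything correctly with a single regex the way you do, because the same rule does not apply to every (key, value) pair. For instance, a closing parenthesis is accepted in the middle of an author value but not in the middle of a title. A single quote is accepted in the middle of a title but not in the middle of an author, and so on. So even if your rules work for most cases, your second capturing group cannot be defined properly.

One way to improve your solution is to use a different regex for each tag. Then you can do something like this: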

$str   = "title:(foo (: bar) abc def ".
         "author:'John' "             .
         "rating:5 jhi "              .
         "cost:\"2.20\" "             .  // trailing space keeps the next word separate
         "lmnop qrs ";


$regex = array(
  "title"  => "/(?P<key>title):[[:space:]]*\((?P<value>[^\)]*)\)/"       ,
  "author" => "/(?P<key>author):[[:space:]]*'(?P<value>[^']*)'/"         ,
  "rating" => "/(?P<key>rating):[[:space:]]*(?P<value>[\d]+)/"           ,
  "cost"   => "/(?P<key>cost):[[:space:]]*\"(?P<value>[\d]+\.[\d]{2})\"/"
  );

foreach($regex as $k => $r)
{
  if(preg_match($r, $str, $matches))
  {
    echo $matches['key'] . " => " . $matches['value'] . "\n";
  }
  else
  {
    echo "Nothing found for " . $k . "\n";
  }
}

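Note, however, that this solution is not bulletproof. For instance, problems will arise if the title of the book contains the string author:'JOHN'.

In my opinion, the best way to avoid this kind of problem is to define a grammar for the input string and reject every string that does not conform to it. It also depends on your requirements and your application.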
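The grammar idea can be sketched roughly as follows. Everything in it is an assumption made for illustration: the function name parseStrict, the rule that a tag is a known key followed immediately by a delimited or bare value, and the rule that a free word may not contain a colon. The tokenizer walks the string from left to right with \G, so text inside a delimited value is never rescanned for tags, and input that does not fit the grammar is rejected.

```php
<?php
// Assumed grammar (illustrative only, not specified in the question):
//   query := (tag | word), separated by whitespace
//   tag   := key ":" ( "..." | '...' | (...) | bare-value )
//   word  := run of characters other than whitespace and ":"
function parseStrict(string $query): array
{
    $token = <<<'REGEX'
/\G\s*(?:
    (?P<key>title|author|rating|cost):
    (?|"(?P<value>[^"]*)"|'(?P<value>[^']*)'|\((?P<value>[^)]*)\)|(?P<value>[^\s:'"()]+))
  | (?P<word>[^\s:]+)
)(?=\s|$)/x
REGEX;

    $query = trim($query);
    $tags  = [];
    $words = [];
    $pos   = 0;

    // \G anchors every match at $pos, so tokens are consumed strictly in order.
    while (preg_match($token, $query, $m, 0, $pos)) {
        if (($m['key'] ?? '') !== '') {
            $tags[$m['key']] = $m['value'];
        } else {
            $words[] = $m['word'];
        }
        $pos += strlen($m[0]);
    }

    // Anything left over does not fit the grammar: reject the whole query.
    if ($pos < strlen($query)) {
        throw new InvalidArgumentException("Invalid query near offset $pos");
    }

    return ['tags' => $tags, 'rest' => implode(' ', $words)];
}
```

With this sketch, title:(a author:'JOHN' b) is read as a single title value, while an unterminated author:'John raises an exception instead of being half-parsed.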

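Edit

You wrote: "Note that tag values can be enclosed in '..', "..." or (...). It's important."

In that case, your problem is still that

[\'\"\(](?P<value>[^\'\"\)]+)[\'\"\)]

is not correct, because any opening delimiter can be closed by any closing delimiter. Instead, you want each pair of delimiters to match. Subpatterns have an option for this, the branch reset group (?|...):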

(?|\'(?P<value>[^\']+)\'|\"(?P<value>[^\"]+)\"|\((?P<value>[^\)]+)\))

If \ is used as the escape character, the code becomes

$str = 'title:"foo \" bar" abc def '.
       'author:(Joh\)n) '            .
       'rating:\'5\\\'4\' jhi '      .
       'cost:"2.20" '                .
       'lmnop qrs ';

$regex = "/(?P<key>title|author|rating|cost):[[:space:]]*" .
         "(?|" .
           "\"(?P<value>(?:(?:\\\\\")|[^\"])+)\"" . "|" . // matches "..."
           "\'(?P<value>(?:(?:\\\\\')|[^\'])+)\'" . "|" . // matches '...'
           "\((?P<value>(?:(?:\\\\\))|[^\)])+)\)" .       // matches (...)
         ")/";                                            // close (?|...

// Extract one tag at a time and remove it from the string until none is left.
while(preg_match($regex, $str, $matches))
{
  echo $matches['key'] . " => " . $matches['value'] . "\n";
  $str = str_replace($matches[0], '', $str);
}

Output

title => foo \" bar
author => Joh\)n
rating => 5\'4
cost => 2.20
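To also recover the rest part asked for in the question, one possible variation is to replace the loop with preg_replace_callback. This is only a sketch: parseQuery is a hypothetical name, and both the bare-value alternative (for input such as rating:5) and the acceptance of any backslash-escaped character are additions that are not in the code above.

```php
<?php
// Hypothetical helper: returns tag => value pairs plus the leftover text as "rest".
function parseQuery(string $query): array
{
    // The x flag allows whitespace and # comments inside the pattern.
    $regex = <<<'REGEX'
/\b(?P<key>title|author|rating|cost):\s*
(?|
    "(?P<value>(?:\\.|[^"\\])*)"     # "..." with backslash escapes
  | '(?P<value>(?:\\.|[^'\\])*)'     # '...'
  | \((?P<value>(?:\\.|[^)\\])*)\)   # (...)
  | (?P<value>[^\s'"(]+)             # bare value, an assumption for rating:5
)/x
REGEX;

    $tags = [];
    // Record each tag and replace it with a space so neighbouring words stay apart.
    $rest = preg_replace_callback(
        $regex,
        function (array $m) use (&$tags): string {
            $tags[$m['key']] = $m['value'];
            return ' ';
        },
        $query
    );

    // Collapse the whitespace left behind by the removed tags.
    $tags['rest'] = trim(preg_replace('/\s+/', ' ', $rest));
    return $tags;
}

$query = 'title:(Harry Potter) abc def author:\'John\' rating:5 jhi cost:"2.20" lmnop qrs';
foreach (parseQuery($query) as $key => $value) {
    echo $key . " => " . $value . "\n";
}
```

Because each delimiter pair has its own alternative, title:(John's) abc should come out as title => John's with rest => abc, which is the first edge case from the question.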
