Question

I have a long text that may contain links of the form schema://example.com/{entity}/{id}.

I need to extract them so that the result looks like:

{entity1} => {id1}
{entity1} => {id2}
{entity2} => {id3}
{entity2} => {id4}

I can extract all the URLs with

\bschema:\/\/(?:(?!&[^;]+;)[^\s"'<>)])+\b

and then parse them with

schema:\/\/example\.com\/(.*)\/(.*)

But I need a more optimized approach. Can you help me?

Answer 1

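I'm not sure I understand the full complexity of the problem, but this should do what you need.

I capture the entity and the id with a pattern, then join them with array_combine.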

preg_match_all("~schema://example\.com/(.*?)/(.*?)(\.|\s|$)~", $txt, $matches);

$arr = array_combine($matches[1], $matches[2]);
var_dump($arr);

https://3v4l.org/NGrFQ
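
One caveat worth illustrating: array_combine uses the entities as array keys, so if the same entity appears more than once (as in the question's desired output, where {entity1} maps to both {id1} and {id2}), later ids overwrite earlier ones. The following is a minimal sketch of one way around that, not part of the original answer. It assumes a made-up sample text in $txt and collects the ids per entity using PREG_SET_ORDER.

```php
<?php
// Hypothetical sample text: "bob" appears twice, which array_combine would collapse into one key.
$txt = "see schema://example.com/bob/1. and schema://example.com/bob/2 plus schema://example.com/john/3";

// PREG_SET_ORDER yields one array per match: [full match, entity, id].
preg_match_all('~\bschema://example\.com/([^/\s]+)/([^.\s]+)~', $txt, $matches, PREG_SET_ORDER);

$grouped = [];
foreach ($matches as $m) {
    // Append each id under its entity instead of overwriting the previous one.
    $grouped[$m[1]][] = $m[2];
}

var_export($grouped);
```

With this sample text, $grouped holds both ids for "bob" and a single id for "john". Whether a nested array is the right shape depends on how the pairs are consumed afterwards.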

Answer 2

As with all regex tasks, you can improve efficiency by using "negated character classes" and by minimizing your "capture groups".

Demo Link (Pattern #1: 62 steps) (Pattern #2: 60 steps and a smaller output array)

$string="bskdkbfnz schema://example.com/bob/1. flslnenf. Ddndkdn schema://example.com/john/2";

// This one uses negated character classes with 2 capture groups
var_export(preg_match_all("~\bschema://example\.com/([^/]*)/([^.\s]*)~",$string,$out)?array_combine($out[1],$out[2]):'no matches');

echo "\n";
// This one uses negated character classes with 1 capture group. \K restarts the fullstring match.
var_export(preg_match_all("~\bschema://example\.com/([^/]*)/\K[^.\s]*~",$string,$out)?array_combine($out[1],$out[0]):'no matches');

Output:

array (
  'bob' => '1',
  'john' => '2',
)
array (
  'bob' => '1',
  'john' => '2',
)

If you find that the second target substring matches too far because of some character, just add that character to the negated character class.
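
To make that adjustment concrete, here is a small sketch under an assumed input in which the ids are followed by a comma or a closing parenthesis. The sample string is invented for illustration. With the original class [^.\s] the ids would come out as "1)," and "2,", so the two offending characters are added to the class.

```php
<?php
// Hypothetical input: ids are followed by ")" or "," rather than "." or whitespace.
$string = "first (schema://example.com/bob/1), then schema://example.com/john/2, done";

// Adding "," and ")" to the negated class stops the id match before those characters.
preg_match_all('~\bschema://example\.com/([^/]*)/\K[^.\s,)]*~', $string, $out);

// $out[1] holds the entities, $out[0] holds the ids (the fullstring match after \K).
$result = array_combine($out[1], $out[0]);

var_export($result);
```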

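I can't be 100% confident about the variability of your data, but if the entity substring is always lowercase letters, you can use [a-z]. If the id substring is always numeric, you can use \d. This decision requires intimate knowledge of the expected input strings.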
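
As a rough sketch of that stricter variant, and only under the assumption that entities are lowercase ASCII letters and ids are digits, the pattern could look like the one below. The trailing link with "Alice" and "x9" is an invented example that does not fit the assumption, so it is skipped.

```php
<?php
// Sample string from above, plus one hypothetical link that breaks the assumed format.
$string = "bskdkbfnz schema://example.com/bob/1. flslnenf. Ddndkdn schema://example.com/john/2"
        . " schema://example.com/Alice/x9";

// Assumption: entity is one or more lowercase letters, id is one or more digits.
preg_match_all('~\bschema://example\.com/([a-z]+)/\K\d+~', $string, $out);

// $out[1] holds the entities, $out[0] holds the ids (the fullstring match after \K).
$strict = array_combine($out[1], $out[0]);

var_export($strict);
```

The trade-off is that links outside the assumed format are silently ignored, which may or may not be what you want.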

Regular expression to find URLs in text and parse them into URIs
