Question

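I'm working on moving some blog posts to a new third-party host and need to replace some existing URLs with new ones. I can't use XML, and I'm forced to use a wrapper class that requires this search to be done with regex. I'm currently having trouble regex-ing the URLs that exist in the HTML. For example, if the HTML is: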

<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>

I need my regex to return:

http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345

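The beginning of the URL never changes (the "http://www.website.com/article/" part). However, I don't know what the slug phrases will be, only that they will contain an unknown number of hyphens between words. The ID number at the end of the URL can be any integer.

Each article contains multiple links of this type, and there are other kinds of URLs in the articles that I want to make sure are ignored, so I can't just look for quoted phrases that start with http.

FWIW: I'm using PHP, and I'm currently trying to use preg_match_all to return an array of the wanted URLs.

Here is my latest attempt: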

$array_of_urls = [];
preg_match_all('/http:\/\/www\.website\.com\/article\/[^"]*/', $variable_with_html, $array_of_urls);
var_dump($array_of_urls);

And then I get nada dumped. Any help is appreciated!

Answer 1

We Stack Overflow volunteers have to insist that you enjoy the stability of a DOM parser, rather than regex, when parsing HTML data.

Code: (Demo)

$html=<<<HTML
<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
<p>Here is a url as plain text: http://www.website.com/article/sluggy-slug</p>
<div>Here is a qualifying link: <a href="http://www.website.com/article/slugger-sluggington-jr/666">Whatever</a></div>
HTML;

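$output = [];  // initialize so var_export() never sees an undefined variable
$dom = new DOMDocument();
$dom->loadHTML($html);
// Collect the href attribute of every <a> element; plain-text URLs are not picked up
foreach ($dom->getElementsByTagName('a') as $item) {
    $output[] = $item->getAttribute('href');
}
var_export($output);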

Output:

array (
  0 => 'http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345',
  1 => 'http://www.website.com/article/slugger-sluggington-jr/666',
)
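The loop above collects every href in the document, while the question asks that other kinds of URLs be ignored. A minimal sketch of one way to filter the collected values, assuming every wanted URL has the shape fixed prefix, hyphenated slug, slash, integer ID. The `/about` link, the `$pattern` check, and the `$articleUrls` name are illustrative assumptions, not part of the original answer:

```php
<?php
// Hypothetical sample input; the /about link stands in for "other" URLs to ignore.
$html = <<<HTML
<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
<p><a href="http://www.website.com/about">About</a></p>
<div><a href="http://www.website.com/article/slugger-sluggington-jr/666">Whatever</a></div>
HTML;

// Assumed URL shape: fixed prefix, slug of word characters and hyphens, then an integer ID.
$pattern = '~^http://www\.website\.com/article/[\w-]+/\d+$~';

// Buffer libxml warnings (real-world markup often triggers them) instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$articleUrls = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // Keep only hrefs that match the assumed article URL shape.
    if (preg_match($pattern, $href) === 1) {
        $articleUrls[] = $href;
    }
}
var_export($articleUrls);
```

The regex here only validates a string that the DOM parser has already extracted, so it never has to cope with HTML syntax itself.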

If, for some crazy reason, the above doesn't work for your project and you must use regex, this should suffice:

~<a.*?href="\K[^"]+~i  // using case-insensitive flag in case of all-caps syntax

Pattern Demo
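For completeness, here is a rough sketch of how a pattern like this might be used with preg_match_all. It is a tightened variant, not the exact pattern above: it assumes the known prefix and the slug/ID shape from the question, and the `/about` link and the variable names are hypothetical. Note that preg_match_all fills a nested array, with the full matches in element 0:

```php
<?php
// Hypothetical sample input: two article links, one plain-text URL, one unrelated link.
$html = <<<HTML
<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
<p>Here is a url as plain text: http://www.website.com/article/sluggy-slug</p>
<p><a href="http://www.website.com/about">About</a></p>
<div>Here is a qualifying link: <a href="http://www.website.com/article/slugger-sluggington-jr/666">Whatever</a></div>
HTML;

// Match inside an <a ...> tag up to href=", then \K discards what was matched so far,
// so the reported match is only the URL. The lookahead requires the closing quote.
$pattern = '~<a\b[^>]*?\bhref="\Khttp://www\.website\.com/article/[\w-]+/\d+(?=")~i';

$count = preg_match_all($pattern, $html, $matches);

// With the default flags, $matches[0] holds the full-pattern matches.
$urls = $matches[0];
var_export($urls);
```

Restricting the tag portion to `[^>]*?` keeps the match from running past the end of the opening tag, which the looser `.*?` does not guarantee.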
