How do I capture links with optional whitespace in PHP?

Time: 2019-06-16 14:01:56

Tags: php regex preg-match-all regex-group

I fetch the HTML of a URL with file_get_contents:

$html = file_get_contents($url);

Now I want to capture the href links.

The HTML is:

<li class="four-column mosaicElement">
<a href="https://example.com" title="Lorem ipsum">
...
</a>
</li>
<li class="four-column mosaicElement">
<a href="https://example.org" title="Lorem ipsum">
...
</a>
</li>

So I am using this:

preg_match_all('/class=\"four-column mosaicElement\"><a href=\"(.+?)\" title=\"(.+?)"/m', $html, $urls, PREG_SET_ORDER, 0);

foreach ($urls as $key => $url) {
    echo $url[1];
}

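But it doesn't match anything when there is whitespace (such as a line break) between the <li> tag and the <a> tag. How can I fix this?

3 answers: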

Answer 0: (score: 3)

I was able to get your code working by changing the regex pattern to the following:

class="four-column mosaicElement">\s*<a href="(.+?)" title="(.+?)"
                                 ^^^^^

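Note that I allow any amount of whitespace (including none) between the class attribute of the outer <li> tag and the inner anchor.

Here is your updated script: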

$html = "<li class=\"four-column mosaicElement\">\n<a href=\"https://example.com\" title=\"Lorem ipsum\">\n</a>\n</li>\n<li class=\"four-column mosaicElement\">\n<a href=\"https://example.org\" title=\"Lorem ipsum\">\n</a>\n</li>";
preg_match_all('/class="four-column mosaicElement">\s*<a href="(.+?)" title="(.+?)"/m', $html, $urls, PREG_SET_ORDER, 0);

foreach ($urls as $key => $url) {
    echo $url[1] . "\n";
}    

This prints:

https://example.com
https://example.org
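
To see what \s* changes, here is a minimal sketch that runs the same pattern against three variants of the markup: no whitespace, a single newline, and a Windows line break plus indentation. The helper name extractLinks and the sample strings are made up for illustration and are not part of the question.

```php
<?php
// Hypothetical helper (not from the question): returns every href captured by the pattern.
function extractLinks(string $html): array
{
    $pattern = '/class="four-column mosaicElement">\s*<a href="(.+?)" title="(.+?)"/';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    // With PREG_SET_ORDER each match is [full match, href, title]; keep column 1 (the href).
    return array_column($matches, 1);
}

// Illustrative inputs that differ only in the whitespace between <li ...> and <a ...>.
$samples = [
    'none'     => '<li class="four-column mosaicElement"><a href="https://example.com" title="Lorem ipsum"></a></li>',
    'newline'  => "<li class=\"four-column mosaicElement\">\n<a href=\"https://example.com\" title=\"Lorem ipsum\"></a></li>",
    'indented' => "<li class=\"four-column mosaicElement\">\r\n\t  <a href=\"https://example.com\" title=\"Lorem ipsum\"></a></li>",
];

foreach ($samples as $label => $html) {
    echo $label . ': ' . implode(', ', extractLinks($html)) . "\n";
}
```

All three variants should yield the same link, because \s* matches zero or more spaces, tabs, and line breaks.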

Answer 1: (score: 3)

Another option is to use DOMXPath with an XPath expression that finds all list items carrying both class names and then selects their anchors:

//li[contains(@class, 'four-column') and contains(@class, 'mosaicElement')]/a

For example:

$string = <<<DATA
<li class="four-column mosaicElement">
<a href="https://example.com" title="Lorem ipsum">
</a>
</li>
<li class="four-column mosaicElement">
<a href="https://example.org" title="Lorem ipsum">
</a>
</li>
DATA;

$dom = new DOMDocument();
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);

foreach($xpath->query("//li[contains(@class, 'four-column') and contains(@class, 'mosaicElement')]/a") as $v) {
    echo $v->getAttribute("href") . PHP_EOL;
}

Result:

https://example.com
https://example.org
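
One caveat with contains(@class, 'four-column') is that it is a substring test, so it would also accept a class such as four-column-wide. A possible refinement, sketched below under the assumption that class names should match as whole tokens, pads the normalized class list with spaces before testing it. The sample markup (including the four-column-wide item) is invented for illustration, and libxml_use_internal_errors(true) is only there to keep parser warnings quiet on imperfect real-world HTML.

```php
<?php
// Sample markup (made up): the second item has a class that merely starts with "four-column".
$html = <<<HTML
<ul>
<li class="four-column mosaicElement"><a href="https://example.com" title="Lorem ipsum"></a></li>
<li class="four-column-wide mosaicElement"><a href="https://example.net" title="Lorem ipsum"></a></li>
<li class="mosaicElement  four-column"><a href="https://example.org" title="Lorem ipsum"></a></li>
</ul>
HTML;

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Pad the normalized class list with spaces so each name is matched as a whole token.
$classList = "concat(' ', normalize-space(@class), ' ')";
$query = "//li[contains($classList, ' four-column ') and contains($classList, ' mosaicElement ')]/a";

$hrefs = [];
foreach ($xpath->query($query) as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

echo implode(PHP_EOL, $hrefs) . PHP_EOL;
```

With this sample, only the example.com and example.org anchors are selected; the four-column-wide item is skipped, and the order of class names does not matter.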


Answer 2: (score: 1)

Just in case, we could also use an expression with a positive lookahead and optional whitespace around the URL:

(?=class="four-column mosaicElement")[\s\S]*?href="\s*(https?[^\s]+)\s*"

The URL we want is in this capture group:

(https?[^\s]+)

Test

$re = '/(?=class="four-column mosaicElement")[\s\S]*?href="\s*(https?[^\s]+)\s*"/m';
$str = '<li class="four-column mosaicElement">
<a href="https://example.com" title="Lorem ipsum">
...
</a>
</li>
<li class="four-column mosaicElement">
<a href="https://example.org" title="Lorem ipsum">

<li class="four-column mosaicElement">
<a href="   https://example.org   " title="Lorem ipsum">

<li class="four-column mosaicElement">
<a href="   https://example.org                " title="Lorem ipsum">
...
</a>
</li>
';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

foreach ($matches as $key => $url) {
    echo $url[1] . "\n";
}

Output

https://example.com
https://example.org
https://example.org
https://example.org
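
The pattern above works through backtracking: [^\s]+ first swallows the closing quote as well and then has to give it back. As a sketch of one possible variant (an assumption on my part, not the pattern from this answer), the character class can exclude the double quote too, and the class text can be matched directly instead of through a lookahead, which behaves the same for this input. It assumes the URLs contain neither whitespace nor double quotes.

```php
<?php
// Variant pattern (illustrative): [^"\s]+ stops at the first quote or whitespace,
// so the surrounding \s* absorbs any padding inside the href attribute.
$re = '/class="four-column mosaicElement"[\s\S]*?href="\s*(https?[^"\s]+)\s*"/';

$str = '<li class="four-column mosaicElement">
<a href="   https://example.com   " title="Lorem ipsum"></a>
</li>
<li class="four-column mosaicElement">
<a href="https://example.org" title="Lorem ipsum"></a>
</li>';

preg_match_all($re, $str, $matches, PREG_SET_ORDER);

// Column 1 of each match set is the captured URL, already free of padding.
$urls = array_column($matches, 1);
echo implode("\n", $urls) . "\n";
```

Both URLs come back without the padding spaces, so no extra trim() call is needed afterwards.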
