Question

The following regular expression, used with 'preg_match_all', extracts all hrefs from a page:

/\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+/ims

If the 'a' tag has a 'rel' attribute, I would like it returned along with the results. How can I modify the pattern above to include the 'rel' attribute (if it exists)?

Update: the following input:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do 
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut 
enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi 
ut aliquip ex ea commodo consequat. <a href="http://example.com" rel="nofollow">Duis</a>
nirure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui
officia deserunt mollit anim id est laborum.

returns:

Array
(
    [0] => Array
        (
            [0] =>  href="http://example.com" 
        )

    [1] => Array
        (
            [0] => http://example.com
        )

)

I would like it to return:

Array
(
    [0] => Array
        (
            [0] =>  href="http://example.com" rel="nofollow"
        )

    [1] => Array
        (
            [0] => http://example.com
        )

)

Answer 1

\s+href\s*=\s*[\"\']?(([^\s\"\']+)[\"\'\s]+rel="[^"]*")|\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+

You can use this pattern. It also captures rel when it is present.

See the demo:

http://regex101.com/r/jT3pG3/4
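As a rough sketch of how the pattern above behaves in PHP: the / delimiters, the i modifier, the sample input, and the variable names are assumptions added for illustration, not part of the answer. Because the pattern has two alternatives, the URL lands in group 2 when rel directly follows href and in group 3 otherwise.

```php
<?php
// The answer's pattern wrapped in / delimiters (delimiters and the i modifier are assumptions).
$pattern = '/\s+href\s*=\s*["\']?(([^\s"\']+)["\'\s]+rel="[^"]*")|\s+href\s*=\s*["\']?([^\s"\']+)["\'\s]+/i';

// Hypothetical input: one link with rel right after href, one link without rel.
$html = '<a href="http://example.com" rel="nofollow">Duis</a> <a href="http://example.org">aute</a>';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    // Group 2 is filled by the first alternative (href followed by rel),
    // group 3 by the second alternative (href only).
    $url = ($m[2] ?? '') !== '' ? $m[2] : ($m[3] ?? '');
    $links[] = ['match' => trim($m[0]), 'href' => $url];
}

print_r($links);
```

Note that this pattern only picks up rel when it comes immediately after href and uses double quotes.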

Answer 2

One option is to use a lookahead to capture it:

$regex = '~<a\b(?=(?>[^>]*rel\s*=\s*["\']([^"\']+))?)[^>]*href=\s*["\']\s*\K[^"\']+~';

Add the i modifier (PCRE_CASELESS) after the closing ~ delimiter to match case-insensitively.
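A minimal sketch of the case-insensitive variant, assuming a hypothetical input with upper-case tag and attribute names. The only change to the pattern is the trailing i.

```php
<?php
// Same pattern as above, with the i modifier appended after the closing ~ delimiter.
$regex = '~<a\b(?=(?>[^>]*rel\s*=\s*["\']([^"\']+))?)[^>]*href=\s*["\']\s*\K[^"\']+~i';

// Hypothetical input with upper-case tag and attribute names.
$str = '<A HREF="http://example.com" REL="nofollow">Duis</A>';

// $m[0] holds the URL (everything matched after \K), $m[1] the rel value.
$found = preg_match($regex, $str, $m);
print_r($m);
```

Without the i modifier the same input should produce no match, since the pattern spells `a`, `rel`, and `href` in lower case.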

See the further explanation and example on regex101, and the SO Regex FAQ.

When using preg_match_all, you may want to add the PREG_SET_ORDER flag:

preg_match_all($regex, $str, $out, PREG_SET_ORDER);
print_r($out);

The result looks like this:

Array
(
    [0] => Array
        (
            [0] => http://example.com
            [1] => nofollow
        )

    [1] => Array
        (
            [0] => http://example2.com
            [1] => nofollow
        )

)

See the test at eval.in.
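For reference, a self-contained sketch of the approach. The input string and the `$pairs` array are hypothetical and cover three cases: rel after href, rel before href, and no rel at all.

```php
<?php
$regex = '~<a\b(?=(?>[^>]*rel\s*=\s*["\']([^"\']+))?)[^>]*href=\s*["\']\s*\K[^"\']+~';

// Hypothetical input: rel after href, rel before href, and no rel at all.
$str = '<a href="http://example.com" rel="nofollow">one</a> '
     . '<a rel="nofollow" href="http://example2.com">two</a> '
     . '<a href="http://example3.com">three</a>';

preg_match_all($regex, $str, $out, PREG_SET_ORDER);

$pairs = [];
foreach ($out as $m) {
    // Group 1 is absent (or empty) when the tag has no rel attribute.
    $rel = ($m[1] ?? '') !== '' ? $m[1] : null;
    $pairs[] = ['href' => $m[0], 'rel' => $rel];
}

print_r($pairs);
```

Because the rel capture sits inside an optional lookahead, links without rel still match; only the capture group is missing, so it is worth guarding the access to `$m[1]`.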

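As others have mentioned, regex is not a perfect means of parsing HTML. Whether it is good enough depends on what you want to achieve and on what the input looks like, in particular whether it is your own input and you know what to expect.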
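If the input is not under your control, a parser-based approach may be more robust. The following is only a sketch of that alternative, not something the answer itself proposes. It assumes PHP's DOM extension is available, and the sample markup and variable names are hypothetical.

```php
<?php
// Hypothetical input; in practice this would be the fetched page source.
$html = '<p><a href="http://example.com" rel="nofollow">Duis</a> '
      . '<a href="http://example.org">aute</a></p>';

// Collect parser warnings about imperfect markup instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        // rel is null when the attribute is absent.
        'rel'  => $a->hasAttribute('rel') ? $a->getAttribute('rel') : null,
    ];
}

print_r($links);
```

This reads attributes regardless of their order or quoting style, which the regex variants above have to handle explicitly.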

Extracting rel together with href
