Question

I have some HTML that contains many hrefs, but I don't need all of them. I only want the hrefs contained in this div:

<div class="category-map second-links"> 
*****
</div> <p class="sec">

The result I expect is:

<a href='xxx'>yyy</a>
<a href='zzz'>www</a>
...

My version (doesn't work):

(?<=<div class=\"category-map second-links\">)(.+?(<a href=\".+?".+?>.+<\/a>))+(?=<\/div> <p class="sec">)

Answer 1

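Disclaimer: you are better off using a proper HTML parser. This answer is for educational purposes, although it is more reliable than an ordinary regex, provided the HTML is valid :P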

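Regex is awesome

So I decided to do this in two parts:

Match everything inside <div class="category-map second-links"></div>, even when it is nested.
Loop through those matches and match <a></a>. I chose to keep this part simple, since I don't expect links to be nested.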

The hard part

So here is the regex. We'll use a recursive pattern and the xsi modifiers:

<div\s+class\s*=\s*"\s*category-map\s+second-links\s*"\s*>    # match a certain div with a certain classes
(?:                                                           # non-capturing group
   (?:<!--.*?-->)?                                            # Match the comments !
   (?:(?!</?div[^>]*>).)                                      # check if there is no start/closing tag
   |                                                          # or (which means there is)
   (?R)                                                       # Recurse the pattern, it's the same as (?0)
)*                                                            # repeat zero or more times
</div\s*>                                                     # match the closing tag
(?=.*?<p\s+class\s*=\s*"\s*sec\s*"\s*>)                       # make sure there is <p class="sec"> ahead of the expression
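The recursion in the pattern above is easier to follow in isolation. The following is a minimal sketch, not part of the original answer: it applies the same (?R) construct to balanced parentheses instead of divs, and the sample string is made up for illustration.

```php
<?php
// (?R) re-enters the whole pattern, so every "(" must be closed by its own ")".
$pattern = '~
    \(              # an opening parenthesis
    (?:             # then, repeated:
        [^()]++     #   a run of non-parenthesis characters
        |           #   or
        (?R)        #   a complete nested (...) group
    )*
    \)              # the matching closing parenthesis
~x';

// Hypothetical sample input with one nested group and one flat group.
preg_match_all($pattern, 'f(a(b)c) and (d)', $m);
print_r($m[0]); // the two top-level groups: "(a(b)c)" and "(d)"
```

The div pattern works the same way: the alternation either consumes one character that does not start a div tag, or recurses to swallow a whole nested div.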

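Modifiers:

s: the dot metacharacter in the pattern matches all characters, including newlines.
x: whitespace characters in the pattern are ignored entirely, except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline (inclusive) are ignored as well. This is equivalent to Perl's /x modifier and makes it possible to include comments inside complicated patterns.
i: matching is case-insensitive.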
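To see what each modifier changes, here is a small sketch that runs the same anchor pattern with and without them. The sample text is an assumption made up for this illustration.

```php
<?php
// Hypothetical input: upper-case tag names and content spread over several lines.
$text = "<A HREF='x'>\nlabel\n</A>";

// Without modifiers: "." stops at newlines and "a" does not match "A".
$plain = preg_match('~<a[^>]*>.*?</a>~', $text);

// With s and i: "." crosses newlines and case is ignored.
$si = preg_match('~<a[^>]*>.*?</a>~si', $text);

// With x added: whitespace and # comments inside the pattern are ignored.
$xsi = preg_match('~
    <a[^>]*>   # opening tag
    .*?        # content, as little as possible
    </a>       # closing tag
~xsi', $text);

var_dump($plain, $si, $xsi); // int(0), int(1), int(1)
```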

The easy part

Matching a tags isn't that hard, as long as there is no crazy stuff like <a title="</a>"></a>:

<a[^>]*>    # match the beginning a tag
.*?         # match everything ungreedy until ...
</a\s*>     # match </a       > or </a>
# Not forgetting the xsi modifiers

Wrapping everything up in PHP

$input = '<div class="category-map second-links"> 
*****
    <!--<div class="category-map second-links"> Comment hacks --> 
    <div class="category-map second-links">
        <a href=\'xxx\'>yyy</a>
        <a href=\'zzz\'>www</a>
...
    </div>
<div class="category-map second-links"> 
*****
    <!--<div class="category-map second-links"> Comment hacks --> 
    <div class="category-map second-links">
        <a href=\'aaa\'>bbb</a>
        <a href=\'ccc\'>ddd</a>
...
    </div>
</div> <p class="sec">';

$links = array();

preg_match_all('~
<div\s+class\s*=\s*"\s*category-map\s+second-links\s*"\s*>    # match a certain div with a certain classes
(?:                                                           # non-capturing group
   (?:<!--.*?-->)?                                            # Match the comments !
   (?:(?!</?div[^>]*>).)                                      # check if there is no start/closing tag
   |                                                          # or (which means there is)
   (?R)                                                       # Recurse the pattern, it\'s the same as (?0)
)*                                                            # repeat zero or more times
</div\s*>                                                     # match the closing tag
(?=.*?<p\s+class\s*=\s*"\s*sec\s*"\s*>)                       # make sure there is <p class="sec"> ahead of the expression
~sxi', $input, $matches);

if(isset($matches[0])){
    foreach($matches[0] as $match){
        preg_match_all('~
                            <a[^>]*>    # match the beginning a tag
                            .*?         # match everything ungreedy until ...
                            </a\s*>     # match </a       > or </a>
                        ~isx', $match, $tempLinks);
        if(isset($tempLinks[0])){
            // merge rather than push, so $links stays a flat list of <a> tags
            $links = array_merge($links, $tempLinks[0]);
        }
    }
}

if(!empty($links)){
    print_r($links);
}else{
    echo 'empty :(';
}


Answer 2

If you load the HTML into a DOM document, you can use XPath to query nodes from it.

All a elements in the document:

//a

That have an ancestor/parent div element:

//a[ancestor::div]

With the class attribute category-map second-links:

//a[ancestor::div[@class = "category-map second-links"]]

Get the href attributes of the filtered elements (optional):

//a[ancestor::div[@class = "category-map second-links"]]/@href

Full example:

$html = <<<'HTML'
<div class="category-map second-links"> 
*****
    <!--<div class="category-map second-links"> Comment hacks --> 
    <div class="category-map second-links">
        <a href='xxx'>yyy</a>
        <a href='zzz'>www</a>
...
    </div>
<div class="category-map second-links"> 
*****
    <!--<div class="category-map second-links"> Comment hacks --> 
    <div class="category-map second-links">
        <a href='aaa'>bbb</a>
        <a href='ccc'>ddd</a>
...
    </div>
</div> <p class="sec">
HTML;

$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);

// fetch the href attributes
$hrefs = array();
foreach ($xpath->evaluate('//a[ancestor::div[@class = "category-map second-links"]]/@href') as $node) {
  $hrefs[] = $node->value;
}
var_dump($hrefs);

// fetch the a elements and read some data from them
$linkData = array();
foreach ($xpath->evaluate('//a[ancestor::div[@class = "category-map second-links"]]') as $node) {
  $linkData[] = array(
    'href' => $node->getAttribute('href'),
    'text' => $node->nodeValue,
  );
}
var_dump($linkData);

// fetch the a elements and store their html
$links = array();
foreach ($xpath->evaluate('//a[ancestor::div[@class = "category-map second-links"]]') as $node) {
  $links[] = $dom->saveHtml($node);
}
var_dump($links);
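One caveat with @class = "category-map second-links" is that it compares the whole attribute string, so it stops matching if the classes appear in another order or alongside extra classes. The following is a sketch of a token-based test, not part of the original answer. The sample markup and the $hasClass helper are hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical markup: the same two classes, reordered and mixed with a third one.
$html = '<div class="second-links extra category-map"><a href="xxx">yyy</a></div>'
      . '<div class="other"><a href="nope">no</a></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: pad the normalized class list with spaces,
// then look for the class name as a whole space-delimited token.
$hasClass = function ($name) {
    return 'contains(concat(" ", normalize-space(@class), " "), " ' . $name . ' ")';
};

$query = '//a[ancestor::div['
       . $hasClass('category-map') . ' and ' . $hasClass('second-links')
       . ']]/@href';

$hrefs = array();
foreach ($xpath->evaluate($query) as $node) {
    $hrefs[] = $node->value;
}
var_dump($hrefs); // only the href from the first div: "xxx"
```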

Answer 3

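Use Simple HTML DOM (simple_html_dom):

// Assumes the Simple HTML DOM library file is available on the include path
require_once 'simple_html_dom.php';

// Create DOM from URL
$html = file_get_html('<YOU_WEBSITE_URL_HERE>');

// Find specific tag
$anchors = array();
foreach($html->find('div.category-map.second-links a') as $anchor) {
    $anchors[] = $anchor;
}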

print_r($anchors);

Answer 4

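If you want to use regex, you will probably need two regex queries: one to get all the divs, and a second one to find the hrefs inside each div.

Because with a single query like this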

"<div.*?<a href='(?<data>.*?)'.*?</div>"

you will get only one href from any div that contains several.
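The difference can be sketched in PHP as follows. This is only an illustration under stated assumptions: the sample HTML is made up, the divs are assumed not to be nested, and the second and third patterns are simplified stand-ins rather than the answer's own code.

```php
<?php
// Hypothetical sample input: one div holding two links.
$html = "<div class='box'><a href='one'>1</a><a href='two'>2</a></div>";

// Single query: the pattern consumes the whole div, so only the first href is captured.
preg_match_all("~<div.*?<a href='(?<data>.*?)'.*?</div>~s", $html, $single);

// Two queries: first isolate each (non-nested) div, then collect every href inside it.
$hrefs = array();
preg_match_all('~<div.*?</div>~s', $html, $divs);
foreach ($divs[0] as $div) {
    preg_match_all("~<a href='(.*?)'~", $div, $found);
    $hrefs = array_merge($hrefs, $found[1]);
}

print_r($single['data']); // the single query yields just "one"
print_r($hrefs);          // the two-step version yields "one" and "two"
```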

So you can do this with the DOM instead:

$dom->find('div a')->attrib('href');

I'm not sure the DOM code above works 100%, but I'm giving it to you as a hint. Hopefully you can build a working version from it.

Regex: finding tags between specific tags
