Get the value of an <a> inside a <div class="">?

Time: 2017-09-30 15:13:41

Tags: web-crawler html-parsing

I'm looking for a way, in PHP, to extract the text of an <a> that has no class or id but sits inside a <div> that does have a class.

Here is the HTML to crawl:

<div class="myclass">
    <a href="/to">value to crawl</a>
</div>

Here is the line of my PHP code (which does not work):

preg_match_all('<div class=\"myclass\"><a>(.*)<\/a><\/div>', $myhtml, $match);

Thanks for your response :)

1 answer:

Answer 0 (score: 1)

A parser is a better solution:

$html = '<div class="myclass">
    <a href="/to">value to crawl</a>
</div>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Use // so the div is found at any depth: loadHTML() wraps the fragment in <html><body>.
$a_s = $xpath->query('//div[contains(@class, \'myclass\')]/a');
foreach($a_s as $a) {
    if(empty($a->getAttribute('class')) && empty($a->getAttribute('id'))) {
        echo $a->nodeValue;
    } else {
        echo 'not';
    }
}

https://3v4l.org/YmCAv

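As for why your regex does not work:

  1. <a> does not exist in your string (the tag carries an href attribute).
  2. Regular expressions in PHP need delimiters.
  3. >< does not exist in your string either (there is whitespace between the tags).
  4. Forward slashes and double quotes do not need to be escaped unless they are used as the delimiter; they have no special meaning in a regex. (In the regex below I use the forward slash as the delimiter, so I kept it escaped.)
  5. So the corrected regex would be: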

    /<div class="myclass">\s*<a.*?>(.*?)<\/a>\s*<\/div>/
    

    Demo: https://regex101.com/r/0tfwDu/1/
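
To tie the points above together, here is a minimal sketch of the corrected pattern used with preg_match_all. It assumes the HTML looks exactly like the sample in the question; the variable names $pattern and $value are only illustrative. For anything less regular than this snippet, the parser approach is likely the safer choice.

```php
<?php
// Sample input copied from the question.
$myhtml = '<div class="myclass">
    <a href="/to">value to crawl</a>
</div>';

// Corrected pattern: "/" as delimiter, \s* to allow whitespace between
// the tags, and <a.*?> so that the href attribute is accepted.
$pattern = '/<div class="myclass">\s*<a.*?>(.*?)<\/a>\s*<\/div>/';

preg_match_all($pattern, $myhtml, $match);

// $match[1] holds the text captured by the (.*?) group, one entry per match.
foreach ($match[1] as $value) {
    echo $value, PHP_EOL;
}
```

With the sample input, $match[1] contains a single entry, the link text. Note that the pattern depends on the class attribute being written exactly as class="myclass", so any extra class or attribute on the div would stop it from matching.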