How to scrape content inside a div using DOMDocument

Asked: 2018-10-02 01:41:32

Tags: php html xpath web-scraping domdocument

I'm trying to fetch a div from several pages and process it in a foreach loop, but I'm running into trouble. This is the markup:

<div class="article_info">
    <ul class="c-result_box">
     <li>
      <div class="inner cf">
       <div class="c-header">
         <div class="c-logo"> 
           <img src="/e/designs/31sumai/common/img/logo_08.png" alt="#"> 
             </div>
               <p class="c-supplier">三井のマンション</p>
                    <p class="c-name">
                        <a href="https://www.31sumai.com/mfr/K1503/" class="link" target="_blank">パークリュクス大阪天満</a>
                    </p>

I'm trying to get the text inside the <a> element. This is my code; what am I missing here?

$start_id = 1501;
while(true){

    $url = 'https://www.31sumai.com/mfr/K'.$start_id.'/outline.html';
    $html = file_get_contents($url);
    libxml_use_internal_errors(true);
    $DOMParser = new \DOMDocument();
    $DOMParser->loadHTML($html);
    $xpath = new \DOMXPath($DOMParser);

    $classname="c-name";
    $nodes = $finder->query("//*[contains(@class, '$classname')]");
    $MyTable = false; 
    $insertData = [];  
    foreach($nodes as $node){
        $allNames = [];
        foreach($node->getElementsByTagName('a') as $a){
            $name = $a->getElementsByTagName('a');
            $allProperties[] = [
                'names' => $name];
        }

    }

Thanks for your help!

1 answer:

Answer 0 (score: 0)

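Two things go wrong in the question's code: the query is run on $finder, which is never defined (the DOMXPath object is $xpath), and $a->getElementsByTagName('a') returns a node list rather than the link text. You can instead rely on the XPath query to select all the text nodes you need, and then read the nodeValue property in a loop: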

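$start_id = "1501";
$url = "https://www.31sumai.com/mfr/K$start_id/outline.html";

// Download the page; file_get_contents() returns false on failure.
$html = file_get_contents($url);
if ($html === false) {
    die("Could not fetch $url");
}

// Collect libxml warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$xpath = new \DOMXPath($DOMParser);

$classname = "c-name";

// Select the text nodes of every <a> directly inside an element whose class contains "c-name".
$nodes = $xpath->query("//*[contains(@class, '$classname')]/a/text()");
foreach ($nodes as $node) {
    echo $node->nodeValue, PHP_EOL;
}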
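
To try the query without hitting the site, the same XPath can be run against a trimmed copy of the markup from the question. This is only a sketch: the helper name extractNames is hypothetical, the sample HTML is a shortened and closed version of the question's snippet, and the '<?xml encoding="UTF-8">' prefix is a common workaround (assumed here, not part of the answer above) to make loadHTML read the Japanese text as UTF-8.

```php
<?php
// Hypothetical helper (not from the answer above): returns the link text of
// every <a> that is a direct child of an element whose class contains "c-name".
function extractNames(string $html): array
{
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Assumption: the input is UTF-8. Prefixing an XML encoding declaration
    // is a widely used workaround so loadHTML does not assume ISO-8859-1.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $names = [];
    foreach ($xpath->query("//*[contains(@class, 'c-name')]/a") as $a) {
        // textContent is the concatenated text of the element.
        $names[] = trim($a->textContent);
    }
    return $names;
}

// Sample markup shortened from the question (closing tags added).
$html = <<<'HTML'
<div class="article_info">
  <ul class="c-result_box">
    <li>
      <p class="c-supplier">三井のマンション</p>
      <p class="c-name">
        <a href="https://www.31sumai.com/mfr/K1503/" class="link" target="_blank">パークリュクス大阪天満</a>
      </p>
    </li>
  </ul>
</div>
HTML;

$names = extractNames($html);
print_r($names);
```

Returning an array rather than echoing makes it straightforward to call the helper once per page inside a loop over IDs and merge the results, which is closer to what the question was attempting with its foreach.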