Question

I am trying to use a PHP bot to extract links from an external website. The links are inside a td element like this:

<td class=" title-col"> <a href="http://examplenews101.com/post1">News 1</a> </td>

Note that there is a space before "title-col".

Below is the script I'm using; it fails to extract the links.

function crawl_page($url, $depth = 5) {
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    // also tried true, but no change in results
    $dom->preserveWhiteSpace = false;
    @$dom->loadHTMLFile($url);
    $xpath = new DOMXpath($dom);
    $td = $xpath->query('//td[contains(concat(" ", normalize-space(@class), " "), "title-col")]');
    // also tried this, but it is not working
    //$td = $xpath->query('//td[contains(@class,"title-col")]');

    // I only get values when I use this
    //$td = $dom->getElementsByTagName('td');

    foreach ($td as $t) {
        $anchors = $t->getElementsByTagName('a');
        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
            if (0 !== strpos($href, 'http')) {
                // relative link: rebuild an absolute URL from the current page
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
    }

    echo "URL:" . $url . "<br/>";
}

I only get values when I use this:

$td = $dom->getElementsByTagName('td');

But I need to query by class.

Thanks

Answer 1

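I found that this was caused by the class attribute being generated by JavaScript. DOMDocument only parses the HTML the server sends and never runs scripts, so the class is not there when the XPath query runs.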
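One way to check for this kind of problem is to run the same XPath against the markup PHP actually receives and against what the browser inspector shows. The following is only a minimal sketch: the helper count_title_cells() and the two HTML strings are made up for illustration, and it assumes the page adds the class with a small inline script.

```php
<?php
// Hypothetical helper: count <td> elements whose class attribute
// contains "title-col" in the given HTML string.
function count_title_cells(string $html): int
{
    $dom = new DOMDocument();
    // Suppress warnings from imperfect real-world markup.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    return $xpath->query('//td[contains(@class, "title-col")]')->length;
}

// Assumed example of what the server sends: the class is added later
// by a script, which DOMDocument never executes.
$served = '<html><body><table><tr>'
        . '<td><a href="http://examplenews101.com/post1">News 1</a></td>'
        . '</tr></table>'
        . '<script>document.querySelector("td").className = " title-col";</script>'
        . '</body></html>';

// What the browser inspector shows after the script has run.
$rendered = '<table><tr>'
          . '<td class=" title-col"><a href="http://examplenews101.com/post1">News 1</a></td>'
          . '</tr></table>';

echo count_title_cells($served), "\n";   // 0: the class is not in the served HTML
echo count_title_cells($rendered), "\n"; // 1: the class is present, leading space and all
```

On a real site, the browser's "View Source" (not the element inspector) or the raw output of file_get_contents($url) should show whether "title-col" is in the served HTML at all. If it is not, the query has nothing to match and getElementsByTagName('td') is the only thing that returns nodes.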

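A space in the class name does not by itself stop a DOMXpath query from working: contains(@class, "title-col") still matches class=" title-col", as long as the attribute is present in the HTML that PHP downloads.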
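For querying by class once the attribute really is in the HTML, a whole-token match may be safer than a plain substring match. Note that the query in the question pads the attribute with spaces but searches for "title-col" without them, so the padding has no effect there. Below is a rough sketch, not a definitive implementation: class_token_query() is a hypothetical helper, and the second cell with class "title-col-wide" and its URL are invented to show the difference.

```php
<?php
// Hypothetical helper: build an XPath that matches a whole class token,
// regardless of leading, trailing, or repeated whitespace in the attribute.
function class_token_query(string $tag, string $class): string
{
    return sprintf(
        '//%s[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $tag,
        $class
    );
}

// Assumed sample markup: the first cell is from the question,
// the second is invented to show that similar class names are not matched.
$html = '<table><tr>'
      . '<td class=" title-col"><a href="http://examplenews101.com/post1">News 1</a></td>'
      . '<td class="title-col-wide"><a href="http://examplenews101.com/other">Other</a></td>'
      . '</tr></table>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect the href of every link inside a matching cell.
$links = [];
foreach ($xpath->query(class_token_query('td', 'title-col')) as $cell) {
    foreach ($cell->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
}

print_r($links); // only http://examplenews101.com/post1
```

The padded search string " title-col " is what keeps "title-col-wide" out of the results; a plain contains(@class, "title-col") would match both cells.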
