Question

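I want to get all the organic search results from Google.

I need help defining an XPath that excludes ads. The <cite> elements for ads have no class attribute, while the organic results use two different class values. My attempts at writing the XPath have failed. The Google search results page looks like this: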

Ad
<cite>example.com</cite> 

Organic Result 1 
<cite class="_Rm">example.com/page1.html</cite> 

Organic Result 2
<cite class="_Rm bc">example.com > Breadcrumbs > Page2</cite>

Here is my code:

$html = new DOMDocument();
// @ suppresses the warnings libxml emits for malformed real-world HTML
@$html->loadHTMLFile('http://www.google.com/search?q=mortgage&num=100');
$xpath = new DOMXPath($html);
$nodes = $xpath->query('//cite'); // matches every <cite>, ads included

foreach ($nodes as $n) {
    echo $n->nodeValue . '<br />'; // Show all links
}

Please help.

Answer 1

Try //cite[@class='_Rm' or @class='_Rm bc']. This selects the cite nodes whose class attribute is exactly _Rm or exactly _Rm bc.
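To make the exact-match behavior concrete, here is a minimal sketch run against sample markup modeled on the snippet in the question (it is not real Google output, and the class names may differ on a live page). The first expression is the one from this answer. The second, using contains(concat(...)), is a common XPath 1.0 idiom added here as an assumption; it matches _Rm as a whole class token, so it may hold up better if the tokens appear in another order or alongside extra classes.

```php
<?php
// Sample markup modeled on the question's snippet (an assumption, not real Google output).
$sample = <<<HTML
<html><body>
<cite>example.com</cite>
<cite class="_Rm">example.com/page1.html</cite>
<cite class="_Rm bc">example.com &gt; Breadcrumbs &gt; Page2</cite>
</body></html>
HTML;

$doc = new DOMDocument();
// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Exact string match on the whole class attribute, as in the answer.
$exact = $xpath->query("//cite[@class='_Rm' or @class='_Rm bc']");

// Token match: pad the normalized class list with spaces and look for ' _Rm '.
$token = $xpath->query(
    "//cite[contains(concat(' ', normalize-space(@class), ' '), ' _Rm ')]"
);

foreach ($exact as $n) {
    echo $n->nodeValue, PHP_EOL;
}
```

On this sample both queries skip the class-less ad <cite> and return the same two organic nodes. They would diverge only if the class attribute contained something other than the two exact strings.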

Answer 2

Assuming the part of the HTML you want is not generated by client-side script (typically JavaScript), the following simple XPath will do the job:

$nodes = $xpath->query('//cite[@class]');

The XPath above selects every <cite> element that has a class attribute, whatever its value.
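As a rough illustration of the query above, the sketch below wraps it in a small helper and runs it on sample markup taken from the question. The function name citeTextsWithClass and the inline sample string are hypothetical, introduced only for this example. A real page would be loaded as in the question's code.

```php
<?php
// Hypothetical helper: returns the text of every <cite> that carries a class attribute.
function citeTextsWithClass(string $markup): array
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query('//cite[@class]') as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

// Sample markup modeled on the question (an assumption, not real Google output).
$sample = '<html><body>'
    . '<cite>example.com</cite>'
    . '<cite class="_Rm">example.com/page1.html</cite>'
    . '<cite class="_Rm bc">example.com &gt; Breadcrumbs &gt; Page2</cite>'
    . '</body></html>';

$organic = citeTextsWithClass($sample);
print_r($organic);
```

Because the predicate only tests for the presence of the attribute, it keeps working if the class values change, but it would also pick up an ad <cite> if one ever gained a class attribute.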

Otherwise, you need to find a way to run the client-side script so that the HTML is fully generated before you apply the XPath query above.

XPath -> select elements that have a class attribute
