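There is a section of an amazon.com page from which I want to extract the data for each item (only the node values, not the links).
The values I am looking for are inside <span class="narrowValue">: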
<ul data-typeid="n" id="ref_1000">
<li style="margin-left: -18px">
<a href="/s/ref=sr_ex_n_0?rh=i%3Aaps%2Ck%3Ahow+to+grow+tomatoes&sort=salesrank&keywords=how+to+grow+tomatoes&ie=UTF8&qid=1327603358">
<span class="expand">Any Department</span>
</a>
</li>
<li style="margin-left: 8px">
<strong>Books</strong>
</li>
<li style="margin-left: 6px">
<a href="/s/ref=sr_nr_n_0?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A48&bbn=1000&sort=salesrank&keywords=how+to+grow+tomatoes&ie=UTF8&qid=1327603358&rnid=1000">
<span class="refinementLink">Crafts, Hobbies & Home</span><span class="narrowValue">(19)</span>
</a>
</li>
<li style="margin-left: 6px">
<a href="/s/ref=sr_nr_n_1?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A10&bbn=1000&sort=salesrank&keywords=how+to+grow+tomatoes&ie=UTF8&qid=1327603358&rnid=1000">
<span class="refinementLink">Health, Fitness & Dieting</span><span class="narrowValue">(3)</span>
</a>
</li>
<li style="margin-left: 6px">
<a href="/s/ref=sr_nr_n_2?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A6&bbn=1000&sort=salesrank&keywords=how+to+grow+tomatoes&ie=UTF8&qid=1327603358&rnid=1000">
<span class="refinementLink">Cookbooks, Food & Wine</span><span class="narrowValue">(2)</span>
</a>
</li>
</ul>
How can I do this with XPath?
The markup comes from the linked page, amazon kindle search.
This is what I am currently trying:
$rank=array();
$words = $xpath->query('//ul[@id="ref_1000"]/li/a/span[@class="refinementLink"]');
foreach ($words as $word) {
    $rank[]=(trim($word->nodeValue));
}
var_dump($rank);
Answer 0 (score: 2)
The following expression should work:
//*[@id='ref_1000']/li/a/span[@class='narrowValue']
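For better performance you could put a direct path at the start of this expression, but the expression as given is more flexible (given that you will probably need it to work across multiple pages).
Also note that your HTML parser may produce a different result tree from the one generated by Firebug (which is where I tested). Here is an even more flexible solution: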
//*[@id='ref_1000']//span[@class='narrowValue']
Flexibility comes at a potential cost in performance (and accuracy), but it is often the only option when dealing with tag soup.
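As a minimal sketch of how the more flexible expression could be run from PHP with DOMDocument and DOMXPath: the `$html` string below is a trimmed, hand-made copy of the markup from the question (an assumption; the real page would be fetched separately), and the variable names are only illustrative.

```php
<?php
// Trimmed copy of the question's markup (assumption: the real page is fetched elsewhere).
$html = <<<'HTML'
<ul data-typeid="n" id="ref_1000">
<li><strong>Books</strong></li>
<li><a href="#"><span class="refinementLink">Crafts, Hobbies &amp; Home</span><span class="narrowValue">(19)</span></a></li>
<li><a href="#"><span class="refinementLink">Health, Fitness &amp; Dieting</span><span class="narrowValue">(3)</span></a></li>
<li><a href="#"><span class="refinementLink">Cookbooks, Food &amp; Wine</span><span class="narrowValue">(2)</span></a></li>
</ul>
HTML;

libxml_use_internal_errors(true);   // keep warnings about imperfect markup quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$counts = array();
foreach ($xpath->query("//*[@id='ref_1000']//span[@class='narrowValue']") as $node) {
    // nodeValue is e.g. "(19)"; strip parentheses and whitespace, then cast to int
    $counts[] = (int) trim($node->nodeValue, "() \t\n\r");
}
var_dump($counts);
```

With this sample markup the three counts should come out as 19, 3 and 2. On a real page the `id` value and class names may differ, so treat both as things to verify against the parsed tree.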
Answer 1 (score: 2)
If you need to pull out the category names:
// Suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html); // $html - string fetched by CURL
$xml = simplexml_import_dom($doc);
// Find the category nodes
$categories = $xml->xpath("//span[@class='refinementLink']");
To get both the name and the count, using DOMDocument and DOMXPath instead:
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Select the parent node
$categories = $xpath->query("//span[@class='refinementLink']/..");
foreach ($categories as $category) {
    echo '<pre>';
    echo $category->childNodes->item(1)->firstChild->nodeValue;
    echo $category->childNodes->item(2)->firstChild->nodeValue;
    echo '</pre>';
    // Crafts, Hobbies & Home (19)
}
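The `childNodes->item(1)` and `item(2)` indexes above depend on where the parser places whitespace text nodes inside each link. As a hedged alternative sketch, the name and count can be looked up by class relative to each `<a>` element with `DOMXPath::evaluate()`. The `$html` sample and the `$result` array are illustrative assumptions, not part of the original page.

```php
<?php
// Small sample with and without whitespace inside <a> (assumption: hand-made test markup).
$html = <<<'HTML'
<ul id="ref_1000">
<li><a href="#">
<span class="refinementLink">Crafts, Hobbies &amp; Home</span><span class="narrowValue">(19)</span>
</a></li>
<li><a href="#"><span class="refinementLink">Cookbooks, Food &amp; Wine</span><span class="narrowValue">(2)</span></a></li>
</ul>
HTML;

libxml_use_internal_errors(true);   // suppress invalid markup warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$result = array();
// Select every link that has a refinementLink span as a direct child
foreach ($xpath->query("//a[span[@class='refinementLink']]") as $link) {
    // string(...) yields the text of the first matching node, or "" if there is none
    $name  = trim($xpath->evaluate("string(span[@class='refinementLink'])", $link));
    $count = trim($xpath->evaluate("string(span[@class='narrowValue'])", $link));
    $result[$name] = $count;
}
print_r($result);
```

Because the lookups are by class rather than by position, both list items are handled the same way regardless of the line breaks inside the first link.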
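Answer 2 (score: -2)
I strongly recommend you take a look at the phpQuery library. It is essentially the jQuery selector engine for PHP, so to get the text you want you could do something like: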
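// Load the markup first; $html is assumed to hold the fetched page
phpQuery::newDocumentHTML($html);
foreach (pq('span.refinementLink') as $p) {
    // Iteration yields plain DOM elements, so wrap each one in pq() before calling text()
    print pq($p)->text() . "\n";
}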
That should output something like:
Crafts, Hobbies & Home
Health, Fitness & Dieting
Cookbooks, Food & Wine
It is by far the easiest way I know of to do screen scraping and DOM parsing in PHP.