I want to extract the text "Catholic Blended Margaritas" that appears in the portion of the HTML page pasted below.
I used the following XPath expression:
String xpath = "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()";
I passed it to HtmlCleaner; here is the relevant part of my code:
//use the cleaner to "clean" the HTML and return it as a TagNode object i.e. HTML page root node
TagNode rootNode = htmlCleaner.clean(new InputStreamReader(conn.getInputStream()));
// query XPath
Object[] nodes = rootNode.evaluateXPath(xpath);
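But the expression above returns zero nodes.
In fact, I want the text of all such nodes; I have pasted only part of the HTML. The link to the HTML page, for your reference: http://www.foodfood.com/category/recipes/by-course/beverages/
Part of the HTML from the above link follows: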
<div class="recipeBox ">
<a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas">
<div class="pic">
<img width="230" height="150" src="http://www.foodfood.com/wp-content/uploads/2012/07/230x150xCatholic-Blended-Margaritas-230x150.jpg.pagespeed.ic.p_7Vr37LwJ.jpg" class="post_img_thumb wp-post-image" alt="Catholic-Blended-Margaritas" title="Catholic-Blended-Margaritas"/> </div>
<div class="detailBox">
<h3>Catholic Blended Margaritas</h3>
<p><p>Blended Margaritas is a delicious drink which can be enjoyed on any festive</p>
</p>
<div class="timer">5 Mins</div>
<a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/?comments=1#comments_det"><span class="comments">No Comments</span> </a>
</div>
</a>
</div>
Note that the "Catholic Blended Margaritas" text (the one I want) is nested inside two <div> tags, which is what is causing me problems.
Answer 0 (score: 0)
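I see 2 issues with //div[@class='recipeBox']/div[@class='detailBox']/h3/text() when it is applied to your sample page:
1. The class attribute value has a trailing space, <div class="recipeBox ">, so the exact comparison @class='recipeBox' does not match.
2. div[@class='detailBox'] is not a direct child of the recipeBox div; it sits inside the <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas"> link, so the single / step does not reach it.
So I suggest you try //div[normalize-space(@class)='recipeBox']//div[@class='detailBox']/h3/text()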
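To check the two expressions side by side, here is a minimal sketch that runs them with the JDK's standard javax.xml.xpath API against a trimmed, well-formed copy of the markup from the question. It does not use HtmlCleaner; the class name RecipeXPathDemo, the helper select, and the shortened SAMPLE string are my own assumptions for illustration only.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the asker's code.
class RecipeXPathDemo {
    // Trimmed copy of the question's markup; note the trailing space in class="recipeBox ".
    static final String SAMPLE =
        "<div class=\"recipeBox \">"
      + "<a href=\"http://www.foodfood.com/recipes/catholic-blended-margaritas/\">"
      + "<div class=\"detailBox\">"
      + "<h3>Catholic Blended Margaritas</h3>"
      + "<div class=\"timer\">5 Mins</div>"
      + "</div></a></div>";

    // Expression from the question: exact class match, direct-child step.
    static final String ORIGINAL =
        "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()";

    // Suggested expression: whitespace-normalized class, descendant step.
    static final String FIXED =
        "//div[normalize-space(@class)='recipeBox']//div[@class='detailBox']/h3/text()";

    // Parses the XML string and returns the nodes selected by the expression.
    static NodeList select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            InputSource source = new InputSource(new StringReader(xml));
            return (NodeList) xpath.evaluate(expression, source, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The original expression selects nothing on this markup.
        System.out.println("original matches: " + select(SAMPLE, ORIGINAL).getLength());

        // The suggested expression selects the h3 text node(s).
        NodeList hits = select(SAMPLE, FIXED);
        for (int i = 0; i < hits.getLength(); i++) {
            System.out.println(hits.item(i).getNodeValue());
        }
    }
}
```

One caveat I am not certain about: HtmlCleaner's built-in evaluateXPath is documented as supporting only a subset of XPath, so normalize-space() may not be available there. If it is not, converting the TagNode to a W3C Document with HtmlCleaner's DomSerializer and querying that with javax.xml.xpath, as in the sketch, is one option.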