How to retrieve the text inside an <h3> tag nested within <div> tags using XPath with HtmlCleaner for Android

Time: 2014-01-22 05:53:21

Tags: android xpath htmlcleaner

I want to extract the text "Catholic Blended Margaritas" from the portion of the HTML page pasted below.

I used the following XPath expression:

xPath = "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()";

I passed it to HtmlCleaner; here is the relevant part of my code:

// Use the cleaner to "clean" the HTML and return it as a TagNode object, i.e. the root node of the HTML page
TagNode rootNode = htmlCleaner.clean(new InputStreamReader(conn.getInputStream()));

// Query the cleaned tree with the XPath expression defined above
Object[] nodes = rootNode.evaluateXPath(xPath);

But the expression above returns zero nodes.

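The relevant part of the HTML is pasted below. In fact, I want the text of all such nodes; I have pasted only one part of the HTML. Link to the HTML page for reference: http://www.foodfood.com/category/recipes/by-course/beverages/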

Part of the HTML from the link above:

<div class="recipeBox ">
        <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas">
            <div class="pic">
                <img width="230" height="150" src="http://www.foodfood.com/wp-content/uploads/2012/07/230x150xCatholic-Blended-Margaritas-230x150.jpg.pagespeed.ic.p_7Vr37LwJ.jpg" class="post_img_thumb wp-post-image" alt="Catholic-Blended-Margaritas" title="Catholic-Blended-Margaritas"/>             </div>
            <div class="detailBox">
                <h3>Catholic Blended Margaritas</h3>
                <p><p>Blended Margaritas is a delicious drink which can be enjoyed on any festive</p>
</p>
                <div class="timer">5 Mins</div>
                <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/?comments=1#comments_det"><span class="comments">No Comments</span> </a>
            </div>
        </a>
    </div>

Note that the "Catholic Blended Margaritas" text (the one I want) is nested inside two <div> tags, which is what is causing me problems.

1 answer:

Answer 0: (score: 0)

I see 2 problems with //div[@class='recipeBox']/div[@class='detailBox']/h3/text() on your sample page:

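  • the trailing space in the "class" attribute of <div class="recipeBox ">, so an exact comparison with 'recipeBox' does not match
  • the target element is nested inside the <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas"> link, so the detailBox <div> is a descendant, not a direct child, of the recipeBox <div>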

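So I suggest you try //div[normalize-space(@class)='recipeBox']//div[@class='detailBox']/h3/text(), which trims the whitespace around the class value and uses // to reach the detailBox <div> at any depth.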
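
To check the two expressions independently of HtmlCleaner, here is a minimal sketch that evaluates them with the JDK's own javax.xml.xpath API. It assumes a hand-written, well-formed reduction of the snippet from the question (the SAMPLE constant, the XPathCheck class, and the select helper are illustrative names, not part of the original code). HtmlCleaner's built-in evaluator covers only part of XPath, so it is worth confirming separately that it accepts normalize-space().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathCheck {
    // Assumption: a reduced, well-formed version of the HTML from the question.
    // It keeps the trailing space in class="recipeBox " and the wrapping <a> element.
    static final String SAMPLE =
        "<html><body>"
        + "<div class=\"recipeBox \">"
        + "<a href=\"http://www.foodfood.com/recipes/catholic-blended-margaritas/\">"
        + "<div class=\"pic\"/>"
        + "<div class=\"detailBox\"><h3>Catholic Blended Margaritas</h3></div>"
        + "</a></div>"
        + "</body></html>";

    // Parses the XML string and returns the value of every node the expression selects.
    static List<String> select(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String original =
            "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()";
        String suggested =
            "//div[normalize-space(@class)='recipeBox']//div[@class='detailBox']/h3/text()";

        System.out.println(select(SAMPLE, original));  // prints []
        System.out.println(select(SAMPLE, suggested)); // prints [Catholic Blended Margaritas]
    }
}
```

The first expression matches nothing because of both the trailing space and the intermediate <a> element; the second one returns the heading text. On the real page the same expression should return one text node per recipe box, assuming every box follows the structure shown in the question.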