Question

I want to extract the text "Catholic Blended Margaritas" from the portion of an HTML page pasted below.

I used the following XPath expression:

xPath = "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()";

I passed it to HtmlCleaner; here is the relevant part of my code:

    // Use the cleaner to "clean" the HTML and return it as a TagNode, i.e. the root node of the page
    TagNode rootNode = htmlCleaner.clean(new InputStreamReader(conn.getInputStream()));

    // Query with XPath
    Object[] nodes = rootNode.evaluateXPath(xPath);

But the expression above returns zero nodes.

In fact, I want the text of all such nodes; I have pasted only part of the HTML here. The link to the full page, for your reference: http://www.foodfood.com/category/recipes/by-course/beverages/

Part of the HTML from the link above:

<div class="recipeBox ">
        <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas">
            <div class="pic">
                <img width="230" height="150" src="http://www.foodfood.com/wp-content/uploads/2012/07/230x150xCatholic-Blended-Margaritas-230x150.jpg.pagespeed.ic.p_7Vr37LwJ.jpg" class="post_img_thumb wp-post-image" alt="Catholic-Blended-Margaritas" title="Catholic-Blended-Margaritas"/>             </div>
            <div class="detailBox">
                <h3>Catholic Blended Margaritas</h3>
                <p><p>Blended Margaritas is a delicious drink which can be enjoyed on any festive</p>
</p>
                <div class="timer">5 Mins</div>
                <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/?comments=1#comments_det"><span class="comments">No Comments</span> </a>
            </div>
        </a>
    </div>

Note that the text "Catholic Blended Margaritas" (the text I want) is nested inside two <div> tags, and that is what is causing me trouble.

Answer 1

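I see 2 problems with //div[@class='recipeBox']/div[@class='detailBox']/h3/text() against your sample page:

1. <div class="recipeBox "> has a trailing space in its class value, so the exact comparison @class='recipeBox' does not match it.
2. The target element is nested inside the <a href="http://www.foodfood.com/recipes/catholic-blended-margaritas/" rel="bookmark" title="Permanent Link to Catholic Blended Margaritas"> link, so the detailBox <div> is not a direct child of the recipeBox <div> and the single-slash /div step does not reach it.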

So I suggest you try //div[normalize-space(@class)='recipeBox']//div[@class='detailBox']/h3/text()
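
To see both problems in isolation, here is a minimal sketch that evaluates the expressions with the XPath engine bundled in the JDK (javax.xml.xpath) instead of HtmlCleaner. The class name RecipeTitleXPathDemo, the select helper, and the trimmed-down, well-formed SAMPLE markup are assumptions made for illustration; they are not from the question. Whether HtmlCleaner's own XPath implementation supports normalize-space() is not checked here, so treat this as a demonstration of the XPath semantics only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RecipeTitleXPathDemo {

    // Hypothetical, simplified stand-in for the recipe markup in the question.
    // It keeps the two relevant details: the trailing space in class="recipeBox "
    // and the <a> element sitting between the two <div> elements.
    static final String SAMPLE =
        "<html><body>"
        + "<div class=\"recipeBox \">"
        + "<a href=\"http://www.foodfood.com/recipes/catholic-blended-margaritas/\">"
        + "<div class=\"detailBox\">"
        + "<h3>Catholic Blended Margaritas</h3>"
        + "</div></a></div>"
        + "</body></html>";

    // Evaluates an XPath expression against SAMPLE and returns the value of
    // every matched node (for text() steps, the text itself).
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions so callers stay simple.
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Expression from the question: exact class match plus direct-child step.
        System.out.println(select(
            "//div[@class='recipeBox']/div[@class='detailBox']/h3/text()"));
        // prints []

        // Class comparison fixed, but still a direct-child step: the <a> blocks it.
        System.out.println(select(
            "//div[normalize-space(@class)='recipeBox']/div[@class='detailBox']/h3/text()"));
        // prints []

        // Both fixes: normalize-space() on the class and the descendant axis (//).
        System.out.println(select(
            "//div[normalize-space(@class)='recipeBox']//div[@class='detailBox']/h3/text()"));
        // prints [Catholic Blended Margaritas]
    }
}
```

The middle case shows that fixing the class comparison alone is not enough: the descendant step is also needed to get past the intervening <a> element.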
