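Extracting information from a web page with Jsoup

Time: 2011-03-25 15:04:34

Tags: html jsoup

I want to use Jsoup to extract review and rating information from a buy.com page. The problem is that I can't figure out how to do it, because each review's element ids differ according to the review's number. For example, review number 11 looks like this: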

<a id="CustomerReviews_customerReviews_ctl11_reviewIdAnchor" name="a352496">&nbsp;</a><br />

<span id="CustomerReviews_customerReviews_ctl11_ratingInfo"><span class="blueText"><b>5</b> of <b>5</b></span> <b>Great Product</b> 12/15/2010<br /></span>

<span id="CustomerReviews_customerReviews_ctl11_reviewerInfo"><b>A customer</b> from x<br></span>

<span id="CustomerReviews_customerReviews_ctl11_reviewContent">content</span>

Review number 12 would have ctl12 in its ids instead. How can I extract the review content and rating for every review on the page?

1 answer:

Answer 0 (score: 1)

I'm a bit late, but I hope this helps you and anyone else who runs into the same problem!

You should try something like this:

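import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Two sample rating blocks whose ids differ only in the ctlNN part.
String code1 = "<span id=\"CustomerReviews_customerReviews_ctl11_ratingInfo\"><span class=\"blueText\"><b>1</b> of <b>5</b></span> <b>Great Product</b> 12/15/2010<br /></span>";
String code2 = "<span id=\"CustomerReviews_customerReviews_ctl12_ratingInfo\"><span class=\"blueText\"><b>2</b> of <b>5</b></span> <b>Bad product</b> 12/03/2010<br /></span>";

Document document = Jsoup.parse(code1 + code2);

// The regex attribute selector matches every ratingInfo span, whatever its number.
Elements elements = document.select("span[id~=CustomerReviews_customerReviews_ctl.*_ratingInfo]");

for (Element element : elements) {
    System.out.println(element.outerHtml());

    // The two <b> tags inside the nested span hold the score and the maximum.
    Elements spanBlueText = element.select("span > span > b");
    String note = spanBlueText.get(0).text();
    String max = spanBlueText.get(1).text();
    System.out.println("    - note: " + note + "/" + max);

    // The <b> that is a direct child of the outer span holds the review title.
    String comment = element.select("> b").text();
    System.out.println("    - comment: " + comment);

    // The date is the last 10 characters of the span's text (MM/dd/yyyy).
    String date = element.text();
    date = date.substring(date.length() - 10);
    System.out.println("    - date: " + date);
}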
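The snippet above only reads the ratingInfo spans, while the question also asks for the review content. One possible way to pair the two, sketched below, is to take each rating span's id, swap the `_ratingInfo` suffix for `_reviewContent`, and look that element up by id. The class name `ReviewExtractor`, the method `extractReviews`, and the `"score/max: content"` output format are illustrative choices, not part of the original answer, and the sketch assumes the real page follows the id scheme shown in the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ReviewExtractor {

    // Suffixes taken from the ids shown in the question.
    static final String RATING_SUFFIX = "_ratingInfo";
    static final String CONTENT_SUFFIX = "_reviewContent";

    // Hypothetical helper: returns one "score/max: content" string per review.
    static List<String> extractReviews(String html) {
        Document doc = Jsoup.parse(html);
        List<String> reviews = new ArrayList<>();

        // [id$=...] selects elements whose id ends with the given suffix.
        for (Element rating : doc.select("span[id$=" + RATING_SUFFIX + "]")) {
            Elements score = rating.select("span.blueText > b");
            if (score.size() < 2) {
                continue; // unexpected markup, skip rather than throw
            }

            // Derive the sibling id, e.g. ..._ctl11_ratingInfo -> ..._ctl11_reviewContent.
            String id = rating.id();
            String prefix = id.substring(0, id.length() - RATING_SUFFIX.length());
            Element content = doc.getElementById(prefix + CONTENT_SUFFIX);
            String text = (content == null) ? "" : content.text();

            reviews.add(score.get(0).text() + "/" + score.get(1).text() + ": " + text);
        }
        return reviews;
    }

    public static void main(String[] args) {
        String html =
            "<span id=\"CustomerReviews_customerReviews_ctl11_ratingInfo\">"
          + "<span class=\"blueText\"><b>5</b> of <b>5</b></span> <b>Great Product</b> 12/15/2010<br /></span>"
          + "<span id=\"CustomerReviews_customerReviews_ctl11_reviewContent\">content</span>";
        extractReviews(html).forEach(System.out::println);
    }
}
```

Looking the content up by derived id avoids relying on the order of elements in the page, at the cost of depending on the naming scheme staying consistent.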

This example makes heavy use of Jsoup's select method. You can find the correct syntax for its argument in the Jsoup Cookbook.
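As a side note on the `[id~=regex]` selector: to my knowledge Jsoup applies the regex with "find" semantics, so the pattern may match anywhere inside the attribute value. The sketch below uses only `java.util.regex` to show what that means for the pattern in this answer, and compares it with a stricter, anchored variant using `\d+`. The class name `IdPatternDemo` and the third sample id are hypothetical, made up for illustration.

```java
import java.util.regex.Pattern;

public class IdPatternDemo {

    // The pattern used in the answer's selector.
    static final Pattern LOOSE =
        Pattern.compile("CustomerReviews_customerReviews_ctl.*_ratingInfo");

    // A stricter variant: digits only, anchored at both ends.
    static final Pattern STRICT =
        Pattern.compile("^CustomerReviews_customerReviews_ctl\\d+_ratingInfo$");

    // Mirrors "find" semantics: true if the pattern occurs anywhere in the id.
    static boolean matches(Pattern pattern, String id) {
        return pattern.matcher(id).find();
    }

    public static void main(String[] args) {
        String[] ids = {
            "CustomerReviews_customerReviews_ctl11_ratingInfo",
            "CustomerReviews_customerReviews_ctl11_reviewContent",
            "CustomerReviews_customerReviews_ctlXX_ratingInfo_extra" // hypothetical id
        };
        for (String id : ids) {
            System.out.println(id + " loose=" + matches(LOOSE, id)
                + " strict=" + matches(STRICT, id));
        }
    }
}
```

For the ids shown in the question both patterns behave the same; the stricter one only matters if the page contains other ids that happen to embed the same text.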