Extracting data with jsoup

Time: 2014-12-24 15:20:05

Tags: java jsoup

I am using jsoup to extract information from a web page. My code looks like this:

doc = Jsoup.connect(myurl).get();
Elements newsHeadlines = doc.select(".myclass");

If I call System.out.println on newsHeadlines, I get this:

<span class="cmtComentario">
<span class="blaicon"></span>
<span class="blacoment"><span class="cmtHora" data-hora=""></span>
<span class="blathing" data-minutoPartido="93'"></span>
<span class="blado"></span>
<span class="blahave">
Oh yeah!<br/></span>
</span>
</span>
<span class="cmtComentario">
<span class="blaicon"></span>
<span class="blacoment"><span class="cmtHora" data-hora=""></span>
<span class="blathing" data-health="97'"></span>
<span class="blado"></span>
<span class="blahave">
This is my world</span>
</span>
</span>

How can I save each block like the following as a separate entry in an array:

<span class="cmtComentario">
    <span class="blaicon"></span>
    <span class="blacoment"><span class="cmtHora" data-hora=""></span>
    <span class="blathing" data-health="92'"></span>
    <span class="blado"></span>
    <span class="blahave">
    This is my world</span>
    </span>
    </span>

Thanks a lot.

2 answers:

Answer 0 (score: 1)

newsHeadlines is simply a list of Element objects: Elements implements List<Element>.

So you can iterate over newsHeadlines the same way you would iterate over any list:

for(Element element : newsHeadlines) {
    System.out.println(element.toString());
}

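If that is not what you need (I have not tested the code), you can try Element.children(). That again gives you an Elements collection that you can iterate over.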
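To get from that loop to an actual array, one option is to collect the outer HTML of each matched element. This is only a minimal sketch: the class name CommentBlocks, the helper blocksAsArray and the shortened sample markup are made up for illustration, and it assumes the blocks can be selected with ".cmtComentario" (the question uses ".myclass" as its selector).

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CommentBlocks {

    // Hypothetical helper: returns the outer HTML of every element that
    // matches the selector, one array entry per matched block.
    static String[] blocksAsArray(String html, String cssSelector) {
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(cssSelector);
        List<String> blocks = new ArrayList<>();
        for (Element match : matches) {
            // outerHtml() includes the matched tag itself, not only its children
            blocks.add(match.outerHtml());
        }
        return blocks.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Shortened stand-in for the markup in the question (assumption)
        String html = "<span class=\"cmtComentario\"><span class=\"blahave\">Oh yeah!</span></span>"
                + "<span class=\"cmtComentario\"><span class=\"blahave\">This is my world</span></span>";
        for (String block : blocksAsArray(html, ".cmtComentario")) {
            System.out.println(block);
            System.out.println("----");
        }
    }
}
```

With a live page you would pass Jsoup.connect(myurl).get().html() or work on the Document directly instead of parsing a string. If you would rather keep the Element objects than their HTML, collect them into a List<Element> or an Element[] in the same loop.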

Answer 1 (score: 0)

You could also wrap each comment in a div tag and use some Java 8 syntactic sugar to collect the Element instances into a List:

    Elements elements = Jsoup.parse(markup).getAllElements().select(".myclass");
    List<Element> comments = elements.stream().collect(Collectors.<Element>toList());
    for(Element comment : comments)  {
        System.out.println(comment.html());
    }

For testing I used the parse method rather than connect.

This prints:

<span class="cmtComentario"> <span class="blaicon">1</span>.......
<span class="cmtComentario"> <span class="blaicon">2</span>........

Test markup:

String markup = "" +
        "<div class=\"myclass\">\n" +
            "<span class=\"cmtComentario\">\n" +
            "<span class=\"blaicon\">1</span>\n" +
            "<span class=\"blacoment\"><span class=\"cmtHora\" data-hora=\"\"></span>\n" +
            "<span class=\"blathing\" data-minutoPartido=\"93'\"></span>\n" +
            "<span class=\"blado\"></span>\n" +
            "<span class=\"blahave\">\n" +
            "Oh yeah!<br/></span>\n" +
            "</span>\n" +
            "</span>\n" +
        "</div>" +
        "<div class=\"myclass\">\n" +
            "<span class=\"cmtComentario\">\n" +
            "<span class=\"blaicon\">2</span>\n" +
            "<span class=\"blacoment\"><span class=\"cmtHora\" data-hora=\"\"></span>\n" +
            "<span class=\"blathing\" data-health=\"97'\"></span>\n" +
            "<span class=\"blado\"></span>\n" +
            "<span class=\"blahave\">\n" +
            "This is my world</span>\n" +
            "</span>\n" +
            "</span>" +
        "</div>";

Hope it helps!
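As a follow-up to the stream approach in this answer: since Elements is already a List<Element>, the stream is most useful when each block is mapped to something else, for example the comment text. The sketch below assumes the text sits in a child with class "blahave", as in the test markup. The class name CommentTexts and the helper commentTexts are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class CommentTexts {

    // Hypothetical helper: maps each matched block to the visible text
    // of its descendant(s) with class "blahave".
    static List<String> commentTexts(String html, String blockSelector) {
        Elements blocks = Jsoup.parse(html).select(blockSelector);
        return blocks.stream()
                .map(block -> block.select(".blahave").text())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Shortened stand-in for the test markup (assumption)
        String html = "<div class=\"myclass\"><span class=\"blahave\">Oh yeah!</span></div>"
                + "<div class=\"myclass\"><span class=\"blahave\">This is my world</span></div>";
        commentTexts(html, ".myclass").forEach(System.out::println);
    }
}
```

The same pattern works for attributes by swapping text() for attr(...) on the selected child.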