JSoup: scraping an HTML document by attribute value

Time: 2014-10-22 00:17:48

Tags: java html jsoup

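I want to build a dynamic website and need some photos from the internet. I decided to scrape them from flickr and credit the owners on my site, but I've run into a problem. I'll post part of the HTML below, but if you want to look at the source yourself, here is the site: https://www.flickr.com/explore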

HTML:

<div class="thumb ">            

    <span class="photo_container pc_ju">
        <a data-track="photo-click"  href="/photos/sheilarogers13/15586482942/in/explore-2014-10-20" title="Lake District" class="rapidnofollow photo-click"><img id="photo_img_15586482942" src="https://c2.staticflickr.com/4/3945/15586482942_6a7154363f_z.jpg"width="508" height="339" alt="Lake District" class="pc_img " border="0"><div class="play"></div></a>
    </span>
    <div class="meta">
        <div class="title"><a data-track="photo-click" href="/photos/sheilarogers13/15586482942/in/explore-2014-10-20" title="Lake District" class="title">Lake District</a></div>

        <div class="attribution-block">
            <span class="attribution">
                <span>by </span>
              ******<a data-track="owner" href="/photos/sheilarogers13" title="sheilarogers22" class="owner">sheilarogers22</a>******
            </span>
        </div>

        <span class="inline-icons">

                <a data-track="favorite" href="#" class="rapidnofollow fave-star-inline canfave" title="Add this photo to your favorites?"><img width="12" height="12" alt="[★]" src="https://s.yimg.com/pw/images/spaceball.gif" class="img"><span class="fave-count count">99+</span></a>
            <a title="Comments" href="#" class="rapidnofollow comments-icon comments-inline-btn">
                <img width="12" height="12" alt="Comments" src="https://s.yimg.com/pw/images/spaceball.gif">
                <span class="comment-count count">57</span>
            </a>
            <a href="#" data-track="lightbox" class="rapidnofollow lightbox-inline" title="View in light box"><img width="12" height="12" alt="" src="https://s.yimg.com/pw/images/spaceball.gif"></a>
        </span>
    </div>      
</div>

I want the line marked with asterisks, so that I can credit the author of the picture.

My code:

Elements pgElem = doc.select("div.thumb").select("div.meta").select("[data-track]");

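The code above, however, gives me all 4 data-track elements inside div.meta, and I only want the one whose value is owner.

I checked the JSoup documentation, which says that [attr=value] finds elements with an attribute of that value, but I can't seem to get it to work. I tried: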

.select("[data-track=owner]")

.select("[data-track='owner']")

But neither worked. Any ideas?

1 Answer:

Answer 0: (score: 4)

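Elements pgElem = doc.select("div.thumb").select("div.meta").select("[data-track]");
Elements ownerElements = new Elements();
for (Element element : pgElem) {
    // getElementsByAttributeValueContaining searches the element itself and its
    // descendants, so this keeps the elements whose data-track contains "owner"
    if (!element.getElementsByAttributeValueContaining("data-track", "owner").isEmpty()) {
        ownerElements.add(element);
    }
}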
Actually, I just gave it another try, and this worked for me:

doc.select("div.thumb").select("div.meta").select("[data-track=owner]")
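
For anyone who wants to try the selector in isolation, here is a minimal sketch that runs it against a trimmed-down copy of the markup from the question. The class name `OwnerSelectorSketch`, the `SAMPLE_HTML` constant, and the helper methods are made up for illustration, and the base URI passed to `Jsoup.parse` is an assumption used only so that the relative `href` can be resolved. The three chained `select` calls are collapsed into one descendant selector, which should match the same elements here.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class OwnerSelectorSketch {

    // Trimmed-down copy of the markup from the question (hypothetical constant).
    static final String SAMPLE_HTML =
          "<div class=\"thumb \">"
        + "<div class=\"meta\">"
        + "<div class=\"title\"><a data-track=\"photo-click\" "
        + "href=\"/photos/sheilarogers13/15586482942/in/explore-2014-10-20\" "
        + "class=\"title\">Lake District</a></div>"
        + "<div class=\"attribution-block\"><span class=\"attribution\"><span>by </span>"
        + "<a data-track=\"owner\" href=\"/photos/sheilarogers13\" "
        + "title=\"sheilarogers22\" class=\"owner\">sheilarogers22</a></span></div>"
        + "<span class=\"inline-icons\">"
        + "<a data-track=\"favorite\" href=\"#\">fave</a>"
        + "<a data-track=\"lightbox\" href=\"#\">lightbox</a>"
        + "</span></div></div>";

    // Selects only the links whose data-track attribute equals "owner".
    static Elements selectOwners(String html) {
        // The base URI is an assumption; it lets absUrl() resolve relative hrefs.
        Document doc = Jsoup.parse(html, "https://www.flickr.com/");
        return doc.select("div.thumb div.meta a[data-track=owner]");
    }

    // Returns the visible text of each owner link.
    static List<String> ownerNames(String html) {
        List<String> names = new ArrayList<>();
        for (Element owner : selectOwners(html)) {
            names.add(owner.text());
        }
        return names;
    }

    // Returns the absolute profile URL of each owner link.
    static List<String> ownerUrls(String html) {
        List<String> urls = new ArrayList<>();
        for (Element owner : selectOwners(html)) {
            urls.add(owner.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> names = ownerNames(SAMPLE_HTML);
        List<String> urls = ownerUrls(SAMPLE_HTML);
        for (int i = 0; i < names.size(); i++) {
            System.out.println(names.get(i) + " -> " + urls.get(i));
        }
    }
}
```

If this prints one owner for the sample but the same selector finds nothing on the live page, the markup fetched by `Jsoup.connect` may differ from what the browser shows, so it is worth printing the fetched HTML to check.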