Scraping data from a web page between specific tags using jsoup

Time: 2017-08-18 14:11:02

Tags: java html css css-selectors jsoup

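I'm currently working on a program that collects the 5 most recent fanfiction stories added to my fandom on Ao3 (Archive of Our Own). These stories are then added to an ArrayList I've set up, which holds the past week's fanfiction submissions. At the end of each week, I plan to dump the contents of the ArrayList into a text file so I can paste it into a Reddit post on my subreddit. To prevent duplicates, I want to compare the newly parsed stories against the ones currently held in the ArrayList.

(Additional info: the bot will check the web page every 30 minutes.)

The part I'm stuck on is actually parsing the web page and getting the content from between the HTML tags.

I've looked into CSS selectors, but I'm still completely confused, since nearly all the examples are taken from a simple website such as IMDb.

From some basic research, it looks like the stories in the body I'm looking at are all inside an ordered list tag.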

<ol class="work index group">
    <li class="work blurb group" id="work_10504812" role="article">...</li>
    <li class="work blurb group" id="work_9656693" role="article">...</li>
    <li class="work blurb group" id="work_11814486" role="article">...</li>
    <!-- Goes on for ~20 more stories -->
    <li class="work blurb group" id="work_11687247" role="article">...</li>
</ol>

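So, to be clear, each list item is a single story located in the ordered list. Everything inside one list item looks like the following. (Ordered list tag added for context.)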

<ol class="work index group">
    <li class="work blurb group" id="work_10504812" role="article">
  <!--title, author, fandom-->
  <div class="header module">
    <h4 class="heading">
      <a href="/works/10504812">Pocket Healer</a>
      by

      <!-- do not cache -->
      <a rel="author" href="/users/OverNoot/pseuds/OverNoot">OverNoot</a> 
    </h4>
    <h5 class="fandoms heading">
      <span class="landmark">Fandoms:</span>
      <a class="tag" href="/tags/Overwatch%20(Video%20Game)/works">Overwatch (Video Game)</a>
      &nbsp;
    </h5>
    <!--required tags-->
    <ul class="required-tags">
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="rating-general-audience rating" title="General Audiences"><span class="text">General Audiences</span></span></a></li>
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="warning-no warnings" title="No Archive Warnings Apply"><span class="text">No Archive Warnings Apply</span></span></a></li>
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="category-femslash category" title="F/F"><span class="text">F/F</span></span></a></li>
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="complete-no iswip" title="Work in Progress"><span class="text">Work in Progress</span></span></a></li>
</ul>
    <p class="datetime">17 Aug 2017</p>
  </div>
  <!--warnings again, cast, freeform tags-->
  <h6 class="landmark heading">Tags</h6>
  <ul class="tags commas">
    <li class="warnings"><strong><a class="tag" href="/tags/No%20Archive%20Warnings%20Apply/works">No Archive Warnings Apply</a></strong></li><li class="relationships"><a class="tag" href="/tags/Fareeha%20%22Pharah%22%20Amari*s*Angela%20%22Mercy%22%20Ziegler/works">Fareeha "Pharah" Amari/Angela "Mercy" Ziegler</a></li><li class="characters"><a class="tag" href="/tags/Fareeha%20%22Pharah%22%20Amari/works">Fareeha "Pharah" Amari</a></li> <li class="characters"><a class="tag" href="/tags/Angela%20%22Mercy%22%20Ziegler/works">Angela "Mercy" Ziegler</a></li> <li class="characters"><a class="tag" href="/tags/Winston%20(Overwatch)/works">Winston (Overwatch)</a></li> <li class="characters"><a class="tag" href="/tags/Lena%20%22Tracer%22%20Oxton/works">Lena "Tracer" Oxton</a></li><li class="freeforms"><a class="tag" href="/tags/Tiny%20Pharah%20and%20Tiny%20Mercy/works">Tiny Pharah and Tiny Mercy</a></li> <li class="freeforms"><a class="tag" href="/tags/Fluff/works">Fluff</a></li> <li class="freeforms last"><a class="tag" href="/tags/Cute/works">Cute</a></li>
  </ul>
  <!--summary-->
    <h6 class="landmark heading">Summary</h6>
    <blockquote class="userstuff summary">
      <p>Angela and Fareeha wake up to find tiny alternate versions of themselves have appeared and are now imprinted on them. How will these tiny Pharahs and Mercies impact their work at Overwatch and more importantly how will it impact the feelings they have for each other.</p>
    </blockquote>
  <!--stats-->

  <dl class="stats">
      <dt class="language">Language:</dt>
      <dd class="language">English</dd>
    <dt class="words">Words:</dt>
    <dd class="words">35,143</dd>
    <dt class="chapters">Chapters:</dt>
    <dd class="chapters">10/11</dd>
    <dt class="comments">Comments:</dt>
    <dd class="comments"><a href="/works/10504812?show_comments=true&amp;view_full_work=true#comments">168</a></dd>
    <dt class="kudos">Kudos:</dt>
    <dd class="kudos"><a href="/works/10504812?view_full_work=true#comments">438</a></dd>
    <dt class="bookmarks">Bookmarks:</dt>
    <dd class="bookmarks"><a href="/works/10504812/bookmarks">35</a></dd>
    <dt class="hits">Hits:</dt>
    <dd class="hits">5890</dd>
  </dl>
</li>
</ol>

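Basically, I want to extract the title, author, URL, summary, and rating.

So far I've worked out where the items I want to extract are located, but I don't know how to go about extracting them.

Title: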

<a href="/works/10504812">Pocket Healer</a>

Author:

<a rel="author" href="/users/OverNoot/pseuds/OverNoot">OverNoot</a>

URL:

<li class="work blurb group" id="work_10504812" role="article">
<!--(http://archiveofourown.com/works/<the number after 'work_'>)-->

Summary:

<blockquote class="userstuff summary">
    <p> (SUMMARY GOES HERE) </p>
</blockquote>

Rating:

<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="rating-general-audience rating" title="General Audiences"><span class="text">General Audiences</span></span></a></li>

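Bonus question: is it possible to iterate through the contents of the ordered list in a for-loop-like manner?

The code I currently have set up to open the web page is as follows.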

    while (true) {
        try {

            String url = "http://archiveofourown.org/tags/Fareeha%20%22Pharah%22%20Amari*s*Angela%20%22Mercy%22%20Ziegler/works";
            Document doc = Jsoup.connect(url).get();

            //Returns element of webpage
            doc.select("<Narrow down to ordered list>");

            //Run for loop to run through first 5 items of 
            Thread.sleep(THIRTY_MINUTES);

        }
        catch (Exception ex) {
            ex.printStackTrace();
        }

    }

1 answer:

Answer 0 (score: 0)

You can iterate using the Document.select(String cssSelector) method, which returns Elements. For example, doc.select("ol.work > li") returns all li elements that are first-level children of the ol.work element. You can use it to iterate over all the stories.

Consider the following partial code:

    Elements ol = doc.select("ol.work > li");
    for (Element li : ol) {
        String title = li.select("h4.heading a").first().text();
        String author = li.select("h4.heading a[rel=author]").text();
        String id = li.attr("id").replaceAll("work_", "");
        String url = "http://archiveofourown.com/works/" + id;
        String summary = li.select("blockquote.summary").text();
        String rating = li.select("span.rating").text();

        System.out.println("Title: " + title);
        System.out.println("Author: " + author);
        System.out.println("ID: " + id);
        System.out.println("URL: " + url);
        System.out.println("Summary: " + summary);
        System.out.println("Rating: " + rating);
    }

In this example, we get all the li elements in the for loop and extract the expected content. As you can see, each data extraction uses the select method scoped to the current li element. The Element.text() method returns the body of the element as plain text, stripping any tags it contains.

Running the above code against the HTML from your question produces the following output:

    Title: Pocket Healer
    Author: OverNoot
    ID: 10504812
    URL: http://archiveofourown.com/works/10504812
    Summary: Angela and Fareeha wake up to find tiny alternate versions of themselves have appeared and are now imprinted on them. How will these tiny Pharahs and Mercies impact their work at Overwatch and more importantly how will it impact the feelings they have for each other.
    Rating: General Audiences
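The question also asks about keeping only the first five stories and skipping ones that are already in the weekly list. One possible way to do that, sketched below with standard-library classes only, is to key the stored stories by their work ID. The names WeeklyStories, Story, and addNew are hypothetical and not part of jsoup; the extracted strings would come from the select calls in this answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: holds the week's stories and rejects duplicates by work ID.
class WeeklyStories {

    // Hypothetical value type for the fields extracted from one <li>.
    static final class Story {
        final String id;
        final String title;
        final String author;

        Story(String id, String title, String author) {
            this.id = id;
            this.title = title;
            this.author = author;
        }
    }

    // Stories collected so far, keyed by work ID, kept in insertion order.
    private final Map<String, Story> seen = new LinkedHashMap<>();

    // Looks at the first `limit` parsed stories and stores those whose ID
    // has not been stored before. Returns only the newly added stories.
    List<Story> addNew(List<Story> parsed, int limit) {
        List<Story> added = new ArrayList<>();
        int count = Math.max(0, Math.min(limit, parsed.size()));
        for (Story s : parsed.subList(0, count)) {
            // putIfAbsent returns null only when the ID was not present yet.
            if (seen.putIfAbsent(s.id, s) == null) {
                added.add(s);
            }
        }
        return added;
    }

    int size() {
        return seen.size();
    }

    public static void main(String[] args) {
        WeeklyStories week = new WeeklyStories();
        List<Story> poll = Arrays.asList(
                new Story("10504812", "Pocket Healer", "OverNoot"));
        System.out.println(week.addNew(poll, 5).size()); // prints 1
        System.out.println(week.addNew(poll, 5).size()); // prints 0 (duplicate)
    }
}
```

Keying on the ID rather than the title means a renamed story is still recognized as the same work.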

I hope that helps.
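A side note on the polling loop in the question: because Thread.sleep sits inside the try block, a failed request jumps straight to the catch block and the loop retries immediately, without the 30-minute pause. One possible alternative, sketched here with standard-library classes only, is a ScheduledExecutorService. StoryPoller and start are hypothetical names, and the task passed in stands for the fetch-and-parse step.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper that runs a task repeatedly with a fixed pause between runs.
class StoryPoller {

    // Runs `task` right away, then again `delay` units after each run finishes.
    // Exceptions are caught in the wrapper because an uncaught exception
    // would cancel every later run of a scheduled task.
    static ScheduledExecutorService start(Runnable task, long delay, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                task.run();
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }, 0, delay, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Short demo delay so the program ends quickly; the bot itself would
        // pass 30 and TimeUnit.MINUTES and leave the scheduler running.
        ScheduledExecutorService scheduler = start(
                () -> System.out.println("checking for new stories"),
                100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        scheduler.shutdownNow();
    }
}
```

With this shape the pause always happens between runs, whether or not the previous request failed.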