Splitting jSoup scraping results

Date: 2016-07-28 09:58:50

Tags: java web-scraping jsoup

I'm scraping this link with the jSoup library in Java. My code works fine, but how can I split apart each element I get back?

Here is my code:

package javaapplication1;

import java.io.IOException;
import java.sql.SQLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class coba {

    public static void main(String[] args) throws SQLException {
        MasukDB db = new MasukDB();
        try {
            Document doc = null;
            for (int page = 1; page < 2; page++) {
                doc = Jsoup.connect("http://hackaday.com/page/" + page).get();
                System.out.println("title : " + doc.select(".entry-title>a").text() + "\n");
                System.out.println("link : " + doc.select(".entry-title>a").attr("href") + "\n");
                System.out.println("body : " + String.join("", doc.select(".entry-content p").text()) + "\n");
                System.out.println("date : " + doc.select(".entry-date>a").text() + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

In the output, every post on a page gets joined into a single line. How can I separate them? Also, how do I get the link of each article? I think my CSS selector for the link is still wrong. Thanks, folks.

1 Answer:

Answer 0 (score: 0)

 doc.select(".entry-title>a").text()

This searches the whole document and returns a list of links, from which you are grabbing the text nodes all at once. What you probably want instead is to grab each article, and then pull the relevant data out of each one.

    // Needed imports: org.jsoup.nodes.Element, org.jsoup.select.Elements,
    // and java.text.MessageFormat for the output formatting below.
    Document doc;
    for (int page = 1; page < 2; page++) {

        doc = Jsoup.connect("http://hackaday.com/page/" + page).get();

        // get a list of articles on page
        Elements articles = doc.select("main#main article");

        // iterate article list
        for (Element article : articles) {

            // find the article header, which includes title and date
            Element header = article.select("header.entry-header").first();

            // find and scrape title/link from header
            Element headerTitle = header.select("h1.entry-title > a").first();
            String title = headerTitle.text();
            String link = headerTitle.attr("href");

            // find and scrape date from header
            String date = header.select("div.entry-meta > span.entry-date > a").text();

            // find and scrape every paragraph in the article content
            // you probably will want to further refine the logic here
            // there may be paragraphs you don't want to include
            String body = article.select("div.entry-content p").text();

            // view results
            System.out.println(
                    MessageFormat.format(
                            "title={0} link={1} date={2} body={3}", 
                            title, link, date, body));
        }
    }
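To see why selecting on the whole document runs everything together, here is a minimal, self-contained sketch. The HTML string and class names are made up to mirror the page structure above; only the jsoup library is assumed:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    public static void main(String[] args) {
        // Two "articles" in one document, mimicking the page layout.
        Document doc = Jsoup.parse(
            "<article><h1 class='entry-title'><a href='/a'>First</a></h1></article>"
          + "<article><h1 class='entry-title'><a href='/b'>Second</a></h1></article>");

        // Selecting on the whole document merges the text of every match:
        System.out.println(doc.select(".entry-title > a").text());

        // Iterating per article keeps each result separate:
        for (Element article : doc.select("article")) {
            Element a = article.select(".entry-title > a").first();
            System.out.println(a.text() + " -> " + a.attr("href"));
        }
    }
}
```

The first `println` prints `First Second` on one line, while the loop prints one line per article, which is exactly the separation the question asks for.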

For more examples of how to scrape data like this, see CSS Selectors.