I'm using the jSoup library in Java to scrape from this link. My code works fine, but how do I split up each element that I get back?
Here is my code:
    package javaapplication1;

    import java.io.IOException;
    import java.sql.SQLException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class coba {
        public static void main(String[] args) throws SQLException {
            MasukDB db = new MasukDB();
            try {
                Document doc = null;
                for (int page = 1; page < 2; page++) {
                    doc = Jsoup.connect("http://hackaday.com/page/" + page).get();
                    System.out.println("title : " + doc.select(".entry-title>a").text() + "\n");
                    System.out.println("link : " + doc.select(".entry-title>a").attr("href") + "\n");
                    System.out.println("body : " + String.join("", doc.select(".entry-content p").text()) + "\n");
                    System.out.println("date : " + doc.select(".entry-date>a").text() + "\n");
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
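In the output, everything from one page of the site ends up on a single line. How can I separate the articles? Also, how do I get the link for each article? I think my CSS selector for the link is still wrong. Thanks.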
Answer 0 (score: 0)
doc.select(".entry-title>a").text()
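This searches the entire document and returns a list of every matching link, and text() then combines the text of all of them into one string. Likewise, attr("href") on that list returns the value from the first matching element only. That is why each page collapses into one line and you see a single link. What you probably want is to select each article first, then pull the relevant data out of each one.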
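To make the difference concrete, here is a minimal sketch that runs the same selector against the whole document and then against each article in turn. The inline HTML, the class name SelectScopeDemo, and the method names are invented for illustration; the markup is only loosely modeled on the selectors used in the question, not copied from the real site.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectScopeDemo {
    // Hypothetical two-article page (assumption: not the real site markup).
    static final String HTML =
          "<main id=\"main\">"
        + "<article><h1 class=\"entry-title\"><a href=\"/a\">First</a></h1></article>"
        + "<article><h1 class=\"entry-title\"><a href=\"/b\">Second</a></h1></article>"
        + "</main>";

    // Document-wide select: text() joins the text of every match.
    static String combinedTitles() {
        Document doc = Jsoup.parse(HTML);
        return doc.select(".entry-title>a").text();
    }

    // Document-wide select: attr() reads only the first match that has the attribute.
    static String firstLinkOnly() {
        Document doc = Jsoup.parse(HTML);
        return doc.select(".entry-title>a").attr("href");
    }

    // Per-article select: one title/link pair per article.
    static List<String> perArticle() {
        List<String> rows = new ArrayList<>();
        Document doc = Jsoup.parse(HTML);
        for (Element article : doc.select("main#main article")) {
            Element a = article.select(".entry-title > a").first();
            if (a == null) {
                continue; // article without a title link
            }
            rows.add(a.text() + " -> " + a.attr("href"));
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(combinedTitles());
        System.out.println(firstLinkOnly());
        perArticle().forEach(System.out::println);
    }
}
```

Scoping the second select to each article element is what keeps the title and link of one post together.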
    import java.io.IOException;
    import java.text.MessageFormat;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class ScrapeArticles {
        public static void main(String[] args) throws IOException {
            for (int page = 1; page < 2; page++) {
                Document doc = Jsoup.connect("http://hackaday.com/page/" + page).get();
                // get a list of articles on the page
                Elements articles = doc.select("main#main article");
                // iterate over the article list
                for (Element article : articles) {
                    // find the article header, which includes title and date
                    Element header = article.select("header.entry-header").first();
                    if (header == null) continue; // skip articles without the expected header
                    // find and scrape title/link from the header
                    Element headerTitle = header.select("h1.entry-title > a").first();
                    if (headerTitle == null) continue; // skip articles without a title link
                    String title = headerTitle.text();
                    String link = headerTitle.attr("href");
                    // find and scrape date from the header
                    String date = header.select("div.entry-meta > span.entry-date > a").text();
                    // find and scrape every paragraph in the article content
                    // you will probably want to refine the logic here further;
                    // there may be paragraphs you don't want to include
                    String body = article.select("div.entry-content p").text();
                    // view results
                    System.out.println(
                        MessageFormat.format(
                            "title={0} link={1} date={2} body={3}",
                            title, link, date, body));
                }
            }
        }
    }
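The comments in the code above note that the body selection may need refining. One possible way to do that, sketched below, is to walk the paragraphs one at a time, drop the ones you don't want, and join the rest with line breaks. The inline HTML, the class name BodyParagraphs, and the "share" class used as the skip rule are all invented for illustration; the real page may need a different filter.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class BodyParagraphs {
    // Hypothetical article markup (assumption: "share" is a made-up class to skip).
    static final String HTML =
          "<article><div class=\"entry-content\">"
        + "<p>First paragraph.</p>"
        + "<p>   </p>"
        + "<p class=\"share\">Share this post</p>"
        + "<p>Second paragraph.</p>"
        + "</div></article>";

    // Collect the wanted paragraphs of one article, one per line.
    static String body(Element article) {
        List<String> kept = new ArrayList<>();
        for (Element p : article.select("div.entry-content p")) {
            String text = p.text().trim();
            // skip empty paragraphs and ones marked with the hypothetical "share" class
            if (text.isEmpty() || p.hasClass("share")) {
                continue;
            }
            kept.add(text);
        }
        return String.join("\n", kept);
    }

    // Parse the sample markup and extract the body of its single article.
    static String demo() {
        Element article = Jsoup.parse(HTML).select("article").first();
        return body(article);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Handling each paragraph separately also keeps paragraph boundaries, which a single text() call on the whole selection would flatten into spaces.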
For more examples of how to scrape this kind of data, see CSS Selectors.