Collecting only the relevant links from a URL

Time: 2014-03-17 10:13:13

Tags: java solr web-crawler crawler4j

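I need to collect the relevant links from a URL. For example, given a link like http://beechplane.wordpress.com/, I need to collect the links that point to actual articles, i.e. links such as http://beechplane.wordpress.com/2012/11/07/the-95-confidence-of-nate-silver/ and http://beechplane.wordpress.com/2012/03/06/visualizing-probability-roulette/.

How can I get these links in Java? Is it possible to do this with a web crawler?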

1 answer:

Answer 0 (score: 0)

I use the jsoup library.

To get all <a> tags from a document:

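// doc is an org.jsoup.nodes.Document, e.g. obtained via Jsoup.connect(url).get()
Elements a = doc.select("a");
for (Element el : a) {
    // process element, e.g. read the link target
    String href = el.attr("href");
}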
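
The loop above yields every link on the page, while the question asks only for article links. One possible way to narrow the list is sketched below. It assumes that article permalinks follow the /YYYY/MM/DD/slug/ shape seen in the two example URLs from the question, which may not hold for other sites. The class name ArticleLinkFilter, the method filterArticleLinks and the siteRoot parameter are hypothetical names made up for this illustration; they are not part of jsoup. The sketch works on plain strings, so the href values should be absolute URLs (in jsoup, el.attr("abs:href") returns the absolute form when the document has a base URI).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper: keeps only hrefs that look like dated article permalinks.
final class ArticleLinkFilter {

    // Assumed permalink shape relative to the site root: YYYY/MM/DD/slug/
    // The slug may not contain '/', '?' or '#', so archive pages and
    // "#comments" anchors are rejected.
    private static final Pattern DATED_PATH =
            Pattern.compile("\\d{4}/\\d{2}/\\d{2}/[^/?#]+/?");

    // siteRoot is expected to end with '/', e.g. "http://beechplane.wordpress.com/".
    static List<String> filterArticleLinks(List<String> hrefs, String siteRoot) {
        // LinkedHashSet drops duplicates while preserving first-seen order.
        Set<String> unique = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (href == null || !href.startsWith(siteRoot)) {
                continue; // skip missing hrefs and links to other sites
            }
            String path = href.substring(siteRoot.length());
            if (DATED_PATH.matcher(path).matches()) {
                unique.add(href);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        String root = "http://beechplane.wordpress.com/";
        List<String> hrefs = Arrays.asList(
                root,
                root + "about/",
                root + "2012/11/",
                root + "2012/11/07/the-95-confidence-of-nate-silver/",
                root + "2012/11/07/the-95-confidence-of-nate-silver/#comments",
                root + "2012/03/06/visualizing-probability-roulette/");
        for (String link : filterArticleLinks(hrefs, root)) {
            System.out.println(link);
        }
    }
}
```

In the jsoup loop, each href would be added to a list and the list passed to filterArticleLinks once the loop finishes. A URL pattern is only a heuristic; a site with a different permalink scheme would need a different pattern, or a CSS selector targeting its article-title links.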