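I'm trying to exclude a list of links that I don't want to crawl.
I couldn't find anything useful in the documentation about skipping URLs specified by the user.
However, I was able to do it like this: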
if (!(link.attr("href").startsWith("https://blog.olark.com") ||
        link.attr("href").startsWith("http://www.olark.com") ||
        link.attr("href").startsWith("https://www.olark.com") ||
        link.attr("href").startsWith("https://olark.com") ||
        link.attr("href").startsWith("http://olark.com"))) {
    this.links.add(link.absUrl("href")); // get the absolute URL and add it to the links list
}
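Of course, that's not the right way to do it, so I put the links in a list and tried to iterate over it. However, it didn't exclude a single link (code below):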
List<String> exclude = Arrays.asList("https://blog.olark.com", "http://www.olark.com", "https://www.olark.com", "https://olark.com", "http://olark.com");
for (String string : exclude) {
    if (!link.attr("href").startsWith(string)) {
        this.links.add(link.absUrl("href")); // get the absolute URL and add it to the links list
    }
}
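So my question is: how do I skip a list of URLs? I'm thinking of something similar to the second code block I wrote, but I'm open to ideas or corrections.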
Answer 0 (score: 0)
You can select and remove all the unwanted links first. Then you can process the document without any further checks.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupQuestion51072084 {

    public static void main(final String[] args) throws IOException {
        Document doc = Jsoup.parse("<a href=\"http://www.olark.com\"></a>" +
                "<a href=\"https://www.olark.com\"></a>" +
                "<a href=\"https://google.pl\"></a>" +
                "<a href=\"http://olark.com/qwerty\"></a>" +
                "<a href=\"https://olark.com/asdf\"></a>" +
                "<a href=\"https://stackoverflow.com\"></a>");
        System.out.println("Document before modifications:\n" + doc);

        // select links having "olark.com" anywhere in href
        Elements links = doc.select("a[href*=olark.com]");
        System.out.println();
        System.out.println("Links to remove: " + links);
        System.out.println();

        // remove them from the document
        for (Element link : links) {
            link.remove();
        }
        System.out.println("Document without unwanted links:\n" + doc);
    }
}
The output is:
Document before modifications:
<html>
<head></head>
<body>
<a href="http://www.olark.com"></a>
<a href="https://www.olark.com"></a>
<a href="https://google.pl"></a>
<a href="http://olark.com/qwerty"></a>
<a href="https://olark.com/asdf"></a>
<a href="https://stackoverflow.com"></a>
</body>
</html>
Links to remove: <a href="http://www.olark.com"></a>
<a href="https://www.olark.com"></a>
<a href="http://olark.com/qwerty"></a>
<a href="https://olark.com/asdf"></a>
Document without unwanted links:
<html>
<head></head>
<body>
<a href="https://google.pl"></a>
<a href="https://stackoverflow.com"></a>
</body>
</html>
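If you would rather keep the list-based check from the question, note why the loop there excludes nothing: it tests the link against each prefix separately and adds it every time one prefix does not match. No URL starts with all five prefixes, so every link gets added, usually several times. The decision has to be made once per link, after checking the whole list. Below is a minimal sketch of that idea using only the standard library; the names LinkFilter, isExcluded and filterLinks are made up for illustration, and it assumes you have already collected the href values as strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of jsoup: filters URLs by a list of excluded prefixes.
final class LinkFilter {

    // Returns true if href starts with at least one of the excluded prefixes.
    static boolean isExcluded(String href, List<String> excludedPrefixes) {
        return excludedPrefixes.stream().anyMatch(href::startsWith);
    }

    // Keeps only the hrefs that match none of the excluded prefixes.
    static List<String> filterLinks(List<String> hrefs, List<String> excludedPrefixes) {
        List<String> kept = new ArrayList<>();
        for (String href : hrefs) {
            // One decision per link, made after checking the whole exclude list.
            if (!isExcluded(href, excludedPrefixes)) {
                kept.add(href);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> exclude = Arrays.asList(
                "https://blog.olark.com",
                "http://www.olark.com",
                "https://www.olark.com",
                "https://olark.com",
                "http://olark.com");
        List<String> hrefs = Arrays.asList(
                "http://www.olark.com",
                "https://google.pl",
                "https://olark.com/asdf",
                "https://stackoverflow.com");

        // prints [https://google.pl, https://stackoverflow.com]
        System.out.println(filterLinks(hrefs, exclude));
    }
}
```

In the crawler from the question, the same check would presumably wrap the existing call, along the lines of adding link.absUrl("href") only when isExcluded(link.attr("href"), exclude) is false. Unlike the a[href*=olark.com] selector above, this matches prefixes only, so a URL that merely contains "olark.com" somewhere else is kept.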