JSoup - excluding links

Date: 2018-06-27 22:06:51

Tags: java web-scraping jsoup

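I am trying to exclude a list of links that I do not want to crawl.

I could not find anything useful in the documentation about skipping URLs specified by the user.

However, I was able to get it working like this: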

if (!(link.attr("href").startsWith("https://blog.olark.com") ||
        link.attr("href").startsWith("http://www.olark.com") ||
        link.attr("href").startsWith("https://www.olark.com") ||
        link.attr("href").startsWith("https://olark.com") ||
        link.attr("href").startsWith("http://olark.com"))) {
    this.links.add(link.absUrl("href")); // get the absolute url and add it to links list.
}

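Of course, this is not the right way to do it, so I put the links in a list and tried to iterate over it. However, this did not exclude a single link (code below):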

List<String> exclude = Arrays.asList("https://blog.olark.com", "http://www.olark.com", "https://www.olark.com", "https://olark.com", "http://olark.com");
for (String string : exclude) {
    if (!link.attr("href").startsWith(string)) {
        this.links.add(link.absUrl("href")); // get the absolute url and add it to links list.
    }
}

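So my question is: how can I skip a list of URLs? I am thinking of something similar to the second code block I wrote, but I am open to ideas or corrections.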

1 answer:

Answer 0: (score: 0)

You can first select and remove all the unwanted links. After that, you can process the document without any further checks.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupQuestion51072084 {

    public static void main(final String[] args) throws IOException {
        Document doc = Jsoup.parse("<a href=\"http://www.olark.com\"></a>" +
                "<a href=\"https://www.olark.com\"></a>" +
                "<a href=\"https://google.pl\"></a>" +
                "<a href=\"http://olark.com/qwerty\"></a>" +
                "<a href=\"https://olark.com/asdf\"></a>" +
                "<a href=\"https://stackoverflow.com\"></a>");
        System.out.println("Document before modifications:\n" + doc);

        // select links having "olark.com" in href.
        Elements links = doc.select("a[href*=olark.com]"); 
        System.out.println();
        System.out.println("Links to remove: " + links);
        System.out.println();

        // remove them from the document    
        for (Element link : links) {
            link.remove();
        }
        System.out.println("Document without unwanted links:\n" + doc);
    }
}

The output is:

Document before modifications:
<html>
 <head></head>
 <body>
  <a href="http://www.olark.com"></a>
  <a href="https://www.olark.com"></a>
  <a href="https://google.pl"></a>
  <a href="http://olark.com/qwerty"></a>
  <a href="https://olark.com/asdf"></a>
  <a href="https://stackoverflow.com"></a>
 </body>
</html>

Links to remove: <a href="http://www.olark.com"></a>
<a href="https://www.olark.com"></a>
<a href="http://olark.com/qwerty"></a>
<a href="https://olark.com/asdf"></a>

Document without unwanted links:
<html>
 <head></head>
 <body>
  <a href="https://google.pl"></a>
  <a href="https://stackoverflow.com"></a>
 </body>
</html>
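
If you would rather keep the list of prefixes from the question, the likely reason the loop excluded nothing is that it adds the link once for every prefix it does not start with. A URL starts with at most one of the five prefixes, so even an unwanted link is still added four times. The following is a minimal sketch of one possible fix: add the link only when none of the prefixes match. It uses plain strings instead of jsoup elements so that it runs on its own; the class name `ExcludeByPrefix` and the helpers `isExcluded` and `keepAllowed` are made up for illustration and are not part of jsoup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExcludeByPrefix {

    // Prefixes copied from the question.
    static final List<String> EXCLUDE = Arrays.asList(
            "https://blog.olark.com", "http://www.olark.com", "https://www.olark.com",
            "https://olark.com", "http://olark.com");

    // Hypothetical helper: true when href starts with at least one excluded prefix.
    static boolean isExcluded(String href) {
        return EXCLUDE.stream().anyMatch(href::startsWith);
    }

    // Hypothetical helper: keeps only the hrefs that match none of the prefixes.
    static List<String> keepAllowed(List<String> hrefs) {
        List<String> kept = new ArrayList<>();
        for (String href : hrefs) {
            if (!isExcluded(href)) {
                kept.add(href); // added at most once per link
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "http://www.olark.com", "https://google.pl",
                "https://olark.com/asdf", "https://stackoverflow.com");
        System.out.println(keepAllowed(hrefs));
        // prints [https://google.pl, https://stackoverflow.com]
    }
}
```

In the crawler from the question, the same check would presumably wrap the existing call, along the lines of `if (!isExcluded(link.absUrl("href"))) this.links.add(link.absUrl("href"));`. Testing the absolute URL rather than the raw `href` attribute also covers relative links, assuming the document was parsed with a base URI.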