Question

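I am building a web crawler in Java with jsoup (following this tutorial).

The problem is that the crawler loops over every link Element on a page, and some of those links are email addresses. When I call Jsoup.connect(URL) on an email address, I get an error telling me that only http or https requests are supported.

How can I stop my program from recursing when it reaches an email address link?

Here is the main code: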

public class Main {

    public static DB db = new DB();

    public static void main(String[] args) throws SQLException, IOException {
        db.runSql2("TRUNCATE Record;");
        processPage("http://www.mit.edu");
    }

    public static void processPage(String URL) throws SQLException, IOException {
        String sql = "select * from Record where URL = '" + URL + "'";
        ResultSet rs = db.runSql(sql);
        if (rs.next()) {

        } else {
            sql = "INSERT INTO  `Crawler`.`Record` " + "(`URL`) VALUES " + "(?);";
            PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
            stmt.setString(1, URL);
            stmt.execute();

            Document doc = Jsoup.connect(URL).get();

            if (doc.text().contains("research")) {
                System.out.println(URL);
            }

            Elements questions = doc.select("a[href]");
            for (Element link : questions) {
                if (link.attr("href").contains("mit.edu")) {
                    System.out.println(link.attr("abs:href"));
                    processPage(link.attr("abs:href"));
                }
            }

        }
    }

}

Answer 1

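You can check whether a link is a web URL by testing whether it starts with http. Because you have the absolute URL (via abs:href), a value that starts with http can only be an http or https URL (and not a link to an email address, an FTP site, or some other junk you don't want).

For example, update the for loop to: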

for (Element link : questions) {
    String href = link.attr("abs:href");
    if (href.contains("mit.edu") && href.startsWith("http")) {
        System.out.println(href);
        processPage(href);
    }
}

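Also, I tend to put a try/catch around each processPage call, so that if you hit an error while fetching a page (a network timeout, for instance), your whole application doesn't crash.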
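A minimal sketch of that try/catch idea, under stated assumptions: the processPage below is a hypothetical stand-in that simulates a timeout instead of calling jsoup, and the class name SafeCrawlSketch and the helper crawlAll are invented for illustration. Only IOException is caught here; the real processPage in the question also declares SQLException, which would need its own handling.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SafeCrawlSketch {

    // Hypothetical stand-in for the question's processPage(): it fails for
    // any URL containing "timeout" so the error path runs without a network.
    static void processPage(String url) throws IOException {
        if (url.contains("timeout")) {
            throw new IOException("simulated network timeout for " + url);
        }
        System.out.println("processed " + url);
    }

    // Calls processPage for each link and collects the URLs that failed,
    // so one bad page does not abort the rest of the crawl.
    static List<String> crawlAll(List<String> links) {
        List<String> failed = new ArrayList<>();
        for (String href : links) {
            try {
                processPage(href);
            } catch (IOException e) {
                System.err.println("Skipping " + href + ": " + e.getMessage());
                failed.add(href);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> failed = crawlAll(Arrays.asList(
                "http://www.mit.edu/a",
                "http://www.mit.edu/timeout",
                "http://www.mit.edu/b"));
        System.out.println("failed: " + failed);
    }
}
```

In the crawler from the question, the same pattern would wrap the recursive processPage(href) call inside the for loop.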

Answer 2

You need to test !link.attr("abs:href").startsWith("mailto:").
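As a rough illustration of how this test behaves, here is a small sketch that compares a mailto-only check with an http/https check. The class LinkFilter and both method names are hypothetical, and the case-insensitive comparison is an addition on my part (URI schemes are case-insensitive), not something the answer specifies.

```java
class LinkFilter {

    // True when the absolute href uses the mailto: scheme (any letter case).
    static boolean isMailto(String absHref) {
        return absHref.regionMatches(true, 0, "mailto:", 0, 7);
    }

    // True when the absolute href starts with http:// or https:// (any letter case).
    static boolean isHttp(String absHref) {
        return absHref.regionMatches(true, 0, "http://", 0, 7)
                || absHref.regionMatches(true, 0, "https://", 0, 8);
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.mit.edu/research",
            "mailto:someone@mit.edu",
            "ftp://ftp.mit.edu/file"
        };
        for (String href : samples) {
            System.out.println(href + " notMailto=" + !isMailto(href) + " http=" + isHttp(href));
        }
    }
}
```

Note the difference in scope: excluding mailto: alone still lets other non-http schemes such as ftp: through, whereas requiring an http or https prefix rejects them as well.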

Answer 3

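You're already quite close, and this feels like an assignment, so I'll just give you a nudge rather than a complete answer.

You are checking whether it is an mit.edu page: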

if (link.attr("href").contains("mit.edu")) {
    System.out.println(link.attr("abs:href"));
    processPage(link.attr("abs:href"));
}

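Now you need an additional condition that accepts only links starting with http or https. Check out the String.startsWith() method, and use it to check the hyperlink's value before you call processPage.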

Web crawler gets stuck on email links