Crawling the web and storing links

Date: 2014-10-20 07:14:12

Tags: java web web-crawler

I want to create a thread that crawls all the links of a website and stores them in a LinkedHashSet, but when I print the size of this LinkedHashSet, nothing is printed. I've just started learning about crawling! I'm referencing The Art of Java. Here is my code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.logging.Level;
import java.util.logging.Logger;

public class TestThread {

    public void crawl(URL url) {
        try {

            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openConnection().getInputStream()));
            String line = reader.readLine();
            LinkedHashSet toCrawlList = new LinkedHashSet();

            while (line != null) {
                toCrawlList.add(line);
                System.out.println(toCrawlList.size());
            }
        } catch (IOException ex) {
            Logger.getLogger(TestThread.class.getName()).log(Level.SEVERE, null, ex);
        }

    }

    public static void main(String[] args) {
        final TestThread test1 = new TestThread();
        Thread thread = new Thread(new Runnable() {
           public void run(){
               try {
                   test1.crawl(new URL("http://stackoverflow.com/"));
               } catch (MalformedURLException ex) {
                   Logger.getLogger(TestThread.class.getName()).log(Level.SEVERE, null, ex);
               }
           } 
        });
    }
}

1 Answer:

Answer 0 (score: 0):

In your code, line is read only once, before the loop, so the while condition never changes and the loop never terminates. You should populate your list like this:

while ((line = reader.readLine()) != null) {
   toCrawlList.add(line);
}
System.out.println(toCrawlList.size());

If that doesn't work, try setting a breakpoint in your code and check whether the reader actually contains anything.
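
For reference, here is a minimal self-contained sketch of the posted class with that read loop applied. Note that main as posted constructs the Thread but never calls thread.start(), so crawl() never runs at all and nothing can be printed; the sketch starts the thread as well. (It still stores raw HTML lines rather than links; extracting the actual href values would be a separate step.)

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.logging.Level;
import java.util.logging.Logger;

public class TestThread {

    public void crawl(URL url) {
        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openConnection().getInputStream()))) {
            LinkedHashSet<String> toCrawlList = new LinkedHashSet<String>();
            String line;
            // read a fresh line on every iteration so the loop can terminate
            while ((line = reader.readLine()) != null) {
                toCrawlList.add(line);
            }
            // print once, after the whole page has been read
            System.out.println(toCrawlList.size());
        } catch (IOException ex) {
            Logger.getLogger(TestThread.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public static void main(String[] args) {
        final TestThread test1 = new TestThread();
        Thread thread = new Thread(new Runnable() {
            public void run() {
                try {
                    test1.crawl(new URL("http://stackoverflow.com/"));
                } catch (MalformedURLException ex) {
                    Logger.getLogger(TestThread.class.getName()).log(Level.SEVERE, null, ex);
                }
            }
        });
        thread.start(); // without this call the Runnable never executes
    }
}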