How to check size() or isEmpty() on a ConcurrentLinkedQueue

Asked: 2016-05-28 20:34:43

Tags: java multithreading concurrency web-crawler executorservice

I am trying to build a simple structure for a web crawler in Java. So far the prototype just tries to do the following:

  • Initialize a queue with a list of starting URLs
  • Take a URL out of the queue and submit it to a new thread
  • Do some work and then add that URL to a set of already-visited URLs

For the queue of starting URLs, I am using a ConcurrentLinkedQueue for synchronization. To spawn new threads I am using an ExecutorService.

But while creating new threads, the application needs to check whether the ConcurrentLinkedQueue is empty or not. I tried using:

  • .size()
  • .isEmpty()

But neither seems to return the true state of the ConcurrentLinkedQueue.

The problem is in the block below:

while (!crawler.getUrl_horizon().isEmpty()) {
    workers.submitNewWorkerThread(crawler);
}

So even when the input has only 2 URLs, the ExecutorService creates every thread up to its limit.

Is there something wrong with the way multithreading is implemented here? If not, what is a better way to check the state of the ConcurrentLinkedQueue?

The class that starts the application:

public class CrawlerApp {

    private static Crawler crawler;

    public static void main(String[] args) {
        crawler = new Crawler();
        initializeApp();
        startCrawling();

    }

    private static void startCrawling() {
        crawler.setUrl_visited(new HashSet<URL>());
        WorkerManager workers = WorkerManager.getInstance();
        while (!crawler.getUrl_horizon().isEmpty()) {
            workers.submitNewWorkerThread(crawler);
        }
        try {
            workers.getExecutor().shutdown();
            workers.getExecutor().awaitTermination(10, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    private static void initializeApp() {

        Properties config = new Properties();
        try {
            config.load(CrawlerApp.class.getClassLoader().getResourceAsStream("url-horizon.properties"));
            String[] horizon = config.getProperty("urls").split(",");
            ConcurrentLinkedQueue<URL> url_horizon = new ConcurrentLinkedQueue<>();
            for (String link : horizon) {
                URL url = new URL();
                url.setURL(link);
                url_horizon.add(url);
            }
            crawler.setUrl_horizon(url_horizon);
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

}

Crawler.java maintains the queue of URLs and the set of already-visited URLs.

public class Crawler implements Runnable {
    private ConcurrentLinkedQueue<URL> url_horizon;

    public void setUrl_horizon(ConcurrentLinkedQueue<URL> url_horizon) {
        this.url_horizon = url_horizon;
    }

    public ConcurrentLinkedQueue<URL> getUrl_horizon() {
        return url_horizon;
    }

    private Set<URL> url_visited;

    public void setUrl_visited(Set<URL> url_visited) {
        this.url_visited = url_visited;
    }

    public Set<URL> getUrl_visited() {
        return Collections.synchronizedSet(url_visited);
    }

    @Override
    public void run() {
        URL url = nextURLFromHorizon();
        scrap(url);
        addURLToVisited(url);

    }

    private URL nextURLFromHorizon() {
        if (!getUrl_horizon().isEmpty()) {
            URL url = url_horizon.poll();
            if (getUrl_visited().contains(url)) {
                return nextURLFromHorizon();
            }
            System.out.println("Horizon URL:" + url.getURL());
            return url;

        }
        return null;

    }

    private void scrap(URL url) {
        new Scrapper().scrap(url);
    }

    private void addURLToVisited(URL url) {
        System.out.println("Adding to visited set:" + url.getURL());
        getUrl_visited().add(url);
    }

}

URL.java is just a class with a private String url that overrides hashCode() and equals().
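
For reference, here is a minimal sketch of what such a class might look like; the field and method names are inferred from how it is used above, so the poster's actual file may differ:

import java.util.Objects;

// Minimal sketch of URL.java as described: a single String field
// plus hashCode()/equals() so visited-set lookups work correctly.
public class URL {
    private String url;

    public String getURL() {
        return url;
    }

    public void setURL(String url) {
        this.url = url;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof URL)) return false;
        return Objects.equals(url, ((URL) o).url);
    }

    @Override
    public int hashCode() {
        return Objects.hash(url);
    }
}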

Also, Scrapper.scrap() only has a dummy implementation so far:

public void scrap(URL url){
        System.out.println("Done scrapping:"+url.getURL());
    }

WorkerManager creates the threads:

public class WorkerManager {
    private static final Integer WORKER_LIMIT = 10;
    private final ExecutorService executor = Executors.newFixedThreadPool(WORKER_LIMIT);

    public ExecutorService getExecutor() {
        return executor;
    }

    private static volatile WorkerManager instance = null;

    private WorkerManager() {
    }

    public static WorkerManager getInstance() {
        if (instance == null) {
            synchronized (WorkerManager.class) {
                if (instance == null) {
                    instance = new WorkerManager();
                }
            }
        }

        return instance;
    }

    public Future submitNewWorkerThread(Runnable run) {
        return executor.submit(run);
    }

}

1 Answer:

Answer 0 (score: 2)

Problem

The reason you end up creating more threads than there are URLs in the queue is that it is possible (indeed very likely) that none of the Executor's threads has started by the time you have gone through the while loop many times.

Whenever you work with threads, you should always keep in mind that threads are scheduled independently and run at their own pace unless you explicitly synchronize them. In this case, a thread can start at any time after the submit() call, even though you seem to expect each one to start and get past nextURLFromHorizon before the next iteration of the while loop.
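
Here is a minimal sketch of that race (my own illustration, not the poster's code): the main thread keeps seeing a non-empty queue and keeps submitting tasks until some worker finally runs and drains it.

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Demonstrates that the main thread can spin through the isEmpty()
// loop many times before any worker is scheduled and polls the queue.
public class SubmitRaceDemo {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        queue.add("http://example.com/a");
        queue.add("http://example.com/b");

        ExecutorService executor = Executors.newFixedThreadPool(10);
        AtomicInteger submitted = new AtomicInteger();

        while (!queue.isEmpty()) {               // check-then-act is not atomic
            submitted.incrementAndGet();
            executor.submit(() -> queue.poll()); // may run long after submit()
        }

        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
        // Usually prints a number far larger than the 2 URLs we started with.
        System.out.println("Tasks submitted for 2 URLs: " + submitted.get());
    }
}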

Solution

Consider dequeuing the URL from the queue before submitting a CrawlerTask to the executor. I also suggest defining a CrawlerTask that is handed to the Executor once per URL, rather than a Crawler that is submitted over and over. In a design like this you would not even need a thread-safe container for the URLs waiting to be scraped:

class CrawlerTask implements Runnable {
    URL url;

    CrawlerTask(URL url) {
        this.url = url;
    }

    @Override
    public void run() {
        scrape(url);
        // add url to visited?
    }
}

class Crawler {
    ExecutorService executor;
    Queue<URL> urlHorizon;
    // ...

    private void startCrawling() {
        while (!urlHorizon.isEmpty()) {
            executor.submit(new CrawlerTask(urlHorizon.poll()));
        }
        // ...
    }
}
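
For completeness, a runnable sketch of how this might be wired end to end; the example.com URLs and the println task body are placeholders standing in for scrape(url), and the shutdown handling mirrors the poster's startCrawling():

import java.util.LinkedList;
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Dequeue-then-submit: exactly one task is created per URL, so the
// emptiness check never races against the workers.
public class CrawlerMain {
    public static void main(String[] args) throws InterruptedException {
        // A plain queue suffices here: only the main thread ever dequeues.
        Queue<String> urlHorizon = new LinkedList<>();
        urlHorizon.add("http://example.com/a");
        urlHorizon.add("http://example.com/b");

        ExecutorService executor = Executors.newFixedThreadPool(10);

        while (!urlHorizon.isEmpty()) {
            String url = urlHorizon.poll();
            executor.submit(() -> System.out.println("Done scrapping:" + url));
        }

        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.MINUTES);
    }
}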
