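I am building a crawler-like tool that finds images in web pages. The producer generates links, and the consumer connects to each link to look for images. Because the producer generates a large number of links, the consumer takes a long time. So I put the consumer in an ExecutorService, but I see no improvement in the time the consumer takes. Please help. My code is below.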
@Service
@Qualifier("crawlerService")
public class CrawlerService {
    @Autowired
    @Qualifier("loggerService")
    LoggerService loggerService;

    @Autowired
    @Qualifier("imageTypeExtensionCombo")
    ImageTypeExtensionCombo imageTypeExtensionCombo;

    public List<String> startCrawler(List<String> links, List<String> images, URL url, String protocol, String protocolHost) throws Exception {
        LinkQueue queue = new LinkQueue(links);
        LinkProducer producer = new LinkProducer(links, url, protocol, protocolHost, queue, loggerService);
        LinkConsumer consumer = new LinkConsumer(links, images, url, protocol, protocolHost, loggerService, queue);
        ExecutorService executorService = Executors.newFixedThreadPool(4);
        executorService.submit(consumer);
        producer.start();
        //consumer.start();
        Thread.currentThread().join();
        executorService.shutdown();
        return images;
    }
}
LinkProducer class
public class LinkProducer extends Thread {
    private List<String> anchorList;
    private URL url;
    private String protocol;
    private String protocolHost;
    private UrlValidator urlValidator = new UrlValidator();
    private LinkQueue queue;
    private LoggerService loggerService;
    private int MAX_QUEUE_SIZE = 2;
    private int counter = 0;
    private boolean stopThread = false;
    private String HTML_TYPE = "HTML";
    private String HTML_CONTENT_TYPE = "text/html";
    private String IMAGE_TYPE = "IMAGE";
    private String NON_HTML_NON_IMAGE_TYPE = "OTHERS";

    public LinkProducer(List<String> anchorList, URL url, String protocol, String protocolHost, LinkQueue queue, LoggerService loggerService) {
        super(protocolHost.replace(protocol, "").replaceAll("/", ""));
        this.anchorList = anchorList;
        this.url = url;
        this.protocol = protocol;
        this.protocolHost = protocolHost;
        this.queue = queue;
        this.loggerService = loggerService;
    }

    public void run() {
        int i = 0;
        while (true) {
            List<String> anchors = null;
            loggerService.log("Producer Thread : " + (++i));
            try {
                anchors = produce();
            } catch (Exception ex) {
                loggerService.log("Exception occured in producer thread : " + ex.getMessage());
                ex.printStackTrace();
                if (stopThread) {
                    break;
                }
            }
            if (stopThread) {
                break;
            }
            if (anchors != null && anchors.size() > 0) {
                Iterator<String> iter = anchors.iterator();
                while (iter.hasNext()) {
                    synchronized (queue) {
                        queue.enQueue(iter.next());
                    }
                }
            }
        }
    }
}
LinkConsumer class
public class LinkConsumer extends Thread {
    private List<String> anchorList;
    private List<String> imageList;
    private URL url;
    private String protocol;
    private String protocolHost;
    private LinkQueue queue;
    private LoggerService loggerService;
    private UrlValidator urlValidator = new UrlValidator();
    private String HTML_TYPE = "HTML";
    private String HTML_CONTENT_TYPE = "text/html";
    private String IMAGE_TYPE = "IMAGE";
    private String NON_HTML_NON_IMAGE_TYPE = "OTHERS";

    public LinkConsumer(List<String> anchorList, List<String> imageList, URL url, String protocol, String protocolHost, LoggerService loggerService, LinkQueue queue) {
        super(protocolHost.replace(protocol, "").replaceAll("/", ""));
        this.anchorList = anchorList;
        this.imageList = imageList;
        this.url = url;
        this.protocol = protocol;
        this.protocolHost = protocolHost;
        this.queue = queue;
        this.loggerService = loggerService;
    }

    public void run() {
        int i = 0;
        while (!queue.isEmpty()) {
            List<String> images = null;
            loggerService.log("Consumer Thread : " + (++i));
            try {
                images = consume();
            } catch (Exception ex) {
                loggerService.log("Exception occured in consumer thread : " + ex.getMessage());
                ex.printStackTrace();
            }
            if (images != null && images.size() > 0) {
                Iterator<String> iter = images.iterator();
                while (iter.hasNext()) {
                    imageList.add(iter.next());
                }
            }
        }
    }
}
Thanks.
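Answer 0 (score: 2)
You create and submit only one LinkConsumer, so you have only one worker, even though the pool has four threads.
To get real parallelism, you need to create and submit more LinkConsumer instances.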
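As a minimal sketch of this suggestion, the following submits one queue-draining task per pool thread instead of a single consumer. `ParallelConsumers`, `crawl`, and `fetchImage` are hypothetical stand-ins for the question's `LinkConsumer` and its `consume()` logic. The sketch also assumes that all links are queued before the workers start, which is simpler than the question's concurrently running producer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the question's consumer side.
final class ParallelConsumers {

    // Placeholder for the real work (connecting to the link and collecting images).
    static String fetchImage(String link) {
        return link + "/image.png";
    }

    // Drains `links` with `workers` parallel tasks and returns the collected results.
    static List<String> crawl(List<String> links, int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(links);
        Queue<String> images = new ConcurrentLinkedQueue<>(); // safe for concurrent add
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // One task per pool thread; each keeps polling until the queue is empty.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                String link;
                while ((link = queue.poll()) != null) {
                    images.add(fetchImage(link));
                }
            });
        }

        pool.shutdown(); // accept no new tasks, let the submitted ones finish
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(images);
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("http://a.example", "http://b.example", "http://c.example");
        System.out.println(crawl(links, 4).size() + " images found");
    }
}
```

Note that the sketch waits with `awaitTermination` rather than `Thread.currentThread().join()`: a thread that joins itself never returns. The results go into a concurrent collection because several workers add to it at once.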
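Answer 1 (score: 1)
Multithreading by itself does not buy you much. In fact, it only adds complexity when you create more threads than your hardware can handle.
Multithreading yields significant gains only when you use it effectively. If you keep creating threads this way, you will not see any performance improvement.
Your hardware, especially the processor specs, and the amount of data you write to disk are the main limiting factors that determine the performance you get.
I suggest the following. Use multiple machines. One machine acts as the producer and writes all the URLs, images, or whatever you need to a database. The client systems read URLs from the DB and fetch the data from the source.
Technically, you then have multiple systems working, and each machine can have ~10 active threads at a time. You only need to write the code once and run the same code on every machine. You can also use the producer machine as a consumer.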
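To illustrate the idea of capping the number of active threads per machine, here is a small sketch using a fixed-size pool. The class `BoundedPoolDemo`, the method `peakConcurrency`, and the 5 ms sleep that stands in for fetching a URL are all assumptions made for illustration; the cap of 10 simply mirrors the rough figure mentioned above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: a fixed pool never runs more than maxThreads tasks at once.
final class BoundedPoolDemo {

    // Runs `tasks` short jobs on `maxThreads` threads and returns the
    // highest number of jobs that were observed running simultaneously.
    static int peakConcurrency(int tasks, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max); // remember the maximum seen
                try {
                    Thread.sleep(5); // placeholder for fetching one URL
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("cores available: " + Runtime.getRuntime().availableProcessors());
        System.out.println("peak concurrent tasks: " + peakConcurrency(50, 10));
    }
}
```

However many tasks are queued, the pool size bounds how many run at the same time, so the cap can be tuned to what one machine handles comfortably.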
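Answer 2 (score: 1)
You can try something like this to create a new thread. But I am not sure that creating new threads will save much time. You will also need better hardware.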
public boolean secondThread() {
    Thread t = new Thread() {
        public void run() {
            // do something
        }
    };
    t.start();
    return true;
}
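Whether extra threads actually save time can be checked by measuring. The following is a rough sketch, not a benchmark: `CrawlTiming` and `timeRun` are hypothetical names, and `Thread.sleep` is only a stand-in for the network wait of a real fetch.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical timing helper for comparing different thread counts.
final class CrawlTiming {

    // Runs `tasks` simulated fetches on `threads` threads and
    // returns the elapsed wall-clock time in milliseconds.
    static long timeRun(int tasks, int threads, long sleepMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();

        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(sleepMillis); // stand-in for waiting on the network
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        System.out.println("1 thread:  " + timeRun(8, 1, 50) + " ms");
        System.out.println("4 threads: " + timeRun(8, 4, 50) + " ms");
    }
}
```

For work that mostly waits, as in this simulation, more threads shorten the total time. CPU-bound or disk-bound work may behave differently, so it is worth measuring with the real workload.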