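I am trying to scrape some information from multiple websites.
The Java application I wrote collects information from a single website, which is selected by a command-line argument given at startup.
I run about 5 processes at the same time (so I am crawling 5 websites), and in each process I run about 5 threads to collect the information. The problem is that the threads seem to finish early, skipping some of the information instead of collecting it.
Here is the part that submits the tasks to a fixed pool of 5 threads (please read the comments for an explanation of the problem):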
String className = collector.getClass().getName();
ExecutorService executor = Executors.newFixedThreadPool(collector.getMaxNoOfThreads());
for (ListPageLink listPageLink : listPageLinks) {
    Runnable runnable = () -> {
        List<MetaBook> metaBooks;
        Collector newCollector = null;
        try {
            newCollector = ((Collector) Class.forName(className)
                    .getDeclaredConstructor().newInstance());
            // Here the collector returns the collected metaBooks successfully,
            // and I log them with a logger.
            metaBooks = newCollector.collectMetaBooksFromSingleLPL(listPageLink);
            // However, this loop runs for only a couple of metaBooks and then the
            // thread suddenly quits. Because of this, I lose some of the information.
            for (MetaBook metaBook : metaBooks) {
                newCollector.collectBook(metaBook);
            }
            newCollector.saveOrUpdateLPLInDB(listPageLink);
            ThreadHelper.sleep(newCollector.getRequestDelayMs());
        } catch (Exception e1) {
            if (newCollector != null)
                newCollector.logError(e1);
        }
    }; // end of Runnable
    executor.execute(runnable);
}
collector.finishExecutorShutDown(executor);
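One thing I have not ruled out is that the task ends with a `Throwable` that is not an `Exception` (for example an `Error`), which `catch (Exception e1)` would not see. The following is only a minimal sketch for checking that idea, not my actual code: the class `SilentTaskFailureDemo` and the method `runAndCapture` are hypothetical names, and the `StackOverflowError` is simulated. It submits the task with `submit` instead of `execute` and reads the failure back through `Future.get()`.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class; not part of the crawler code above.
final class SilentTaskFailureDemo {

    // Runs the task on a pool thread and returns the Throwable that ended it,
    // or null if the task completed normally.
    static Throwable runAndCapture(Runnable task) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            executor.submit(task).get();
            return null;
        } catch (ExecutionException e) {
            // The Future wraps whatever the task threw, including Errors.
            return e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        Runnable task = () -> {
            try {
                // Simulated failure: an Error is not an Exception.
                throw new StackOverflowError("simulated");
            } catch (Exception e) {
                // Never reached for an Error, so nothing is logged here.
                System.out.println("caught: " + e);
            }
        };
        System.out.println("Task ended with: " + runAndCapture(task));
    }
}
```

If something like this reported an `Error` in the real application, that would explain why the loop stops without any entry in the error log.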
Here are the relevant methods in Collector.java:
public List<MetaBook> collectMetaBooksFromSingleLPL(ListPageLink listPageLink) {
    List<MetaBook> metaBooks = new ArrayList<>();
    MDC.put("classname", logClassName);
    try {
        logger.info("ListPageLink: " + listPageLink);
        metaBooks = crawler.extractMetaBooks(listPageLink.getLink());
        for (int i = 0; i < metaBooks.size(); i++)
            logger.info((i + 1) + ") " + metaBooks.get(i).getLink());
        logger.info("MetaBooks Size: " + metaBooks.size());
    } catch (Exception e) {
        logger.error("Error", e);
    }
    MDC.remove("classname");
    return metaBooks;
}

public Book collectBook(MetaBook metaBook) {
    if (sellerId == 0)
        sellerId = saveSellerInDbIfNotExists(seller);
    Book book = null;
    MDC.put("classname", logClassName);
    try {
        logger.info("MetaBook: " + metaBook);
        Book foundBook = hibernateBookDao.find(metaBook.getLink());
        if (foundBook != null) {
            book = new Book(metaBook);
            hibernateBookDao.updatePrice(book);
            foundBook = hibernateBookDao.find(metaBook.getLink());
            book = foundBook;
        } else {
            book = bookDataExtractor.extractBookData(metaBook);
            if (book.isAlive()) {
                book.setSellerId(sellerId);
                hibernateBookDao.saveOrUpdate(book);
                logger.info("Book Name: " + book.getName());
            }
        }
    } catch (Exception e) {
        logger.error("Error", e);
    }
    MDC.remove("classname");
    return book;
}

public void finishExecutorShutDown(ExecutorService executor) {
    executor.shutdown();
    try {
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
        logger.error("Error", e);
    }
}
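Because the tasks are started with `execute`, anything that escapes the `Runnable` should end up in the worker thread's uncaught exception handler. As a rough sketch of how the pool could be built so that such failures become visible (the names `UncaughtLoggingDemo`, `captureUncaught` and `collector-worker` are hypothetical and not part of my code), a custom `ThreadFactory` can install a handler on every pool thread:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; not part of the crawler code above.
final class UncaughtLoggingDemo {

    // Executes the task on a pool thread and returns the Throwable that
    // reached the uncaught exception handler, or null if none arrived
    // within five seconds.
    static Throwable captureUncaught(Runnable task) {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        CountDownLatch failed = new CountDownLatch(1);

        // Every thread created for the pool gets the same handler.
        ThreadFactory factory = runnable -> {
            Thread thread = new Thread(runnable, "collector-worker");
            thread.setUncaughtExceptionHandler((t, error) -> {
                seen.set(error); // a real handler would log this
                failed.countDown();
            });
            return thread;
        };

        ExecutorService executor = Executors.newFixedThreadPool(1, factory);
        executor.execute(task);
        executor.shutdown();
        try {
            failed.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        Throwable error = captureUncaught(() -> {
            throw new OutOfMemoryError("simulated");
        });
        System.out.println("Handler saw: " + error);
    }
}
```

In the real application the handler would write to the same logger as the rest of the collector, so a thread that dies this way would leave a trace next to the "MetaBook:" lines.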
Here is a sample log file in which you can see that the metaBooks were collected, but not all of them were passed to the collectBook method:
10:39:30:421 | INFO | pool-2-thread-9 | | ListPageLink: https://www..com/index.php?p=Products&ctg_id=2021&sort_type=rel-desc&page=2164
10:39:32:885 | INFO | pool-2-thread-12 | | 1) https://www..com/m-alice-legrows/bizenghast-kayip-ruhlar-kasabasi.htm
10:39:32:885 | INFO | pool-2-thread-12 | | 2) https://www..com/amy-reeder-hadley/aptallarin-altini.htm
10:39:32:885 | INFO | pool-2-thread-12 | | 3) https://www..com/david-hine/zehirli-seker.htm
10:39:32:885 | INFO | pool-2-thread-12 | | 4) https://www..com/victor-hugo/sefiller-31.htm
10:39:32:885 | INFO | pool-2-thread-12 | | 5) https://www..com/johann-wolfgang-von-goethe/genc-wertherin-acilari-2.htm
10:39:32:885 | INFO | pool-2-thread-12 | | 6) https://www..com/charles-dickens/iki-sehrin-hikayesi-14.htm
10:39:32:885 | INFO | pool-2-thread-12 | | 7) https://www..com/recaizade-mahmut-ekrem/araba-sevdasi-16.htm
10:39:32:885 | INFO | pool-2-thread-12 | | 8) https://www..com/samipasazade-sezai/serguzest-15.htm
10:39:32:885 | INFO | pool-2-thread-12 | | 9) https://www..com/namik-kemal/intibah-17.htm
10:39:32:885 | INFO | pool-2-thread-12 | | 10) https://www..com/mehmet-rauf/eylul-20.htm
10:39:32:886 | INFO | pool-2-thread-12 | | MetaBooks Size: 10
10:39:32:887 | INFO | pool-2-thread-12 | | MetaBook: https://www..com/m-alice-legrows/bizenghast-kayip-ruhlar-kasabasi.htm
10:39:32:891 | INFO | pool-2-thread-12 | | MetaBook: https://www..com/amy-reeder-hadley/aptallarin-altini.htm
10:39:32:895 | INFO | pool-2-thread-12 | | MetaBook: https://www..com/david-hine/zehirli-seker.htm
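As you can see in the log, only 3 of the 10 metaBooks were processed by the collectBook method.
When I run the application on its own (without the other 4 processes), this does not happen and everything works fine. I do not know whether this is caused by an OS or JVM limitation or by a problem in my code.
I would be glad if you could help me solve this problem.
Thanks.
Edit:
I modified the collectBook method to deal with database locking: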
public Book collectBook(MetaBook metaBook) {
    if (sellerId == 0)
        sellerId = saveSellerInDbIfNotExists(seller);
    Book book = null;
    MDC.put("classname", logClassName);
    try {
        logger.info("MetaBook: " + metaBook);
        Book foundBook;
        synchronized (CollectorsUtil.getCollectors()) {
            foundBook = hibernateBookDao.find(metaBook.getLink());
        }
        if (foundBook != null) {
            book = new Book(metaBook);
            synchronized (CollectorsUtil.getCollectors()) {
                hibernateBookDao.updatePrice(book);
                foundBook = hibernateBookDao.find(metaBook.getLink());
            }
            book = foundBook;
        } else {
            book = bookDataExtractor.extractBookData(metaBook);
            if (book.isAlive()) {
                synchronized (CollectorsUtil.getCollectors()) {
                    book.setSellerId(sellerId);
                    hibernateBookDao.saveOrUpdate(book);
                }
                logger.info("Book Name: " + book.getName());
            }
        }
    } catch (Exception e) {
        logger.error("Error", e);
    }
    MDC.remove("classname");
    return book;
}

private long saveSellerInDbIfNotExists(Seller seller) {
    Seller foundSeller;
    synchronized (CollectorsUtil.getCollectors()) {
        foundSeller = hibernateSellerDao.find(seller.getLink());
        if (foundSeller == null || foundSeller.equals(new Seller())) {
            hibernateSellerDao.save(seller);
        }
        foundSeller = hibernateSellerDao.find(seller.getLink());
    }
    this.seller = foundSeller;
    return foundSeller.getId();
}
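One caveat about this edit: as far as I understand, `synchronized` only coordinates threads inside one JVM, so it cannot serialize the 5 separate processes against each other. If cross-process exclusion turns out to be needed, a file lock is one option. The following is only a sketch under that assumption; the class `CrossProcessLock`, the method `withLock` and the lock file name are hypothetical, and relying on the database's own transactions may well be the better approach.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.function.Supplier;

// Hypothetical helper; not part of the crawler code above.
final class CrossProcessLock {

    // FileLock is held on behalf of the whole JVM, so threads inside this
    // process still need an ordinary monitor to take turns.
    private static final Object IN_JVM_MONITOR = new Object();

    // Runs the action while holding an exclusive lock on lockFile. Other
    // processes using the same file wait in channel.lock() until it is free.
    static <T> T withLock(Path lockFile, Supplier<T> action) {
        synchronized (IN_JVM_MONITOR) {
            try (FileChannel channel = FileChannel.open(lockFile,
                         StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                 FileLock lock = channel.lock()) {
                return action.get();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        // Assumed location for the lock file; any path shared by all processes works.
        Path lockFile = Paths.get(System.getProperty("java.io.tmpdir"), "collector-demo.lock");
        String result = withLock(lockFile, () -> "critical section done");
        System.out.println(result);
    }
}
```

In collectBook, the DAO calls that are currently wrapped in `synchronized (CollectorsUtil.getCollectors())` would go inside the `Supplier` passed to `withLock`.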