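How do I crawl an entire website using Headless Chrome Crawler?

Asked: 2018-10-18 20:39:00

Tags: node.js async-await web-crawler puppeteer google-chrome-headless

I have been looking into Chrome Puppeteer to build a crawler for learning purposes. That is how I found Headless Chrome Crawler, which is a nice Node package. However, I am having trouble using it to crawl an entire website, and I could not find how to do this in the documentation. I want to get all the links from a page and pass them into an array so they can be crawled in turn. Here is my code so far: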

const HCCrawler = require('headless-chrome-crawler');

(async() => {
  var urlsToVisit = [];
  var visitedURLs =[];
  var title;
  const crawler = await HCCrawler.launch({
  // Function to be evaluated in browsers
  evaluatePage: (() => ({
    title: $('title').text(),
    link: $('a').attr('href'),
    linkslen: $('a').length,
})),
// Function to be called with evaluated results from browsers
onSuccess: (result => {
  console.log(result.links)
  title = result.result.title;
  result.result.link.map((link)=>{
    urlsToVisit.push(result.result.link)
  })
}),
});



await crawler.queue({
  url: 'http://books.toscrape.com',
  maxDepth :0
});
await crawler.queue({
  url: [urlsToVisit],
  maxDepth :0
});

await crawler.onIdle(); // Resolved when no queue is left
await crawler.close(); // Close the crawler
})();

So, what should I do?

My log:

(node:4909) UnhandledPromiseRejectionWarning: TypeError [ERR_INVALID_ARG_TYPE]: The "url" argument must be of type string. Received type object
    at Url.parse (url.js:143:11)
    at urlParse (url.js:137:13)
    at Promise.all.map (/home/ubuntu/workspace/node_modules/headless-chrome-crawler/lib/hccrawler.js:167:27)
    at arrayMap (/home/ubuntu/workspace/node_modules/headless-chrome-crawler/node_modules/lodash/_arrayMap.js:16:21)
    at map (/home/ubuntu/workspace/node_modules/headless-chrome-crawler/node_modules/lodash/map.js:50:10)
    at HCCrawler.queue (/home/ubuntu/workspace/node_modules/headless-chrome-crawler/lib/hccrawler.js:157:23)
    at HCCrawler.<anonymous> (/home/ubuntu/workspace/node_modules/headless-chrome-crawler/lib/helper.js:177:23)
    at /home/ubuntu/workspace/crawlertop.js:30:17
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:118:7)
(node:4909) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 3)
(node:4909) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
[ 'http://books.toscrape.com/index.html',
  'http://books.toscrape.com/catalogue/category/books_1/index.html',
  'http://books.toscrape.com/catalogue/category/books/travel_2/index.html',
  'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
  'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html',
  'http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html',
  'http://books.toscrape.com/catalogue/category/books/classics_6/index.html',
  'http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html',
  'http://books.toscrape.com/catalogue/category/books/romance_8/index.html',
  'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
  'http://books.toscrape.com/catalogue/category/books/fiction_10/index.html',
  'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
  'http://books.toscrape.com/catalogue/category/books/religion_12/index.html',
  'http://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html',
  'http://books.toscrape.com/catalogue/category/books/music_14/index.html',
  'http://books.toscrape.com/catalogue/category/books/default_15/index.html',
  'http://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html',
  'http://books.toscrape.com/catalogue/category/books/sports-and-games_17/index.html',
  'http://books.toscrape.com/catalogue/category/books/add-a-comment_18/index.html',
  'http://books.toscrape.com/catalogue/category/books/fantasy_19/index.html',
  'http://books.toscrape.com/catalogue/category/books/new-adult_20/index.html',
  'http://books.toscrape.com/catalogue/category/books/young-adult_21/index.html',
  'http://books.toscrape.com/catalogue/category/books/science_22/index.html',
  'http://books.toscrape.com/catalogue/category/books/poetry_23/index.html',
  'http://books.toscrape.com/catalogue/category/books/paranormal_24/index.html',
  'http://books.toscrape.com/catalogue/category/books/art_25/index.html',
  'http://books.toscrape.com/catalogue/category/books/psychology_26/index.html',
  'http://books.toscrape.com/catalogue/category/books/autobiography_27/index.html',
  'http://books.toscrape.com/catalogue/category/books/parenting_28/index.html',
  'http://books.toscrape.com/catalogue/category/books/adult-fiction_29/index.html',
  'http://books.toscrape.com/catalogue/category/books/humor_30/index.html',
  'http://books.toscrape.com/catalogue/category/books/horror_31/index.html',
  'http://books.toscrape.com/catalogue/category/books/history_32/index.html',
  'http://books.toscrape.com/catalogue/category/books/food-and-drink_33/index.html',
  'http://books.toscrape.com/catalogue/category/books/christian-fiction_34/index.html',
  'http://books.toscrape.com/catalogue/category/books/business_35/index.html',
  'http://books.toscrape.com/catalogue/category/books/biography_36/index.html',
  'http://books.toscrape.com/catalogue/category/books/thriller_37/index.html',
  'http://books.toscrape.com/catalogue/category/books/contemporary_38/index.html',
  'http://books.toscrape.com/catalogue/category/books/spirituality_39/index.html',
  'http://books.toscrape.com/catalogue/category/books/academic_40/index.html',
  'http://books.toscrape.com/catalogue/category/books/self-help_41/index.html',
  'http://books.toscrape.com/catalogue/category/books/historical_42/index.html',
  'http://books.toscrape.com/catalogue/category/books/christian_43/index.html',
  'http://books.toscrape.com/catalogue/category/books/suspense_44/index.html',
  'http://books.toscrape.com/catalogue/category/books/short-stories_45/index.html',
  'http://books.toscrape.com/catalogue/category/books/novels_46/index.html',
  'http://books.toscrape.com/catalogue/category/books/health_47/index.html',
  'http://books.toscrape.com/catalogue/category/books/politics_48/index.html',
  'http://books.toscrape.com/catalogue/category/books/cultural_49/index.html',
  'http://books.toscrape.com/catalogue/category/books/erotica_50/index.html',
  'http://books.toscrape.com/catalogue/category/books/crime_51/index.html',
  'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
  'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
  'http://books.toscrape.com/catalogue/soumission_998/index.html',
  'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
  'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
  'http://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
  'http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
  'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
  'http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
  'http://books.toscrape.com/catalogue/the-black-maria_991/index.html',
  'http://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html',
  'http://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
  'http://books.toscrape.com/catalogue/set-me-free_988/index.html',
  'http://books.toscrape.com/catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html',
  'http://books.toscrape.com/catalogue/rip-it-up-and-start-again_986/index.html',
  'http://books.toscrape.com/catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html',
  'http://books.toscrape.com/catalogue/olio_984/index.html',
  'http://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html',
  'http://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html',
  'http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html',
  'http://books.toscrape.com/catalogue/page-2.html' ]
(node:4909) UnhandledPromiseRejectionWarning: Error: Protocol error: Connection closed. Most likely the page has been closed.
    at assert (/home/ubuntu/workspace/node_modules/headless-chrome-crawler/node_modules/puppeteer/lib/helper.js:251:11)
    at Page.close (/home/ubuntu/workspace/node_modules/headless-chrome-crawler/node_modules/puppeteer/lib/Page.js:883:5)
    at Crawler.close (/home/ubuntu/workspace/node_modules/headless-chrome-crawler/lib/crawler.js:80:22)
    at Crawler.<anonymous> (/home/ubuntu/workspace/node_modules/headless-chrome-crawler/lib/helper.js:177:23)
    at HCCrawler._request (/home/ubuntu/workspace/node_modules/headless-chrome-crawler/lib/hccrawler.js:349:21)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:118:7)
(node:4909) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 9)

2 answers:

Answer 0: (score: 1)

Your code has several problems. I will go through them one by one.

Problem: Wrong code in onSuccess

  • You referenced result.result.link, but the collected links are in links, so the path should be result.links instead.
  • The map callback does not use link; you push the same data into urlsToVisit over and over.
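
To make the two onSuccess issues concrete, here is a small stand-alone sketch. The result object is a hand-written stand-in (an assumption, not real crawler output) shaped like the fields used in the question: result.links as an array and result.result.link as a single string, since jQuery's .attr('href') returns only the first match.

```javascript
// Hand-written stand-in for the object passed to onSuccess (assumed shape).
const result = {
  links: [
    'http://books.toscrape.com/index.html',
    'http://books.toscrape.com/catalogue/page-2.html',
  ],
  result: { title: 'All products', link: 'index.html', linkslen: 2 },
};

// Buggy pattern from the question: the callback ignores its argument
// and pushes the same outer value on every iteration.
const buggy = [];
result.links.map(() => {
  buggy.push(result.result.link);
});

// Fixed pattern: iterate result.links and push each element itself.
const urlsToVisit = [];
for (const link of result.links) {
  urlsToVisit.push(link);
}

console.log(buggy);       // the same string repeated once per link
console.log(urlsToVisit); // one entry per distinct link
```

Note that in the question .map was called on result.result.link itself, which is a string and has no .map method, so that call would throw before anything was pushed.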

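Problem: Wrong logic for continued crawling

You need to crawl in two parts:

  • one part visits the target page and collects the links,
  • the other part visits the collected links.

You need to think about them separately.

Also, whenever you call .queue it runs immediately, but at that point urlsToVisit has not been filled yet. It may not contain any data at all.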
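
The timing issue can be reproduced without the crawler. In this sketch a timer stands in for the page load that eventually triggers onSuccess; the timer and the variable names are assumptions made for illustration only.

```javascript
const urlsToVisit = [];

// Stand-in for the crawler calling onSuccess later, once a page has loaded.
const pageDone = new Promise((resolve) => setTimeout(resolve, 10)).then(() => {
  urlsToVisit.push('http://books.toscrape.com/catalogue/page-2.html');
});

// This line runs right away, before the callback above has fired.
// It corresponds to the moment the second crawler.queue() call ran in the question.
const countAtQueueTime = urlsToVisit.length;
console.log('at queue time:', countAtQueueTime);

pageDone.then(() => {
  console.log('after the page finished:', urlsToVisit.length);
});
```

The array only has content after the callback has run, which is why new links are better queued from inside onSuccess.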

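Solution

  • Queue the links recursively. Whenever a page finishes crawling, queue its new links back into the crawler.
  • Also make sure to use onError to catch errors.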

Here is working code:

const HCCrawler = require("headless-chrome-crawler");

(async () => {
  var visitedURLs = [];
  const crawler = await HCCrawler.launch({
    // Function to be evaluated in browsers
    evaluatePage: () => ({
      title: $("title").text(),
      link: $("a").attr("href"),
      linkslen: $("a").length
    }),
    // Function to be called with evaluated results from browsers
    onSuccess: async result => {
      // save them as you wish
      visitedURLs.push(result.options.url);
      // show some progress
      console.log(visitedURLs.length, result.options.url);
      // queue new links one by one asynchronously
      for (const link of result.links) {
        await crawler.queue({ url: link, maxDepth: 0 });
      }
    },
    // catch all errors
    onError: error => {
      console.log(error);
    }
  });

  await crawler.queue({ url: "http://books.toscrape.com", maxDepth: 0 });
  await crawler.onIdle(); // Resolved when no queue is left
  await crawler.close(); // Close the crawler
})();

Problem: This solution does not solve my problem

You will soon realize that this is not crawling the way you expected, because the package crawls everything in its own way.

That is why the package has the maxDepth option: with it, the crawler can traverse the whole website recursively on its own. Read their documentation and try to understand it bit by bit.
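
As a rough illustration of what a depth limit does during recursive traversal, here is a sketch over a hypothetical in-memory site. The site map, the crawl function, and the convention that the start page counts as depth 1 are all assumptions of this example; it does not show how the package is implemented.

```javascript
// Hypothetical in-memory "site": each page maps to the links found on it.
const site = {
  '/': ['/a', '/b'],
  '/a': ['/a1', '/'],
  '/b': [],
  '/a1': ['/deep'],
  '/deep': [],
};

// Visit pages breadth-first, following links until maxDepth is reached.
function crawl(start, maxDepth) {
  const visited = new Set([start]);
  const queue = [{ url: start, depth: 1 }];
  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    if (depth >= maxDepth) continue; // do not follow links past the limit
    for (const link of site[url] || []) {
      if (!visited.has(link)) {
        visited.add(link); // skip pages that were already queued
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return [...visited];
}

console.log(crawl('/', 1)); // only the start page
console.log(crawl('/', 4)); // every page reachable within four levels
```

With a small limit only the start page is visited; raising it lets the traversal reach deeper pages without any manual queueing.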

Above all, break your code into parts and solve one problem at a time.

Feel free to explore the other options in the documentation.

Answer 1: (score: 0)

You are getting the error UnhandledPromiseRejectionWarning: TypeError [ERR_INVALID_ARG_TYPE]: The "url" argument must be of type string. Received type object

The error says that "url" is of type object rather than string. The problem is here:

await crawler.queue({
  url: [urlsToVisit], // This is an array not a string
  maxDepth :0
});
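
A quick check in plain JavaScript shows why the value is reported as an object. The two URLs below are taken from the log in the question; the variable name wrapped is made up for this sketch.

```javascript
const urlsToVisit = [
  'http://books.toscrape.com/index.html',
  'http://books.toscrape.com/catalogue/page-2.html',
];

const wrapped = [urlsToVisit]; // what the question passed as `url`

console.log(wrapped.length);            // a single element...
console.log(Array.isArray(wrapped[0])); // ...and that element is the whole array
console.log(typeof wrapped[0]);         // "object", matching "Received type object"

// Each entry of the inner array is a plain string, which is what a URL parser expects.
console.log(urlsToVisit.every((u) => typeof u === 'string'));
```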

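You will need a for loop to go through each URL in the urlsToVisit array, like this:

// for...of keeps each await inside the enclosing async function;
// await is not allowed inside a plain forEach callback.
for (const u of urlsToVisit) {
  await crawler.queue({
    url: u,
    maxDepth: 0
  });
}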

Your log also shows UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 3). Wrap the awaited calls in a try/catch block so that the rejection is handled instead of surfacing as this warning.
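
A minimal sketch of that pattern follows. The queueUrl function is a hypothetical stand-in for a crawler call that rejects on a non-string URL; in real code the awaited call would be crawler.queue(...).

```javascript
// Hypothetical stand-in for a crawler call that rejects on bad input.
async function queueUrl(url) {
  if (typeof url !== 'string') {
    throw new TypeError('The "url" argument must be of type string');
  }
  return url;
}

async function main() {
  const failures = [];
  const candidates = ['http://books.toscrape.com', ['not', 'a', 'string']];
  for (const candidate of candidates) {
    try {
      await queueUrl(candidate);
    } catch (err) {
      // Handle the rejection here instead of letting it go unhandled.
      failures.push(err.message);
    }
  }
  return failures;
}

const done = main();
done.then((failures) => console.log(failures));
```

Because the rejection is caught inside the loop, one bad URL is recorded and the remaining URLs are still processed.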