I am running Nutch in local mode on a server with 64 GB of RAM and 32 processors. Suppose I have a single URL in the seed list and the following configuration in nutch-site.xml:
fetcher.threads.fetch=16
fetcher.threads.per.queue=2
fetcher.max.crawl.delay=120
fetcher.queue.depth.multiplier=150
fetcher.queue.mode=byHost
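If -topN is set to 1000, how many requests will be made to that URL during the fetch phase?
Will multiple map tasks be created for the Fetcher? My understanding is that a single map task is created, regardless of how many URLs need to be fetched from the fetch list.
I tried searching for the relationship between fetcher.threads.per.queue and fetcher.threads.fetch, but could not find anything definitive.
I have also added logs from the fetch phase: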
FetcherThread INFO fetcher.FetcherThread (277) - fetching http://investors.te.com/news-releases/press-release-details/2018/TE-Connectivity-announces-fourth-quarter-and-full-year-results-for-fiscal-year-2018/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching http://investors.te.com/shareholder-info/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/news-releases/press-release-details/2019/TE-Connectivity-to-hold-annual-general-meeting-of-shareholders-on-March-13-2019/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/request-information/default.aspx (queue crawl delay=2000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/email-alerts/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/site-map/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/rss/PressRelease.aspx?LanguageId=1&CategoryWorkflowId=00000000-0000-0000-0000-000000000000&tags= (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/stock-information/quote-and-chart/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/overview/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/investor-resources/investor-contacts/default.aspx (queue crawl delay=10000ms)
FetcherThread INFO fetcher.FetcherThread (277) - fetching https://investors.te.com/js/mobileRedirect.js (queue crawl delay=10000ms)
Answer 0 (score: 0)
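There will be only one request, because there is only one URL. With fetcher.threads.per.queue=2, if the fetch list contains two URLs from a single host, there may be two simultaneous requests to that host. A large value for fetcher.threads.fetch only makes sense if you are crawling a large number of hosts, or if you are crawling your own local, fast-responding web server. In the latter case, fetcher.threads.per.queue should be equal or close to fetcher.threads.fetch. If it is not your own server and you have not been explicitly allowed to do so, you should always keep the default value of fetcher.threads.per.queue, which is a single thread (=1): no parallel connections to the same host, and a guaranteed delay between successive requests.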
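For a host you do not operate, the advice above could be written in nutch-site.xml roughly as follows. This is only a sketch: the property names and the values 16 and byHost come from the question, and 1 is the default recommended above. The `<configuration>`/`<property>` wrapper is assumed here, since the question lists the settings only as key=value pairs.

```xml
<!-- Sketch of a nutch-site.xml fragment; values are illustrative, not tuned recommendations. -->
<configuration>
  <!-- Total number of fetcher threads, shared across all host queues. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>16</value>
  </property>
  <!-- At most one thread per queue: no parallel connections to the same host. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
  </property>
  <!-- One queue per host, as in the question. -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>
  </property>
</configuration>
```

With a single host in the fetch list, a setup like this would leave most of the 16 threads idle, which matches the point that a high fetcher.threads.fetch pays off mainly when many hosts are crawled.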