Question

I am trying to crawl a website in a very basic way, but Scrapy is not crawling all the links. The scenario is as follows:

main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains a link to b_page.html
a2_page.html -> contains a link to c_page.html
b1_page.html -> contains a link to a_page.html
b2_page.html -> contains a link to c_page.html
c1_page.html -> contains a link to a_page.html
c2_page.html -> contains a link to main_page.html

I am using the following rule in my CrawlSpider:

Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)

But the crawl log looks like this:

DEBUG: Crawled (200) <GET http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <GET http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)

It is not crawling all the pages.

Note: I have set up the crawl to run in breadth-first order (BFO), as described in the Scrapy documentation.

What am I missing?

Answer 1

Scrapy filters out all duplicate requests by default.

You can circumvent this by using (example):

yield Request(url="http://test.com", callback=self.callback, dont_filter=True)

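dont_filter (boolean) - indicates that this request should not be filtered by the scheduler. Use it when you want to perform an identical request multiple times, ignoring the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.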

See also the Request object documentation.
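To make the effect of duplicate filtering concrete, here is a toy model of it in plain Python, run against the link graph from the question. It is only a sketch: the `crawl` function and the `seen` set are illustrative assumptions, not Scrapy's actual scheduler or dupefilter code.

```python
from collections import deque

# Link graph copied from the question: page -> pages it links to.
LINKS = {
    "main_page.html": ["a_page.html", "b_page.html", "c_page.html"],
    "a_page.html": ["a1_page.html", "a2_page.html"],
    "b_page.html": ["b1_page.html", "b2_page.html"],
    "c_page.html": ["c1_page.html", "c2_page.html"],
    "a1_page.html": ["b_page.html"],
    "a2_page.html": ["c_page.html"],
    "b1_page.html": ["a_page.html"],
    "b2_page.html": ["c_page.html"],
    "c1_page.html": ["a_page.html"],
    "c2_page.html": ["main_page.html"],
}


def crawl(start, links):
    """Breadth-first crawl that drops any URL it has already scheduled."""
    seen = {start}          # URLs already scheduled (the "duplicate filter")
    queue = deque([start])  # FIFO queue gives breadth-first order
    crawled, filtered = [], []
    while queue:
        page = queue.popleft()
        crawled.append(page)
        for target in links.get(page, []):
            if target in seen:
                # Repeat link: recorded as filtered, never fetched again.
                filtered.append((page, target))
                continue
            seen.add(target)
            queue.append(target)
    return crawled, filtered


crawled, filtered = crawl("main_page.html", LINKS)
print(len(crawled), "pages crawled")
print(len(filtered), "duplicate links filtered")
```

In this model each page is still fetched once and only the repeated links are dropped. Without the `seen` check the loop would never terminate on this graph, because the pages link back to each other; that is the crawling loop the dont_filter description warns about.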

Answer 2

I had a similar problem today, although I was using a custom spider. It turned out that the website was limiting my crawl because my user agent was scrappy-bot.

Try changing your user agent and crawling again. Change it to, for example, that of a well-known browser.

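Another thing you might want to try is adding a delay. Some websites block scraping if the time between requests is too short. Try setting DOWNLOAD_DELAY to 2 and see if that helps.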

More information about DOWNLOAD_DELAY: http://doc.scrapy.org/en/0.14/topics/settings.html
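Both suggestions can be tried from the project's settings.py. The fragment below is a minimal sketch: USER_AGENT and DOWNLOAD_DELAY are standard Scrapy settings, but the browser string is only a hypothetical example value, not something taken from this answer.

```python
# settings.py (sketch)

# Hypothetical browser-like user agent string; replace it with one that
# suits your use case.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Wait roughly 2 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2
```

If the missing pages appear after this change, the site was probably throttling or blocking the crawler rather than Scrapy skipping links.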

Answer 3

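Maybe many of the URLs are duplicates. Scrapy skips duplicates because fetching them again is inefficient. From your explanation, since you use a rule that follows URLs, there are of course a lot of duplicates.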

If you want to be sure and see proof in the log, add this to your settings.py:

DUPEFILTER_DEBUG = True

You will see lines like this in the log:

2016-09-20 17:08:47 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.example.org/example.html>
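Once those lines are in the log, a few lines of Python can list which URLs were dropped. This is a rough sketch that assumes the message format of the log line shown above; `filtered_urls` is a hypothetical helper, and real log lines may carry extra trailing text.

```python
import re

# Matches the "Filtered duplicate request" message and captures the
# HTTP method and the URL inside the angle brackets.
DUPLICATE_RE = re.compile(r"Filtered duplicate request: <(\w+) (\S+)>")


def filtered_urls(log_text):
    """Return the URLs of all requests reported as filtered duplicates."""
    return [match.group(2) for match in DUPLICATE_RE.finditer(log_text)]


sample = (
    "2016-09-20 17:08:47 [scrapy] DEBUG: Filtered duplicate request: "
    "<GET http://www.example.org/example.html>\n"
)
print(filtered_urls(sample))
```

Comparing this list with the pages you expected to be crawled shows whether a missing page was filtered as a duplicate or never requested at all.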

Scrapy is not crawling all the pages
