Question

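I am trying to extract some information from a website using scrapy (scrapy -V: 1.5, Python -V: 3.5). Its robots.txt allows everything. At first the site would not let me scrape any link and returned INFO: Ignoring response <403 https://www.xxxx.com>: HTTP status code is not handled or not allowed. Setting a user agent solved that, but it worked for barely 150 links, so I switched to TOR together with a user agent. Now the problem is that a few links are still being blocked: [scrapy.extensions.logstats] INFO: Crawled 1120 pages (at 5 pages/min), scraped 1055 items (at 4 items/min). Any help would be highly appreciated. TIA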

# settings.py
DOWNLOAD_DELAY = 5
COOKIES_ENABLED = False

Answer 1

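It seems that using Tor already gets you around the ban, but (of course) some percentage of the Tor exit nodes are (already) banned, and that share may grow as you keep running your scraper.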

So what you additionally need to do is retry those banned requests.

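If you are doing complicated POST requests, you will need to build your own retry logic. But it looks to me like you are just following links, so activating retries for response code 403 should help you out: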

# settings.py

RETRY_HTTP_CODES = [403] # add here any status codes you receive when getting banned
RETRY_TIMES = 10 # depending on the percentage of blocked tor exit IPs and 
          # the amount of links you crawl you might have to increase this number
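As a rough way to pick RETRY_TIMES, here is a minimal sketch that estimates how likely a request is to use up all of its attempts. It assumes every attempt goes out through an independently chosen exit node and that a fixed fraction of exits is blocked; both are simplifications, and the function names are made up for illustration. RETRY_TIMES counts retries on top of the first download, so a request gets RETRY_TIMES + 1 attempts in total.

```python
import math


def exhaust_probability(blocked_fraction: float, retry_times: int) -> float:
    """Chance that the first attempt and every retry all hit a blocked exit.

    Assumes each attempt uses an independently chosen exit node.
    """
    return blocked_fraction ** (retry_times + 1)


def retries_needed(blocked_fraction: float, max_failure: float) -> int:
    """Smallest RETRY_TIMES that keeps exhaust_probability at or below max_failure."""
    if not 0 < blocked_fraction < 1 or not 0 < max_failure < 1:
        raise ValueError("both arguments must be strictly between 0 and 1")
    # Total attempts n must satisfy blocked_fraction ** n <= max_failure.
    attempts = math.ceil(math.log(max_failure) / math.log(blocked_fraction))
    # The first download is not a retry, so subtract it.
    return max(attempts - 1, 0)


if __name__ == "__main__":
    # Illustrative numbers only, not measured on any real site.
    print(retries_needed(0.5, 0.001))
    print(exhaust_probability(0.5, 10))
```

If consecutive retries tend to reuse the same Tor circuit, the attempts are not independent and you would need more retries than this estimate suggests.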

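PS: And stay away from COOKIES_ENABLED = True. It can get you banned very fast, because it allows the site to track your crawler across all IPs. If the site needs you to send cookies (most don't), you will have to use a separate cookie jar per IP.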
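If you do end up needing cookies, one way to keep a separate jar per IP is Scrapy's cookiejar request meta key, which only takes effect with COOKIES_ENABLED = True. The sketch below assumes each Tor circuit is reachable through its own local HTTP proxy; the PROXIES mapping, its addresses, and the request_meta helper are hypothetical and not taken from your setup.

```python
from typing import Any, Dict

# Hypothetical mapping: one local HTTP proxy per Tor circuit.
# The names and addresses are placeholders for illustration.
PROXIES = {
    "circuit-a": "http://127.0.0.1:8118",
    "circuit-b": "http://127.0.0.1:8119",
}


def request_meta(identity: str) -> Dict[str, Any]:
    """Build request meta that ties one cookie jar to one proxy.

    "cookiejar" and "proxy" are meta keys read by Scrapy's cookie and
    HTTP proxy middlewares; reusing the identity as the jar key means
    cookies are never shared between circuits.
    """
    return {"cookiejar": identity, "proxy": PROXIES[identity]}


# Inside a spider this would be used roughly like so:
#   yield scrapy.Request(url, meta=request_meta("circuit-a"), callback=self.parse)
# The cookiejar key is not sticky, so follow-up requests must pass it along:
#   meta = {k: response.meta[k] for k in ("cookiejar", "proxy")}
#   yield response.follow(href, meta=meta, callback=self.parse)
```

A retried request keeps its original meta, so with this layout a retry would go out through the same proxy; rotating the circuit on retry would need extra logic of your own.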

scrapy: banned by the target website
