I am scraping text from https://www.dailynews.co.th, and here is my problem.
My spider worked more or less fine at first and crawled about 4,000 pages.
Then it started raising a large number of TimeoutErrors from almost every URL.
2018-09-28 20:05:00 [scrapy.extensions.logstats] INFO: Crawled 4161 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-28 20:06:06 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.dailynews.co.th/tags/When%20Will%20You%20Marry>
Traceback (most recent call last):
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/Twisted-18.7.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/Twisted-18.7.0-py2.7-linux-x86_64.egg/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/Twisted-18.7.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 351, in _cb_timeout
    raise TimeoutError("Getting %s took longer than %s seconds." % (url, timeout))
TimeoutError: User timeout caused connection failure: Getting https://www.dailynews.co.th/tags/When%20Will%20You%20Marry took longer than 5.0 seconds..
This is my second attempt. I reduced CONCURRENT_REQUESTS from 32 to 16, AUTOTHROTTLE_TARGET_CONCURRENCY from 32.0 to 4.0, and DOWNLOAD_TIMEOUT from 15 to 5. That did not solve the problem, but I got more pages than in the first attempt (from about 1,000 up to about 4,000).
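To be concrete, the values I changed for the second attempt look roughly like this in my settings.py (CONCURRENT_REQUESTS is not part of the settings excerpt further down, so I am listing it here as well):

# Second attempt: values changed from the first run
CONCURRENT_REQUESTS = 16                 # was 32
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0    # was 32.0
DOWNLOAD_TIMEOUT = 5                     # was 15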
I also tried scrapy shell on the failed URLs (while my spider was still running) and got 200 responses, which means the connection itself was fine.
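For example, this is roughly what I ran against one of the failing URLs (the tag URL from the traceback above):

$ scrapy shell 'https://www.dailynews.co.th/tags/When%20Will%20You%20Marry'
...
>>> response.status
200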
I wonder whether I have been banned or whether something else is going on. Could anyone give me a clue? Thanks a lot.
FYI, here is my settings file, followed by my spider code.
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'protocol.middlewares.RotateUserAgentMiddleware': 110,
    'protocol.middlewares.MaximumAbsoluteDepthFilterMiddleware': 80,
    'protocol.middlewares.ProxyMiddleware': 543,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 0.1
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = False
# Amount of time (in seconds) the downloader will wait before timing out
DOWNLOAD_TIMEOUT = 5
# Maximum depth that will be allowed to crawl
DEPTH_LIMIT = 100
# A positive DEPTH_PRIORITY together with the FIFO queues below
# gives a breadth-first crawl order
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'