I am scraping text from https://www.dailynews.co.th, and here is my problem.
My spider worked more or less fine at first and crawled about 4,000 pages.
Then it started raising a large number of TimeoutErrors from almost every URL.
2018-09-28 20:05:00 [scrapy.extensions.logstats] INFO: Crawled 4161 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-28 20:06:06 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.dailynews.co.th/tags/When%20Will%20You%20Marry>
Traceback (most recent call last):
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/Twisted-18.7.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/Twisted-18.7.0-py2.7-linux-x86_64.egg/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/Twisted-18.7.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 351, in _cb_timeout
    raise TimeoutError("Getting %s took longer than %s seconds." % (url, timeout))
TimeoutError: User timeout caused connection failure: Getting https://www.dailynews.co.th/tags/When%20Will%20You%20Marry took longer than 5.0 seconds..
This is my second attempt. I reduced CONCURRENT_REQUESTS from 32 to 16, AUTOTHROTTLE_TARGET_CONCURRENCY from 32.0 to 4.0, and DOWNLOAD_TIMEOUT from 15 to 5. That did not solve the problem, but I got more pages than in the first attempt (from about 1,000 up to about 4,000).
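To be concrete, the values I changed for the second attempt look roughly like this in my settings.py (CONCURRENT_REQUESTS is not part of the settings excerpt further down, so I am listing it here as well):

# Second attempt: values changed from the first run
CONCURRENT_REQUESTS = 16                 # was 32
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0    # was 32.0
DOWNLOAD_TIMEOUT = 5                     # was 15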
I also tried scrapy shell on the failed URLs (while my spider was still running) and got 200 responses, which means the connection itself was fine.
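For example, this is roughly what I ran against one of the failing URLs (the tag URL from the traceback above):

$ scrapy shell 'https://www.dailynews.co.th/tags/When%20Will%20You%20Marry'
...
>>> response.status
200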
I wonder whether I have been banned or whether something else is going on. Could anyone give me a clue? Thanks a lot.
FYI, here is my settings file, followed by my spider code.
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'protocol.middlewares.RotateUserAgentMiddleware': 110,
    'protocol.middlewares.MaximumAbsoluteDepthFilterMiddleware': 80,
    'protocol.middlewares.ProxyMiddleware': 543,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 0.1
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = False
# Amount of time (in seconds) the downloader will wait before timing out
DOWNLOAD_TIMEOUT = 5
# Maximum depth that will be allowed to crawl
DEPTH_LIMIT = 100
# A positive DEPTH_PRIORITY together with the FIFO queues below
# gives a breadth-first crawl order
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'