I'm using Scrapy on Scrapinghub to scrape thousands of websites. When crawling a single website, request durations are very short (< 100 ms).
But I also have a spider responsible for "validating" roughly 10k URLs (I'm testing a bunch of different domains, with and without www.); all it does is fetch the front page and yield the status, whether a 200 or a redirect.
I've noticed that when running this spider several times in a row, the results are inconsistent (the numbers of items and requests differ).
Looking at the request logs, I can see the request durations gradually growing, then falling back to smaller numbers, then climbing even higher, until user timeouts are eventually triggered on some URLs.
The CONCURRENT_REQUESTS value I normally use is greater than 100 (I've tried 100, 200, 500 and 1000).
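For reference, these are the two concurrency knobs in play (a settings sketch, not my full settings.py, which is further down); since nearly every URL here is a distinct domain, the global cap is the binding limit:

# sketch: with ~10k distinct hosts, CONCURRENT_REQUESTS dominates, because
# CONCURRENT_REQUESTS_PER_DOMAIN (Scrapy's default is 8) rarely kicks in
CONCURRENT_REQUESTS = 500
CONCURRENT_REQUESTS_PER_DOMAIN = 8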
Here is the duration log. There are no timeouts here because there are only 100 URLs, but I need to run this validation on 10k URLs, and this duration instability is a problem:
{"time": 1535517660373, "duration": 26, "status": 400}
{"time": 1535517661582, "duration": 26, "status": 400}
{"time": 1535517663724, "duration": 26, "status": 400}
{"time": 1535517663897, "duration": 26, "status": 400}
{"time": 1535517665046, "duration": 46, "status": 200}
{"time": 1535517657573, "duration": 50, "status": 200}
{"time": 1535517657615, "duration": 83, "status": 200}
{"time": 1535517657616, "duration": 85, "status": 200}
{"time": 1535517657822, "duration": 112, "status": 200}
{"time": 1535517657831, "duration": 112, "status": 200}
{"time": 1535517657816, "duration": 120, "status": 200}
{"time": 1535517657837, "duration": 121, "status": 200}
{"time": 1535517658470, "duration": 130, "status": 200}
{"time": 1535517663093, "duration": 135, "status": 302}
{"time": 1535517658133, "duration": 149, "status": 200}
{"time": 1535517657862, "duration": 153, "status": 200}
{"time": 1535517657933, "duration": 228, "status": 200}
{"time": 1535517658362, "duration": 230, "status": 200}
{"time": 1535517657946, "duration": 258, "status": 200}
{"time": 1535517657989, "duration": 269, "status": 200}
{"time": 1535517657967, "duration": 271, "status": 200}
{"time": 1535517658108, "duration": 389, "status": 200}
{"time": 1535517665893, "duration": 433, "status": 404}
{"time": 1535517658142, "duration": 467, "status": 200}
{"time": 1535517658350, "duration": 467, "status": 200}
{"time": 1535517668501, "duration": 526, "status": 200}
{"time": 1535517658216, "duration": 543, "status": 200}
{"time": 1535517658312, "duration": 670, "status": 200}
{"time": 1535517658342, "duration": 678, "status": 200}
{"time": 1535517658347, "duration": 679, "status": 200}
{"time": 1535517658291, "duration": 682, "status": 200}
{"time": 1535517658345, "duration": 684, "status": 200}
{"time": 1535517658310, "duration": 688, "status": 200}
{"time": 1535517658333, "duration": 688, "status": 200}
{"time": 1535517658336, "duration": 689, "status": 200}
{"time": 1535517658317, "duration": 690, "status": 200}
{"time": 1535517658314, "duration": 694, "status": 200}
{"time": 1535517658339, "duration": 696, "status": 200}
{"time": 1535517658319, "duration": 697, "status": 200}
{"time": 1535517658315, "duration": 701, "status": 200}
{"time": 1535517658349, "duration": 701, "status": 200}
{"time": 1535517658322, "duration": 703, "status": 200}
{"time": 1535517658327, "duration": 703, "status": 200}
{"time": 1535517658377, "duration": 704, "status": 200}
{"time": 1535517658309, "duration": 708, "status": 200}
{"time": 1535517658376, "duration": 710, "status": 200}
{"time": 1535517658374, "duration": 711, "status": 200}
{"time": 1535517658335, "duration": 717, "status": 200}
{"time": 1535517658344, "duration": 720, "status": 200}
{"time": 1535517658338, "duration": 728, "status": 200}
{"time": 1535517658372, "duration": 728, "status": 200}
{"time": 1535517658324, "duration": 732, "status": 200}
{"time": 1535517658360, "duration": 748, "status": 200}
{"time": 1535517658341, "duration": 753, "status": 200}
{"time": 1535517658396, "duration": 797, "status": 200}
{"time": 1535517658408, "duration": 801, "status": 200}
{"time": 1535517658529, "duration": 938, "status": 200}
{"time": 1535517658579, "duration": 994, "status": 200}
{"time": 1535517658607, "duration": 996, "status": 200}
{"time": 1535517658604, "duration": 1001, "status": 200}
{"time": 1535517658611, "duration": 1006, "status": 200}
{"time": 1535517658606, "duration": 1022, "status": 200}
{"time": 1535517658707, "duration": 1104, "status": 200}
{"time": 1535517658634, "duration": 1110, "status": 200}
{"time": 1535517658772, "duration": 1166, "status": 200}
{"time": 1535517658859, "duration": 1236, "status": 200}
{"time": 1535517658956, "duration": 1348, "status": 200}
{"time": 1535517659025, "duration": 1358, "status": 200}
{"time": 1535517658958, "duration": 1368, "status": 200}
{"time": 1535517658959, "duration": 1373, "status": 200}
{"time": 1535517658985, "duration": 1408, "status": 200}
{"time": 1535517658960, "duration": 1426, "status": 200}
{"time": 1535517659349, "duration": 1445, "status": 200}
{"time": 1535517659469, "duration": 1583, "status": 200}
{"time": 1535517659283, "duration": 1694, "status": 200}
{"time": 1535517659278, "duration": 1712, "status": 200}
{"time": 1535517659620, "duration": 2033, "status": 200}
{"time": 1535517660588, "duration": 2400, "status": 200}
{"time": 1535517660353, "duration": 2819, "status": 200}
{"time": 1535517660756, "duration": 3194, "status": 200}
{"time": 1535517660752, "duration": 3214, "status": 200}
{"time": 1535517661403, "duration": 3216, "status": 200}
{"time": 1535517660889, "duration": 3316, "status": 200}
{"time": 1535517661535, "duration": 3371, "status": 200}
{"time": 1535517661407, "duration": 3848, "status": 200}
{"time": 1535517661966, "duration": 4436, "status": 200}
{"time": 1535517662355, "duration": 4463, "status": 200}
{"time": 1535517662153, "duration": 4613, "status": 200}
{"time": 1535517662336, "duration": 4814, "status": 200}
{"time": 1535517664132, "duration": 6594, "status": 200}
{"time": 1535517681367, "duration": 23480, "status": 200}
{"time": 1535517683665, "duration": 26104, "status": 200}
{"time": 1535517685281, "duration": 27744, "status": 200}
{"time": 1535517691127, "duration": 33598, "status": 200}
{"time": 1535517692933, "duration": 35454, "status": 200}
{"time": 1535517693278, "duration": 35764, "status": 200}
{"time": 1535517693337, "duration": 35812, "status": 200}
{"time": 1535517693972, "duration": 36459, "status": 200}
{"time": 1535517694212, "duration": 36701, "status": 200}
{"time": 1535517694576, "duration": 37071, "status": 200}
My spider:
from scrapy.spiders import Spider
from scrapy import Request
import pkgutil

from ...utils.parse import parse
from ...utils.errback_httpbin import errback_httpbin


class QuotesSpider(Spider):
    name = "validation_2"
    rotate_user_agent = True

    def start_requests(self):
        urls = pkgutil.get_data("qwarx_spiders", "resources/urls_100.txt").decode('utf-8').splitlines()
        for url in urls:
            yield Request(url=url, callback=self.parse, errback=self.errback_httpbin)

    def parse(self, response):
        return parse(self, response)

    def errback_httpbin(self, failure):
        return errback_httpbin(self, failure)
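To quantify the run-to-run inconsistency, the stats Scrapy dumps at the end of each crawl can be compared directly. A minimal sketch for a local run (the CrawlerProcess usage is my assumption; on Scrapinghub the same counters appear in the job's stats):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
crawler = process.create_crawler(QuotesSpider)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

stats = crawler.stats.get_stats()
# these two numbers should match across runs if nothing is flaky
print(stats.get('item_scraped_count'), stats.get('downloader/request_count'))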
The parse method:
from ..items.broad import URL
from scrapy.exceptions import NotSupported


def getDomain(url):
    # strip the scheme, path, query string, port and a leading "www."
    # to get the bare domain
    spltAr = url.split("://")
    i = (0, 1)[len(spltAr) > 1]
    dm = spltAr[i].split("?")[0].split('/')[0].split(':')[0].lower()
    return dm.replace('www.', '')


def parse(self, response):
    item = URL()
    id = {}
    id['url'] = response.url
    id['domain'] = getDomain(response.url)
    try:
        id['title'] = response.xpath("//title/text()").extract_first()
        if id['title'] is not None:
            id['title'] = id['title'].strip()
    except (AttributeError, NotSupported):
        # non-HTML responses raise NotSupported on xpath()
        yield None
    meta_names = response.xpath("//meta/@name").extract()
    meta_properties = response.xpath("//meta/@property").extract()
    meta = {}
    content = {}
    if 'description' in meta_names:
        meta['description'] = response.xpath("//meta[@name='description']/@content").extract_first()
    elif 'og:description' in meta_properties:
        meta['description'] = response.xpath("//meta[@property='og:description']/@content").extract_first()
    else:
        meta['description'] = ''
    if 'og:image' in meta_names:
        meta['image'] = response.xpath("//meta[@name='og:image']/@content").extract_first()
    elif 'og:image' in meta_properties:
        meta['image'] = response.xpath("//meta[@property='og:image']/@content").extract_first()
    else:
        meta['image'] = ''
    content['p'] = response.xpath('//p/text()').extract_first()
    if content['p'] is not None:
        # keep the first 4 paragraphs, each truncated to 150 characters
        content['p'] = list(map(lambda x: x.strip()[:150], response.xpath('//p/text()').extract()))[:4]
    if 'redirect_urls' in response.meta:
        meta['redirect_urls'] = response.meta['redirect_urls']
    item['id'] = id
    item['content'] = content
    item['meta'] = meta
    yield item
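The URL item itself isn't shown; from the assignments above it only needs three fields, along these lines (a sketch — the real definition lives in items/broad.py):

import scrapy


class URL(scrapy.Item):
    id = scrapy.Field()       # url, domain, title
    content = scrapy.Field()  # first few <p> snippets
    meta = scrapy.Field()     # description, image, redirect_urls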
errback_httpbin:
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


def errback_httpbin(self, failure):
    # log all errback failures; in case you want to do something special
    # for some errors, you may need the failure's type
    self.logger.error(repr(failure))

    if failure.check(HttpError):
        # you can get the non-200 response from the failure
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)
    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)
    elif failure.check(TimeoutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)
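Since the errors are exactly what varies between runs, it could also help to tally them in the crawl stats; a hedged addition (the 'errback/...' stat keys are my own naming, not something Scrapy defines):

def errback_httpbin(self, failure):
    # count each failure type so the final stats dump shows, e.g.,
    # errback/DNSLookupError and errback/TimeoutError per run
    self.crawler.stats.inc_value('errback/%s' % failure.type.__name__)
    self.logger.error(repr(failure))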
settings.py:
SPIDER_MODULES = ['qwarx_spiders.spiders.broad', 'qwarx_spiders.spiders.custom', 'qwarx_spiders.spiders.validation']
NEWSPIDER_MODULE = 'qwarx_spiders.spiders'
SPIDER_MIDDLEWARES = {
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': True,
}
DOWNLOADER_MIDDLEWARES = {
'qwarx_spiders.middlewares.FilterDomainbyLimitMiddleware': 200,
'qwarx_spiders.middlewares.RotateUserAgentMiddleware': 110,
}
ITEM_PIPELINES = {
'qwarx_spiders.pipelines.DuplicatesPipeline': 300,
}
EXTENSIONS = {
'scrapy_dotpersistence.DotScrapyPersistence': 0
}
BOT_NAME = 'Qwarx'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 ' \
'(KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.3'
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'INFO'
CONCURRENT_REQUESTS = 1000
REACTOR_THREADPOOL_MAXSIZE = 1000
DOWNLOAD_DELAY = 0
COOKIES_ENABLED = False
REDIRECT_ENABLED = True
AJAXCRAWL_ENABLED = True
AUTOTHROTTLE_ENABLED = False
RETRY_ENABLED = True
DOWNLOAD_TIMEOUT = 60
DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 100000
CRAWL_LIMIT_PER_DOMAIN = 100000
URLLENGTH_LIMIT = 180
USER_AGENT_CHOICES = [
'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.62 Safari/537.36',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140205 Firefox/24.0 Iceweasel/24.3.0',
'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0',
'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
]
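The RotateUserAgentMiddleware referenced in DOWNLOADER_MIDDLEWARES isn't shown; presumably it picks from USER_AGENT_CHOICES for spiders that set rotate_user_agent = True, along these lines (a sketch, not the actual middleware):

import random


class RotateUserAgentMiddleware:
    # sketch of the middleware referenced above; the real one isn't shown

    def __init__(self, choices):
        self.choices = choices

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENT_CHOICES'))

    def process_request(self, request, spider):
        if getattr(spider, 'rotate_user_agent', False) and self.choices:
            request.headers['User-Agent'] = random.choice(self.choices)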
Answer 0 (score: 0):
So I found the solution to my problem.
When crawling many domains I was getting a bunch of "false negatives": running the validation crawl against the same 10k URLs several times in a row would never return the same number of results.
I have since set up a rotating proxy system (via Crawlera), and the results are now completely stable.
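For anyone in the same situation, the setup is roughly the following, assuming the scrapy-crawlera plugin (the API key is a placeholder):

# settings.py additions, merged into the existing DOWNLOADER_MIDDLEWARES
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your Crawlera API key>'  # placeholder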