Scrapy broad crawl with high concurrency but a low request rate on each domain

Date: 2016-05-22 23:13:03

Tags: python-2.7 concurrency web-scraping scrapy

I am trying to get a Scrapy broad crawl working. The goal is to run many concurrent crawls across different domains while crawling each individual domain gently, so that the overall crawl speed stays high but the request frequency against any single domain stays low.

Here is the spider I am using:

import re
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from myproject.items import MyprojectItem

class testSpider(CrawlSpider):
    name = "testCrawler16"
    start_urls = [
              "http://example.com",
    ]

    extractor = SgmlLinkExtractor(deny=('.com','.nl','.org'),
                              allow=('.se'))

    rules = (
        Rule(extractor,callback='parse_links',follow=True),
        )

    def parse_links(self, response):
        item = MyprojectItem()
        item['url'] = response.url
        item['depth'] = response.meta['depth']
        yield item

Here are the settings I am using:

BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

REACTOR_THREADPOOL_MAXSIZE = 20
RETRY_ENABLED = False
REDIRECT_ENABLED = False
DOWNLOAD_TIMEOUT = 15
LOG_LEVEL = 'INFO'
COOKIES_ENABLED = False
DEPTH_LIMIT = 10


AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 1
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

The problem is that after a while the crawler makes fewer and fewer requests and only alternates between a handful of domains, sometimes just one, so AutoThrottle slows the whole crawl down. How can I keep the spider's concurrency up, maintain many separate connections to many different domains, and use that concurrency for speed while keeping the request rate on each individual domain low?

2 Answers:

Answer 0 (score: 2)

AUTOTHROTTLE_ENABLED is not recommended for fast crawling; I suggest setting it to False and throttling the crawl gently yourself.

The only settings you need here are CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and DOWNLOAD_DELAY.

Set DOWNLOAD_DELAY to the delay you want between requests to the same domain, for example 10 if you want 6 requests per minute (one every 10 seconds).

Set CONCURRENT_REQUESTS_PER_DOMAIN to 1 so that the DOWNLOAD_DELAY above is respected for each domain.

Set CONCURRENT_REQUESTS to a high value; it could be the number of domains you plan to crawl (or higher). It just needs to be high enough not to interfere with the previous settings. A settings sketch is shown below.
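For concreteness, a minimal sketch of the settings this answer describes (the 10-second delay and the 200 figure are illustrative values, not taken from the question):

# Broad crawl, polite per domain -- values are illustrative
AUTOTHROTTLE_ENABLED = False         # throttle manually instead of using AutoThrottle

DOWNLOAD_DELAY = 10                  # one request every 10 seconds per domain (6 per minute)
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # enforce that delay for each domain
CONCURRENT_REQUESTS = 200            # roughly the number of domains crawled in parallel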

Answer 1 (score: 0)

You could keep a timestamp per domain in a dictionary and always grab the domain with the smallest value (the oldest). Then either pop the URL from the list, or make the list global and remove entries from the pipeline.

import time

import scrapy
from scrapy import Spider

# Project-specific item and item loader (module path assumed)
from myproject.items import PageItem, PageItemLoader


class PoliteSpider(Spider):
    name = 'polite_spider'
    allowed_urls = ['*']
    custom_settings = {
        'ITEM_PIPELINES': {
            'pipelines.MainPipeline': 90,
        },
        'CONCURRENT_REQUESTS': 200,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 25,
        'ROBOTSTXT_OBEY': False,
        'CONCURRENT_ITEMS': 100,
        'REACTOR_THREADPOOL_MAXSIZE': 400,
        # Hides printing of item dicts
        'LOG_LEVEL': 'INFO',
        'RETRY_ENABLED': False,
        'REDIRECT_MAX_TIMES': 1,
        # Stops loading a page after 5 MB
        'DOWNLOAD_MAXSIZE': 5592405,
        # Grabs the xpath even if the site did not finish loading
        'DOWNLOAD_FAIL_ON_DATALOSS': False,
    }

    def __init__(self):
        self.links = ['www.test.com', 'www.different.org', 'www.pogostickaddict.net']
        self.domain_count = {}  # domain -> timestamp of its last response

    def start_requests(self):
        while self.links:
            if self.domain_count:
                # Prefer a URL whose domain has the oldest (smallest) timestamp
                oldest = min(self.domain_count, key=self.domain_count.get)
                url = next((x for x in self.links if oldest in x), self.links[0])
            else:
                # Nothing crawled yet, just take the first link
                url = self.links[0]
            request = scrapy.Request(url, callback=self.parse, dont_filter=True,
                                     meta={'time': time.time(), 'url': url})

            yield request

    def parse(self, response):
        # Record when this domain was last crawled
        domain = response.url.split('//')[-1].split('/')[0]
        self.domain_count[domain] = time.time()

        pageloader = PageItemLoader(PageItem(), response=response)

        pageloader.add_xpath('search_results', '//div[1]/text()')
        self.links.remove(response.meta['url'])

        yield pageloader.load_item()

A small example:

import time
test = {'www.test.com': 1, 'www.different.org': 2, 'www.pogostickaddict.net': 3}
links = ['www.test.com/blog', 'www.different.org/login', 'www.pogostickaddict.net/store/checkout']

# pick the first link whose domain has the largest value in `test`
url = next(x for x in links if max(test, key=test.get) in x)
print(time.time())
print(links)
print(url)
links.remove(url)
print(links)
print(time.time())

1549868251.3280149
['www.test.com/blog', 'www.different.org/login', 'www.pogostickaddict.net/store/checkout']
www.pogostickaddict.net/store/checkout
['www.test.com/blog','www.different.org/login']
1549868251.328043
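For comparison, the same lookup using min, which picks the domain with the smallest (oldest) timestamp the way the spider's start_requests does (a minimal sketch reusing test and links from before the remove call):

oldest_domain = min(test, key=test.get)              # 'www.test.com' has the smallest value
url = next(x for x in links if oldest_domain in x)   # -> 'www.test.com/blog'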