Scrapy crawling speed is very slow (6 pages per minute)!

Date: 2017-08-18 09:24:40

Tags: python scrapy

I'm new to Scrapy. I created a project with scrapy startproject zhanlang, but when I start the spider with scrapy crawl zhanlang -o zhanlang.csv it runs very slowly, only about 6 pages per minute! Here is my code:

from scrapy import Request
from zhanlang.items import ZhanlangItem


def after_login(self, response):
    # The site requires login; this callback runs once the login succeeds.
    yield Request(url="https://movie.douban.com/subject/26363254/comments?start=0&limit=20&sort=new_score&status=P",
                  meta={'cookiejar': response.meta['cookiejar']},
                  callback=self.parse
                  )


def parse(self, response):
    # Create a fresh item per comment; reusing a single instance would
    # yield the same mutated object over and over.
    for comment in response.xpath('//div[@class="comment-item"]'):
        item = ZhanlangItem()
        item['name'] = comment.xpath('./div[@class="avatar"]/a/@title').extract_first()
        item['text'] = comment.xpath('./div[@class="comment"]/p/text()').extract()
        item['vote'] = comment.xpath('.//span[@class="votes"]/text()').extract_first()
        yield item
    # extract_first() returns None on the last page instead of raising
    # IndexError like extract()[0]; check before building the absolute URL.
    next_page_url = response.xpath('//a[@class="next"]/@href').extract_first()
    if next_page_url is not None:
        next_page_url = "https://movie.douban.com/subject/26363254/comments" + next_page_url
        print(next_page_url)
        yield Request(url=next_page_url,
                      meta={'cookiejar': response.meta['cookiejar']},
                      callback=self.parse
                      )
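
One structural note on this spider: the Request for page N+1 is only yielded after page N has been downloaded and parsed, so the whole crawl is a single sequential chain and CONCURRENT_REQUESTS_PER_DOMAIN never actually kicks in. Below is a minimal sketch of queueing the comment pages up front instead, assuming the start=N&limit=20 offset pattern from the URL above; the page count of 50 is a made-up placeholder, not something taken from the site:

def after_login(self, response):
    # Sketch: schedule many comment pages at once so Scrapy's concurrency
    # settings can overlap the downloads; any duplicates later discovered
    # via the "next" link are dropped by the built-in dupefilter.
    base = "https://movie.douban.com/subject/26363254/comments?start={}&limit=20&sort=new_score&status=P"
    for page in range(50):  # 50 pages is an assumed upper bound
        yield Request(url=base.format(page * 20),
                      meta={'cookiejar': response.meta['cookiejar']},
                      callback=self.parse)

With the requests queued this way, raising CONCURRENT_REQUESTS_PER_DOMAIN actually changes throughput; in the sequential version it cannot.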

Here are my settings:

DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'zhanlang.middlewares.RandomUserAgentMiddleware': 400,
}
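
A side note on these settings: RANDOMIZE_DOWNLOAD_DELAY is on by default, so DOWNLOAD_DELAY = 0.5 really means a wait of roughly 0.25 to 0.75 seconds per request, which by itself cannot explain a ten-second page interval. A hedged checklist of the settings I would verify first, assuming the rest of settings.py is the stock startproject template:

# These are knobs to double-check, not a recommended configuration.
AUTOTHROTTLE_ENABLED = False     # if this was uncommented, AutoThrottle adjusts the
                                 # delay itself, starting from AUTOTHROTTLE_START_DELAY
                                 # (5 seconds by default)
RANDOMIZE_DOWNLOAD_DELAY = True  # default; actual wait is 0.5x-1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 16         # the global cap, separate from the per-domain cap
LOGSTATS_INTERVAL = 60.0         # prints "Crawled N pages (at N pages/min)" so the
                                 # real crawl rate can be confirmed in the log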

And my middlewares.py:

from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    # Attach a randomly chosen User-Agent header to every outgoing request.
    def __init__(self, crawler):
        super(RandomUserAgentMiddleware, self).__init__()
        self.ua = UserAgent()
        self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            # e.g. self.ua.random when RANDOM_UA_TYPE is 'random'
            return getattr(self.ua, self.ua_type)
        request.headers.setdefault('User-Agent', get_ua())
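
One thing that is easy to miss in this middleware: UserAgent() fetches fake_useragent's browser-statistics database over the network the first time it is created, which can stall or fail at spider startup (this affects startup time, not the per-page rate). A small defensive sketch, assuming the installed fake_useragent version supports the fallback argument:

from fake_useragent import UserAgent

# Assumption: this fake_useragent build accepts fallback=...; if the remote
# database cannot be fetched, the given string is used instead of raising.
ua = UserAgent(fallback='Mozilla/5.0 (Windows NT 10.0; Win64; x64)')
print(ua.random)  # one randomly chosen User-Agent string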

Why is it crawling so slowly? What can I do to speed it up? Thanks.

0 Answers:

No answers