I'm new to Scrapy. I created a project with
scrapy startproject zhanlang
but when I start the spider with
scrapy crawl zhanlang -o zhanlang.csv
it runs very slowly: only about 6 items per minute! Here is my code:
def after_login(self, response):
    # The site requires login; this callback runs once login has succeeded
    yield Request(url="https://movie.douban.com/subject/26363254/comments?start=0&limit=20&sort=new_score&status=P",
                  meta={'cookiejar': response.meta['cookiejar']},
                  callback=self.parse
                  )
def parse(self, response):
    for comment in response.xpath('//div[@class="comment-item"]'):
        # Create a fresh item per comment so yielded items do not share state
        item = ZhanlangItem()
        item['name'] = comment.xpath('./div[@class="avatar"]/a/@title').extract_first()
        item['text'] = comment.xpath('./div[@class="comment"]/p/text()').extract()
        item['vote'] = comment.xpath('.//span[@class="votes"]/text()').extract_first()
        yield item
    # extract_first() returns None on the last page instead of raising IndexError
    next_page_url = response.xpath('//a[@class="next"]/@href').extract_first()
    if next_page_url is not None:
        next_page_url = "https://movie.douban.com/subject/26363254/comments" + next_page_url
        print(next_page_url)
        yield Request(url=next_page_url,
                      meta={'cookiejar': response.meta['cookiejar']},
                      callback=self.parse
                      )
Here are my settings:
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'zhanlang.middlewares.RandomUserAgentMiddleware': 400,
}
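As a rough sanity check on these settings, the sketch below estimates how many items per minute a crawl could produce when it follows "next page" links one at a time. It is only a back-of-the-envelope estimate, not a measurement: the function name `estimated_items_per_minute` and the response latency are hypothetical, the page size is taken from `limit=20` in the start URL, and it pessimistically assumes that the download delay and the latency simply add up.

```python
# Rough throughput estimate for a crawl that requests each page only after
# the previous page has been parsed. Pessimistic assumption: the download
# delay and the response latency add up for every page.

def estimated_items_per_minute(download_delay, items_per_page, latency=0.0):
    """Estimate items/minute for strictly sequential pagination."""
    seconds_per_page = download_delay + latency
    pages_per_minute = 60.0 / seconds_per_page
    return pages_per_minute * items_per_page

DOWNLOAD_DELAY = 0.5   # from settings.py
ITEMS_PER_PAGE = 20    # from limit=20 in the start URL
ASSUMED_LATENCY = 1.0  # hypothetical seconds per response, not measured

estimate = estimated_items_per_minute(DOWNLOAD_DELAY, ITEMS_PER_PAGE, ASSUMED_LATENCY)
print("estimated rate: %.0f items/minute" % estimate)
```

Under these assumptions the estimate is in the hundreds of items per minute, far above the 6 per minute observed, which suggests that DOWNLOAD_DELAY on its own probably does not account for the slowdown.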
My middlewares.py is:
from fake_useragent import UserAgent
import requests, random, json
import base64

class RandomUserAgentMiddleware(object):
    # Sets a randomly chosen User-Agent header on each request
    def __init__(self, crawler):
        super(RandomUserAgentMiddleware, self).__init__()
        self.ua = UserAgent()
        # Attribute of UserAgent to read, e.g. 'random' (the default here)
        self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)
        # setdefault() only fills the header if the request has none yet
        request.headers.setdefault('User-Agent', get_ua())
Why does it crawl so slowly? What can I do to speed it up? Thanks.