I've successfully implemented pause/resume in Scrapy with the help of the docs (https://doc.scrapy.org/en/latest/topics/jobs.html). I've also been able to adapt an example so that a spider scrapes multiple pages to fill in the values of a single item on one CSV row ({{3}}). However, I can't seem to combine the two, so that I have a spider that scrapes two pages per item and can also be paused and restarted.
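(For reference, the pause/resume side of this, per those docs, comes down to running the crawl with a JOBDIR setting; a minimal sketch, with a hypothetical spider name:)

import scrapy

class BeerSpider(scrapy.Spider):
    name = 'beer'  # hypothetical spider name
    # Equivalent to running: scrapy crawl beer -s JOBDIR=crawls/beer-1
    # One Ctrl-C pauses the crawl; re-running the same command resumes it.
    custom_settings = {'JOBDIR': 'crawls/beer-1'}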
Below is my attempt for www.beeradvocate.com. urls_collection1 and urls_collection2 are lists of >40,000 URLs each.
Initiating the requests:
import pandas as pd
import scrapy
# `Item` below stands in for my scrapy.Item subclass

def start_requests(self):
    urls_collection1 = pd.read_csv('urls_collection1.csv')
    # example url_collection1: 'https://www.beeradvocate.com/community/members/sammy.3853/?card=1'
    urls_collection2 = pd.read_csv('urls_collection2.csv')
    # example url_collection2: 'https://www.beeradvocate.com/user/beers/?ba=Sammy'
    for i in range(len(urls_collection1)):
        item = Item()  # instantiate the item class (name assumed)
        yield scrapy.Request(urls_collection1.iloc[i, 0], callback=self.parse1, meta={'item': item})
        yield scrapy.Request(urls_collection2.iloc[i, 0], callback=self.parse2, meta={'item': item})
        # to allow for pause/resume
        self.state['items_count'] = self.state.get('items_count', 0) + 1
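(The self.state dict used above is what JOBDIR persists between runs, so the counter survives a pause/resume cycle.)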
Parsing the first page:
def parse1(self, response):
    item = response.meta['item']
    item['gender_age'] = response.css('.userTitleBlurb .userBlurb').xpath('text()').extract_first()
    yield item
Parsing the second page:
def parse2(self, response):
    item = response.meta['item']
    item['num_reviews'] = response.xpath('//*[@id="ba-content"]/div/b/text()[2]').extract_first()
    return item
Everything seems to work, except that the data scraped by parse1 and parse2 ends up on separate rows instead of on the same row as one item.
Answer 0 (score: 0):
Try this:
def start_requests(self):
    urls_collection1 = pd.read_csv('urls_collection1.csv')
    # example url_collection1: 'https://www.beeradvocate.com/community/members/sammy.3853/?card=1'
    urls_collection2 = pd.read_csv('urls_collection2.csv')
    # example url_collection2: 'https://www.beeradvocate.com/user/beers/?ba=Sammy'
    for i in range(len(urls_collection1)):
        item = Item()  # your item class (name assumed)
        # request page 1 and carry page 2's URL along in meta
        yield scrapy.Request(urls_collection1.iloc[i, 0],
                             callback=self.parse1,
                             meta={'item': item,
                                   'collection2_url': urls_collection2.iloc[i, 0]})

def parse1(self, response):
    collection2_url = response.meta['collection2_url']
    item = response.meta['item']
    item['gender_age'] = response.css('.userTitleBlurb .userBlurb').xpath('text()').extract_first()
    # pass the partially-filled item on to the second page instead of yielding it here
    yield scrapy.Request(collection2_url,
                         callback=self.parse2,
                         meta={'item': item})

def parse2(self, response):
    item = response.meta['item']
    item['num_reviews'] = response.xpath('//*[@id="ba-content"]/div/b/text()[2]').extract_first()
    return item
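The key change is that the second request is chained from parse1 rather than issued independently from start_requests: the item rides through both callbacks via meta and is yielded only once, from parse2, so both fields land on the same row. Pause/resume keeps working as long as the queued requests (including their meta) stay serializable by pickle, which is the condition the jobs docs set for persistence.

On Scrapy 1.7+ the same chaining can use cb_kwargs instead of meta; a sketch under the same assumptions (Item standing in for your item class):

# start_requests would pass cb_kwargs={'item': item,
#                                      'collection2_url': urls_collection2.iloc[i, 0]}
def parse1(self, response, item, collection2_url):
    item['gender_age'] = response.css('.userTitleBlurb .userBlurb').xpath('text()').extract_first()
    yield scrapy.Request(collection2_url,
                         callback=self.parse2,
                         cb_kwargs={'item': item})

def parse2(self, response, item):
    item['num_reviews'] = response.xpath('//*[@id="ba-content"]/div/b/text()[2]').extract_first()
    return item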