Parallel requests in Scrapy

Date: 2016-01-06 13:48:56

Tags: python parallel-processing scrapy

I have run into the following problem in Scrapy: I am trying to populate my item in the parse_additional_info callback, and to do that I need to scrape a bunch of additional URLs in a second callback, parse_player:

for path in path_player:
    url = path.xpath('url_extractor').extract()[0]
    yield Request(url, meta={'item': item}, callback=self.parse_player, priority=300)

When I do this, my understanding is that the requests are executed asynchronously at some later point, populating item, but yield item returns it immediately, before it is fully populated. I know it is not possible to block until every yield Request(url, meta={'item': item}, callback=self.parse_player, priority=300) has completed, but how do you work around this? That is, how do you make sure the item is only yielded once all the information from those requests has been filled in?
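
A pattern commonly used for exactly this (not shown in the original post) is to chain the requests instead of issuing them in parallel: carry the list of still-unvisited player URLs in meta and only yield the item from the callback invocation that handles the last URL. A rough sketch, where extract_player_urls is a hypothetical stand-in for the extraction logic below:

# Sketch only: chains the player requests one after another so the item
# is yielded exactly once, after the last player page has been parsed.
def parse_additional_info(self, response):
    item = response.meta['item']
    urls = extract_player_urls(response)  # hypothetical helper
    if not urls:
        yield item  # no player pages: the item is already complete
        return
    # Start the chain with the first URL and carry the rest along.
    yield Request(urls[0], meta={'item': item, 'remaining': urls[1:]},
                  callback=self.parse_player)

def parse_player(self, response):
    item = response.meta['item']
    # ... fill in this player's data on the item ...
    remaining = response.meta['remaining']
    if remaining:
        # More player pages left: request the next one, passing the state on.
        yield Request(remaining[0], meta={'item': item, 'remaining': remaining[1:]},
                      callback=self.parse_player)
    else:
        # Last player page done: the item is now fully populated.
        yield item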

from scrapy.spiders import Spider, CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
from datetime import datetime
from footscript.items import MatchResultItem
import re, json, string, uuid  # 'import datetime' removed: it would shadow the class imported above

class PreliminarySpider(Spider):
  name = "script"
  start_urls = [
      start_url1,
      start_url2,
      start_url3,
      start_url4,
      start_url5,
      start_url6,
      start_url7,
      start_url8,
      start_url9,
      start_url10,
  ]
  allowed_domains = ['domain.com']

  def parse(self, response):
    sel = Selector(response)
    matches = sel.xpath('match_selector')
    for match in matches:
      try:
        item = MatchResultItem()
        item['url'] = match.xpath('match_url_extractor').extract()[0]
      except Exception:
        print "Unable to get: %s" % match.extract()
        continue  # skip matches whose URL could not be extracted
      yield Request(url=item['url'], meta={'item': item}, callback=self.parse_additional_info)

  def parse_additional_info(self, response):
    item = response.request.meta['item']
    sel = Selector(response)

    try:
      item['roun'] = sel.xpath('round_extractor').extract()[0]
      item['stadium'] = sel.xpath('stadium_extractor').extract()[0]
      item['attendance'] = sel.xpath('attendance_extractor').extract()[0]
    except Exception:
      print "Attributes not found at:" % item['url']

    item['player'] = []
    path_player = sel.xpath('path_extractor')
    for path in path_player:
      player = path.xpath('player_extractor').extract()[0]
      player_id = path.xpath('player_d_extractor').extract()[0]
      country = path.xpath('country_extractor').extract()[0]
      item['player'].append([player_id, player, country])
      url = path.xpath('url_extractor').extract()[0]
      yield Request(url, meta={'item': item}, callback=self.parse_player, priority=300)
    # except Exception:
    #   print "Unable to get players"
    yield item

  def parse_player(self, response):
    item = response.request.meta['item']
    sel = Selector(response)
    play_id = re.sub("[^0-9]", "",response.url)
    name = sel.xpath('//div[@class="fdh-wrap contentheader"]/h1/text()').extract()[0].encode('utf-8').rstrip()
    index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
    item['player'][index[0]][1]=name
    return item

Edit: new code:

      yield Request(url, meta={'item': item}, callback=self.parse_player, errback=self.err_player)
    # except Exception:
    #   print "Unable to get players"
    yield item

  def parse_player(self, response):
    item = response.request.meta['item']
    sel = Selector(response)
    play_id = re.sub("[^0-9]", "", response.url)
    name = sel.xpath('//div[@class="fdh-wrap contentheader"]/h1/text()').extract()[0].encode('utf-8').rstrip()
    index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
    item['player'][index[0]][1] = name
    item['player'][index[0]].append("1")
    return item

  def err_player(self, failure):
    # Scrapy passes a twisted Failure to errbacks, not a Response;
    # the originating request is available as failure.request.
    print "****************"
    print "Player not found"
    print "****************"
    item = failure.request.meta['item']
    play_id = re.sub("[^0-9]", "", failure.request.url)
    index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
    item['player'][index[0]].append("1")
    return item

1 Answer:

Answer 0 (score: 2)

Passing an item across multiple callbacks is a delicate exercise. It can work in very simple cases, but you can run into all sorts of problems:

  • A request can fail (you can work around that with Request(..., errback=self.my_parse_err), but creating 2 callbacks for every request is quite tedious)
  • The second request can hit a duplicate URL (you can work around that with Request(..., dont_filter=True) and by adding HTTPCACHE_ENABLED=True to settings.py); see the sketch below
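
Put together, a request using both workarounds might look like this (a sketch: the Scrapy keyword arguments are real, the surrounding names are taken from the question):

# settings.py: cache responses so refetching duplicate URLs stays cheap
# HTTPCACHE_ENABLED = True

yield Request(url,
              meta={'item': item},
              callback=self.parse_player,
              errback=self.err_player,  # invoked on download failure
              dont_filter=True)         # bypass the duplicate-request filter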

From both a development and a production standpoint, the safe path is to create one type of item per type of page, then combine the 2 related items in a post-processing step.
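
A minimal sketch of that pattern, with hypothetical MatchItem and PlayerItem classes (the question only defines MatchResultItem) linked through the match URL:

import re
import scrapy

class MatchItem(scrapy.Item):
    url = scrapy.Field()
    stadium = scrapy.Field()

class PlayerItem(scrapy.Item):
    match_url = scrapy.Field()  # key used to join players back to matches
    player_id = scrapy.Field()
    name = scrapy.Field()

# Inside the spider, each callback yields its own item type:
def parse_additional_info(self, response):
    yield MatchItem(url=response.url,
                    stadium=response.xpath('stadium_extractor').extract_first())
    for url in response.xpath('url_extractor').extract():
        yield scrapy.Request(url, meta={'match_url': response.url},
                             callback=self.parse_player)

def parse_player(self, response):
    yield PlayerItem(match_url=response.meta['match_url'],
                     player_id=re.sub("[^0-9]", "", response.url),
                     name=response.xpath('name_extractor').extract_first())

The two item streams can then be joined on match_url after the crawl, for example with a small script or directly in the database.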

Also note that if you have duplicate URLs, you may end up with duplicate data in your items, which in turn causes data normalization problems in your database.
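
One simple guard is a pipeline that drops items whose URL has already been seen during the crawl (DropItem comes from scrapy.exceptions; the 'url' field name is taken from the question's item):

from scrapy.exceptions import DropItem

class DedupPipeline(object):
    """Drop any item whose 'url' field was already seen in this crawl."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if item['url'] in self.seen:
            raise DropItem("Duplicate item: %s" % item['url'])
        self.seen.add(item['url'])
        return item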