I've run into the following problem in Scrapy:
I am trying to populate my item in the function parse_additional_info, and to do that I need to scrape a bunch of additional URLs in a second callback, parse_player:
for path in path_player:
url = path.xpath('url_extractor').extract()[0]
yield Request(url,meta = {'item' : item}, callback= self.parse_player, priority = 300)
When I do this, my understanding is that the requests are executed asynchronously later on and populate item, but yield item returns the item right away, before it has been fully populated.
I know it is not possible to wait for all of the yield Request(url,meta = {'item' : item}, callback= self.parse_player, priority = 300) calls to finish, but how do you work around this? That is, how do you make sure the item is only yielded once all the information from those requests has been collected?
from scrapy.spiders import Spider, CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
from datetime import datetime
from footscript.items import MatchResultItem
import re, json, string, datetime, uuid
class PreliminarySpider(Spider):
name = "script"
start_urls = [
start_url1,
start_url2,
start_url3,
start_url4,
start_url5,
start_url6,
start_url7,
start_url8,
start_url9,
start_url10,
]
allowed_domains = ['domain.com']
def parse(self, response):
sel = Selector(response)
matches = sel.xpath('match_selector')
for match in matches:
try:
item = MatchResultItem()
item['url'] = match.xpath('match_url_extractor').extract()[0]
except Exception:
print "Unable to get: %s" % match.extract()
yield Request(url=item['url'] ,meta = {'item' : item}, callback=self.parse_additional_info)
def parse_additional_info(self, response):
item = response.request.meta['item']
sel = Selector(response)
try:
item['roun'] = sel.xpath('round_extractor').extract()[0]
item['stadium'] = sel.xpath('stadium_extractor').extract()[0]
item['attendance'] = sel.xpath('attendance_extractor').extract()[0]
except Exception:
print "Attributes not found at:" % item['url']
item['player'] = []
path_player = sel.xpath('path_extractor')
for path in path_player:
player = path.xpath('player_extractor').extract()[0]
player_id = path.xpath('player_d_extractor').extract()[0]
country = path.xpath('country_extractor').extract()[0]
item['player'].append([player_id, player, country])
url = path.xpath('url_extractor').extract()[0]
yield Request(url,meta = {'item' : item}, callback= self.parse_player, priority = 300)
# except Exception:
# print "Unable to get players"
yield item
def parse_player(self, response):
item = response.request.meta['item']
sel = Selector(response)
play_id = re.sub("[^0-9]", "",response.url)
name = sel.xpath('//div[@class="fdh-wrap contentheader"]/h1/text()').extract()[0].encode('utf-8').rstrip()
index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
item['player'][index[0]][1]=name
return item
Edit. New code:
yield Request(url,meta = {'item' : item}, callback= self.parse_player, errback= self.err_player)
# except Exception:
# print "Unable to get players"
yield item
def parse_player(self, response):
item = response.request.meta['item']
sel = Selector(response)
play_id = re.sub("[^0-9]", "",response.url)
name = sel.xpath('//div[@class="fdh-wrap contentheader"]/h1/text()').extract()[0].encode('utf-8').rstrip()
index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
item['player'][index[0]][1]=name
item['player'][index[0]].append("1")
return item
    def err_player(self, failure):
        # Scrapy errbacks receive a twisted Failure, not a Response;
        # the original Request (and its meta) is available as failure.request
        print "****************"
        print "Player not found"
        print "****************"
        item = failure.request.meta['item']
        play_id = re.sub("[^0-9]", "", failure.request.url)
        index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
        item['player'][index[0]].append("1")
        return item
Answer 0 (score: 2)
Passing an item around across multiple callbacks is a very delicate exercise. It can work in very simple cases, but you can run into all sorts of problems:
- if one of the requests fails, its callback never runs and the item is lost or left incomplete (you can catch failures with Request(..., errback=self.my_parse_err), but creating 2 callbacks for every request is very tedious)
- duplicate URLs are dropped by Scrapy's duplicate-request filter, so some of the callbacks never fire (you can fix that with Request(...., dont_filter=True) and keep the repeated fetches cheap by adding HTTPCACHE_ENABLED=True to settings.py; a small sketch of both workarounds follows this list)
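A minimal sketch of those two workarounds, reusing the question's parse_player and err_player callbacks; the settings key and Request arguments (errback, dont_filter, meta) are standard Scrapy, nothing else is assumed:

# settings.py: cache downloaded pages locally so that re-requesting
# duplicate URLs (allowed by dont_filter=True) stays cheap
HTTPCACHE_ENABLED = True

# in the spider callback
yield Request(
    url,
    meta={'item': item},
    callback=self.parse_player,   # normal parsing path
    errback=self.err_player,      # called with a twisted Failure on download errors
    dont_filter=True,             # bypass the duplicate-request filter
)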
The safe path, from both a development and a production point of view, is to create one type of item per type of page, and then combine the two related item types in a post-processing step.
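A minimal sketch of that approach, assuming hypothetical MatchItem and PlayerItem classes and placeholder selector names (none of these names come from the original project):

import re
import scrapy
from scrapy.http import Request

class MatchItem(scrapy.Item):
    match_id = scrapy.Field()
    stadium = scrapy.Field()

class PlayerItem(scrapy.Item):
    match_id = scrapy.Field()   # shared key used to join the two item types later
    player_id = scrapy.Field()
    name = scrapy.Field()

class MatchSpider(scrapy.Spider):
    name = "matches_sketch"

    def parse_additional_info(self, response):
        match_id = response.url  # any stable key shared by both item types
        yield MatchItem(match_id=match_id,
                        stadium=response.xpath('stadium_extractor').extract_first())
        # every player page becomes its own item instead of mutating the match item
        for path in response.xpath('path_extractor'):
            url = path.xpath('url_extractor').extract_first()
            yield Request(url, meta={'match_id': match_id}, callback=self.parse_player)

    def parse_player(self, response):
        yield PlayerItem(match_id=response.meta['match_id'],
                         player_id=re.sub("[^0-9]", "", response.url),
                         name=response.xpath('name_extractor').extract_first())

Because every PlayerItem carries the match_id, the spider never has to keep an item open while waiting for other requests to finish.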
Also note that if you have duplicate URLs you may end up with duplicate data in your items, which in turn causes data normalization problems in your database.
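For the post-processing step, a minimal sketch assuming the two item types were exported to hypothetical matches.jl and players.jl JSON-lines files:

import json
from collections import defaultdict

# index players by their shared match_id, deduplicating on player_id
# so repeated URLs do not turn into repeated rows
players_by_match = defaultdict(dict)
with open('players.jl') as f:
    for line in f:
        p = json.loads(line)
        players_by_match[p['match_id']][p['player_id']] = p

# attach the (deduplicated) players to each match record
with open('matches.jl') as f, open('matches_joined.jl', 'w') as out:
    for line in f:
        match = json.loads(line)
        match['players'] = list(players_by_match[match['match_id']].values())
        out.write(json.dumps(match) + '\n')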