I'm just starting to play around with Scrapy to help scrape some fantasy basketball stats. My main problem is in my spider: how do I scrape the href attribute of a link and then callback another parser on that URL?
I looked into link extractors, and I think they might be my solution, but I'm not sure. I've re-read the docs over and over and am still confused about where to start. Here is the code I have so far.
def parse_player(self, response):
    player_name = "Steven Adams"
    sel = Selector(response)
    player_url = sel.xpath("//a[text()='%s']/@href" % player_name).extract()
    return Request("http://sports.yahoo.com/'%s'" % player_url, callback = self.parse_curr_stats)

def parse_curr_stats(self, response):
    sel = Selector(response)
    stats = sel.xpath("//div[@id='mediasportsplayercareerstats']//table[@summary='Player']/tbody/tr[last()-1]")
    items = []
    for stat in stats:
        item = player_item()
        item['fgper'] = stat.xpath("td[@title='Field Goal Percentage']/text()").extract()
        item['ftper'] = stat.xpath("td[@title='Free Throw Percentage']/text()").extract()
        item['treys'] = stat.xpath("td[@title='3-point Shots Made']/text()").extract()
        item['pts'] = stat.xpath("td[@title='Points']/text()").extract()
        item['reb'] = stat.xpath("td[@title='Total Rebounds']/text()").extract()
        item['ast'] = stat.xpath("td[@title='Assists']/text()").extract()
        item['stl'] = stat.xpath("td[@title='Steals']/text()").extract()
        item['blk'] = stat.xpath("td[@title='Blocked Shots']/text()").extract()
        item['tov'] = stat.xpath("td[@title='Turnovers']/text()").extract()
        item['fga'] = stat.xpath("td[@title='Field Goals Attempted']/text()").extract()
        item['fgm'] = stat.xpath("td[@title='Field Goals Made']/text()").extract()
        item['fta'] = stat.xpath("td[@title='Free Throws Attempted']/text()").extract()
        item['ftm'] = stat.xpath("td[@title='Free Throws Made']/text()").extract()
        items.append(item)
    return items
As you can see, in the first parse function you're given a name, and you look on the page for the link that will lead you to that player's individual page, which is stored in "player_url". How do I then get to that page and run the second parser on it?
I feel as if I'm completely glossing over something, so if anyone could shed some light on this I would greatly appreciate it!
Answer 0 (score: 0)
If you want to send Request objects, use yield instead of return, like this:
def parse_player(self, response):
    ......
    yield Request(......)
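Applied to the spider in the question, a minimal sketch could look like the following. This is only a sketch: the roster start URL is a made-up placeholder, and it assumes a Scrapy version where response.xpath() and response.urljoin() are available; the player XPath and the parse_curr_stats callback are taken from the question.

import scrapy


class PlayerSpider(scrapy.Spider):
    name = "players"
    # hypothetical roster page that links to each player's individual page
    start_urls = ["http://sports.yahoo.com/nba/players"]

    def parse(self, response):
        player_name = "Steven Adams"
        # extract() returns a list of matching hrefs, so take the first one
        hrefs = response.xpath("//a[text()='%s']/@href" % player_name).extract()
        if hrefs:
            # urljoin handles both relative and absolute hrefs;
            # yield (not return) so more requests or items could follow
            yield scrapy.Request(response.urljoin(hrefs[0]),
                                 callback=self.parse_curr_stats)

    def parse_curr_stats(self, response):
        # populate the item exactly as in the question's second parse method
        ...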
If you want to send many Request objects from a single parse method, the best practice is something like this:
def parse_player(self, response):
    ......
    res_objs = []
    # then add every Request object into the 'res_objs' list,
    # and at the end of the method, do the following:
    for req in res_objs:
        yield req
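A hedged sketch of that pattern for the question's setup, assuming Request has been imported from scrapy and that you want one request per player in a (made-up) list of names:

def parse_player(self, response):
    player_names = ["Steven Adams", "Kevin Durant"]  # hypothetical list of players
    res_objs = []
    for name in player_names:
        hrefs = response.xpath("//a[text()='%s']/@href" % name).extract()
        if hrefs:
            # build one Request per player and collect it
            res_objs.append(Request(response.urljoin(hrefs[0]),
                                    callback=self.parse_curr_stats))
    # at the end of the method, yield every collected Request
    for req in res_objs:
        yield req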
I think that while a Scrapy spider is running, the engine handles requests under the hood something like this:
# handle requests
for req_obj in self.parse_player(response):
    # do something with the *Request* object
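In other words, a parse method that uses yield is an ordinary Python generator, and the engine simply iterates over whatever it produces. A toy, Scrapy-free illustration of why that works:

# plain-Python demonstration of a generator callback being consumed by a loop
def fake_parse():
    for url in ["http://example.com/a", "http://example.com/b"]:
        yield url  # stand-in for yielding a scrapy Request

for req in fake_parse():
    print(req)  # the engine similarly consumes each yielded object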
So remember to use yield when you send Request objects.