How do I request a callback on a URL I scraped in my first parse?

Asked: 2013-12-08 01:20:07

Tags: python web-scraping scrapy

I just started playing around with Scrapy to help scrape some fantasy basketball stats. My main problem is in my spider: how do I scrape the href attribute of a link and then issue a callback to another parser on that URL?

I looked into link extractors, and I think they might be my solution, but I'm not sure. I've re-read the documentation over and over and am still confused about where to start. Here is the code I have so far.

def parse_player(self, response):
    player_name = "Steven Adams"
    sel = Selector(response)
    # extract() returns a list of matching hrefs, so take the first one
    player_url = sel.xpath("//a[text()='%s']/@href" % player_name).extract()[0]
    return Request("http://sports.yahoo.com/%s" % player_url, callback=self.parse_curr_stats)

def parse_curr_stats(self, response):
    sel = Selector(response)
    stats = sel.xpath("//div[@id='mediasportsplayercareerstats']//table[@summary='Player']/tbody/tr[last()-1]")
    items =[]

    for stat in stats:
        item = player_item()
        item['fgper'] = stat.xpath("td[@title='Field Goal Percentage']/text()").extract()
        item['ftper'] = stat.xpath("td[@title='Free Throw Percentage']/text()").extract()
        item['treys'] = stat.xpath("td[@title='3-point Shots Made']/text()").extract() 
        item['pts'] = stat.xpath("td[@title='Points']/text()").extract()
        item['reb'] = stat.xpath("td[@title='Total Rebounds']/text()").extract()
        item['ast'] = stat.xpath("td[@title='Assists']/text()").extract()
        item['stl'] = stat.xpath("td[@title='Steals']/text()").extract()
        item['blk'] = stat.xpath("td[@title='Blocked Shots']/text()").extract()
        item['tov'] = stat.xpath("td[@title='Turnovers']/text()").extract()
        item['fga'] = stat.xpath("td[@title='Field Goals Attempted']/text()").extract()
        item['fgm'] = stat.xpath("td[@title='Field Goals Made']/text()").extract()
        item['fta'] = stat.xpath("td[@title='Free Throws Attempted']/text()").extract()
        item['ftm'] = stat.xpath("td[@title='Free Throws Made']/text()").extract()
        items.append(item)
    return items

As you can see, in the first parse function you're given a name, and you look up the link on the page that leads to that player's individual page, which is stored in player_url. How do I then go to that page and run the second parser on it?

I feel as though I'm completely glossing over something, and if anyone could shed some light on it I would greatly appreciate it!

1 Answer:

Answer 0 (score: 0)

If you want to send Request objects, use yield instead of return, like this:

def parse_player(self, response):
    ...... 
    yield Request(......)
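
Applied to the question's parse_player, a minimal sketch could look like the following. The imports, the [0] indexing, and the way the href is joined to the base URL are assumptions about the target page, not something stated in the original post:

from scrapy.http import Request
from scrapy.selector import Selector

def parse_player(self, response):
    player_name = "Steven Adams"
    sel = Selector(response)
    # extract() returns a list of matching hrefs; take the first one
    player_url = sel.xpath("//a[text()='%s']/@href" % player_name).extract()[0]
    # Yielding the Request hands it to the Scrapy engine, which downloads
    # the player's page and then calls parse_curr_stats with that response.
    yield Request("http://sports.yahoo.com/%s" % player_url,
                  callback=self.parse_curr_stats)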

If you want to send many Request objects from a single parse method, the best practice is something like this:

def parse_player(self, response):
    ......
    res_objs = []
    # then add every Request object into 'res_objs' list,
    # and in the end of the method, do the following:
    for req in res_objs:
        yield req
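
For example, a sketch that fleshes out res_objs, using the same imports as above. The player_names list, the if check, and the URL join are purely illustrative assumptions:

def parse_player(self, response):
    sel = Selector(response)
    # hypothetical list of players to look up on the current page
    player_names = ["Steven Adams", "Kevin Durant"]
    res_objs = []
    for name in player_names:
        hrefs = sel.xpath("//a[text()='%s']/@href" % name).extract()
        if hrefs:
            res_objs.append(Request("http://sports.yahoo.com/%s" % hrefs[0],
                                    callback=self.parse_curr_stats))
    # yield every collected Request so the engine schedules all of them
    for req in res_objs:
        yield req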

I think that while a Scrapy spider is running, the engine handles its requests roughly like this under the hood:

# handle the requests yielded by the spider's callback
for req_obj in self.parse_player(response):
    # do something with the *Request* object (schedule it, download it, ...)

So remember to use yield to send Request objects.
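
The same applies to items: a callback may yield Item objects and Request objects interchangeably, so parse_curr_stats could just as well yield each item as it is built instead of returning a list at the end. A minimal sketch, with player_item being the item class from the question and the field extraction otherwise unchanged:

def parse_curr_stats(self, response):
    sel = Selector(response)
    stats = sel.xpath("//div[@id='mediasportsplayercareerstats']"
                      "//table[@summary='Player']/tbody/tr[last()-1]")
    for stat in stats:
        item = player_item()
        item['pts'] = stat.xpath("td[@title='Points']/text()").extract()
        # ... fill in the remaining fields exactly as in the question ...
        yield item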