I'm just starting to play around with Scrapy to help scrape some fantasy basketball stats. My main problem is in my spider: how do I scrape the href attribute of a link and then callback another parser on that URL?
I looked into link extractors, and I think they might be my solution, but I'm not sure. I've re-read the docs over and over and am still confused about where to start. Here is the code I have so far.
def parse_player(self, response):
    player_name = "Steven Adams"
    sel = Selector(response)
    player_url = sel.xpath("//a[text()='%s']/@href" % player_name).extract()
    return Request("http://sports.yahoo.com/'%s'" % player_url, callback = self.parse_curr_stats)

def parse_curr_stats(self, response):
    sel = Selector(response)
    stats = sel.xpath("//div[@id='mediasportsplayercareerstats']//table[@summary='Player']/tbody/tr[last()-1]")
    items = []
    for stat in stats:
        item = player_item()
        item['fgper'] = stat.xpath("td[@title='Field Goal Percentage']/text()").extract()
        item['ftper'] = stat.xpath("td[@title='Free Throw Percentage']/text()").extract()
        item['treys'] = stat.xpath("td[@title='3-point Shots Made']/text()").extract()
        item['pts'] = stat.xpath("td[@title='Points']/text()").extract()
        item['reb'] = stat.xpath("td[@title='Total Rebounds']/text()").extract()
        item['ast'] = stat.xpath("td[@title='Assists']/text()").extract()
        item['stl'] = stat.xpath("td[@title='Steals']/text()").extract()
        item['blk'] = stat.xpath("td[@title='Blocked Shots']/text()").extract()
        item['tov'] = stat.xpath("td[@title='Turnovers']/text()").extract()
        item['fga'] = stat.xpath("td[@title='Field Goals Attempted']/text()").extract()
        item['fgm'] = stat.xpath("td[@title='Field Goals Made']/text()").extract()
        item['fta'] = stat.xpath("td[@title='Free Throws Attempted']/text()").extract()
        item['ftm'] = stat.xpath("td[@title='Free Throws Made']/text()").extract()
        items.append(item)
    return items
As you can see, in the first parse function you're given a name, and you look on the page for the link that will lead you to that player's individual page, which is stored in "player_url". How do I then get to that page and run the second parser on it?
I feel as if I'm completely glossing over something, so if anyone could shed some light on this I would greatly appreciate it!
Answer 0 (score: 0)
If you want to send Request objects, use yield instead of return, like this:
def parse_player(self, response):
    ......
    yield Request(......)
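Applied to the spider in the question, a minimal sketch could look like the following. This is only a sketch: the roster start URL is a made-up placeholder, and it assumes a Scrapy version where response.xpath() and response.urljoin() are available; the player XPath and the parse_curr_stats callback are taken from the question.

import scrapy


class PlayerSpider(scrapy.Spider):
    name = "players"
    # hypothetical roster page that links to each player's individual page
    start_urls = ["http://sports.yahoo.com/nba/players"]

    def parse(self, response):
        player_name = "Steven Adams"
        # extract() returns a list of matching hrefs, so take the first one
        hrefs = response.xpath("//a[text()='%s']/@href" % player_name).extract()
        if hrefs:
            # urljoin handles both relative and absolute hrefs;
            # yield (not return) so more requests or items could follow
            yield scrapy.Request(response.urljoin(hrefs[0]),
                                 callback=self.parse_curr_stats)

    def parse_curr_stats(self, response):
        # populate the item exactly as in the question's second parse method
        ...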
If you want to send many Request objects from a single parse method, the best practice is something like this:
def parse_player(self, response):
    ......
    res_objs = []
    # then add every Request object into the 'res_objs' list,
    # and at the end of the method, do the following:
    for req in res_objs:
        yield req
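A hedged sketch of that pattern for the question's setup, assuming Request has been imported from scrapy and that you want one request per player in a (made-up) list of names:

def parse_player(self, response):
    player_names = ["Steven Adams", "Kevin Durant"]  # hypothetical list of players
    res_objs = []
    for name in player_names:
        hrefs = response.xpath("//a[text()='%s']/@href" % name).extract()
        if hrefs:
            # build one Request per player and collect it
            res_objs.append(Request(response.urljoin(hrefs[0]),
                                    callback=self.parse_curr_stats))
    # at the end of the method, yield every collected Request
    for req in res_objs:
        yield req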
I think that while a Scrapy spider is running, the engine handles requests under the hood something like this:
# handle requests
for req_obj in self.parse_player(response):
    # do something with the *Request* object
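In other words, a parse method that uses yield is an ordinary Python generator, and the engine simply iterates over whatever it produces. A toy, Scrapy-free illustration of why that works:

# plain-Python demonstration of a generator callback being consumed by a loop
def fake_parse():
    for url in ["http://example.com/a", "http://example.com/b"]:
        yield url  # stand-in for yielding a scrapy Request

for req in fake_parse():
    print(req)  # the engine similarly consumes each yielded object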
So remember to use yield when you send Request objects.