Return data to the previous callback function?

Time: 2019-06-11 18:39:59

Tags: python web-scraping scrapy web-crawler

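I am trying to scrape the Twitter handles of the sports teams that play in a set of tournaments. To reach a team's Twitter handle, I have to start at a page that links to every tournament, follow each link to the page listing that tournament's teams, and then visit each team's page to read its Twitter handle. The trouble starts at the team page: I am not sure how to return the Twitter name to the previous callback function so that I can collect all the Twitter names for a tournament in one list.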

In my last callback function, parse_twitter, I tried returning the result as a dictionary and then adding it to the item in parse_schedule, but had no luck.

def parse(self, response):
    # Get list of tournaments
    tournaments = Selector(response).xpath('//td/a')
    del tournaments[0]

    # Go through each tournament
    for tourney in tournaments:
        item = FrisbeeItem()
        item['tournament_name'] = tourney.xpath('./text()').extract()[0]
        item['tournament_url'] = tourney.xpath('./@href').extract()[0]

        # make the URL to the teams in the tournament
        tournament_schedule = item['tournament_url'] + '/schedule/Men/CollegeMen/'

        # Request to tournament page
        yield scrapy.Request(url=tournament_schedule, callback=self.parse_schedule, meta={'item' : item})

def parse_schedule(self, response):
    item = response.meta.get('item')

    # Get the list of teams
    tourney_teams = Selector(response).xpath('//div[@class = "pool"]//td/a')

    # For each team in the tournament, get name and URL to team page
    for team in tourney_teams:
        team_name = team.xpath('./text()').extract()[0]
        team_url = 'https://play.usaultimate.org/' + team.xpath('./@href').extract()[0]

        # Request to team page
        yield scrapy.Request(url=team_url, callback=self.parse_twitter, meta={'item': item, 'team_name': team_name})



def parse_twitter(self, response):
    item = response.meta.get('item')
    team_name = response.meta.get('team_name')

    result = {}
    # Get the list containing the twitter
    team_twitter = Selector(response).xpath('//dl[@id="CT_Main_0_dlTwitter"]//a/text()').extract()

    # If no Twitter handle is listed, use an empty string
    if len(team_twitter) == 0:
        result = {'name': team_name, 'twitter': ''}
    else:
        result = {'name': team_name, 'twitter': team_twitter[0]}

    item['tournament_teams'] = result

    yield item

I want the output to be close to the following format:

    {'tournament_name': X,
     'teams': [{'team_name': team1, 'twitter_name': twitter1},
               {'team_name': team2, 'twitter_name': twitter2},
               {'team_name': team3, 'twitter_name': twitter3},
               ...]
     }
    {'tournament_name': Y,
     'teams': [{'team_name': team1, 'twitter_name': twitter1},
               {'team_name': team2, 'twitter_name': twitter2},
               {'team_name': team3, 'twitter_name': twitter3},
               ...]
     }

So basically there should be only one item per tournament, containing the name and Twitter handle of every team in that tournament.
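One possible way to reach that shape without changing how the spider crawls is to keep emitting one flat record per team and group the records afterwards, for example when an item pipeline closes or in a post-processing script. The following is only a sketch of the grouping step: `group_by_tournament` is a hypothetical helper, not part of Scrapy, and it assumes each flat record carries the keys `tournament_name`, `team_name` and `twitter_name`.

```python
from collections import defaultdict


def group_by_tournament(team_items):
    """Fold flat per-team records into one record per tournament.

    Hypothetical helper: the key names below are assumptions chosen to
    match the desired output format, not names Scrapy defines.
    """
    grouped = defaultdict(list)
    for entry in team_items:
        grouped[entry['tournament_name']].append(
            {'team_name': entry['team_name'],
             'twitter_name': entry['twitter_name']}
        )
    # One dict per tournament, in the order tournaments were first seen.
    return [{'tournament_name': name, 'teams': teams}
            for name, teams in grouped.items()]


# Example with made-up records standing in for scraped items.
flat = [
    {'tournament_name': 'X', 'team_name': 'team1', 'twitter_name': '@t1'},
    {'tournament_name': 'Y', 'team_name': 'team1', 'twitter_name': ''},
    {'tournament_name': 'X', 'team_name': 'team2', 'twitter_name': '@t2'},
]
tournaments = group_by_tournament(flat)
```

The trade-off is that nothing is emitted per tournament until all records have been collected, so this suits a crawl whose full output fits in memory.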

Right now, with the code I have listed, it spits out one item per team page (that is, one item for each team in each tournament).
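A callback cannot hand a value back to the callback that scheduled it, so one commonly suggested pattern is to chain the team requests instead: parse_schedule gathers all team URLs, requests only the first, and each parse_twitter call appends its result to the shared item and either requests the next team or, when none remain, yields the finished item. The sketch below imitates that control flow in plain Python so it runs without Scrapy. `FakeRequest`, `FakeResponse`, `TEAM_PAGES`, `next_step` and `run` are all stand-ins invented for illustration; in a real spider the requests would be `scrapy.Request` objects carrying the same data in `meta`.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Stand-in for scrapy.Request: a URL, a callback and a meta dict."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Stand-in for a response; `twitter` plays the scraped handle."""
    url: str
    meta: dict
    twitter: str = ''


# Made-up team pages: URL -> Twitter handle ('' when none is listed).
TEAM_PAGES = {'/team/1': '@one', '/team/2': '', '/team/3': '@three'}


def next_step(item, pending):
    """Request the next pending team, or emit the item when none remain."""
    if not pending:
        yield item
        return
    team_name, team_url = pending[0]
    yield FakeRequest(team_url, parse_twitter,
                      meta={'item': item, 'team_name': team_name,
                            'pending': pending[1:]})


def parse_schedule(response):
    # Collect every team first instead of requesting them all at once.
    pending = [('team1', '/team/1'), ('team2', '/team/2'), ('team3', '/team/3')]
    item = {'tournament_name': response.meta['tournament_name'], 'teams': []}
    yield from next_step(item, pending)


def parse_twitter(response):
    item = response.meta['item']
    # Append rather than assign, so earlier teams are not overwritten.
    item['teams'].append({'team_name': response.meta['team_name'],
                          'twitter_name': response.twitter})
    yield from next_step(item, response.meta['pending'])


def run(start):
    """Tiny driver imitating the crawl loop: requests in, items out."""
    queue, items = [start], []
    while queue:
        req = queue.pop(0)
        resp = FakeResponse(req.url, req.meta, TEAM_PAGES.get(req.url, ''))
        for out in req.callback(resp):
            (queue if isinstance(out, FakeRequest) else items).append(out)
    return items


items = run(FakeRequest('/tournament/X/schedule', parse_schedule,
                        meta={'tournament_name': 'X'}))
print(items)
```

Two caveats if this is carried over to a real spider: the team pages of one tournament are fetched one after another rather than concurrently, and a failed team request would break the chain unless an `errback` continues it.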

0 answers:

No answers