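I'm trying to scrape the Twitter handles of the sports teams in various tournaments. To get to the Twitter handles, I first have to start from a page that links to every tournament, then go to the page listing all the teams in that tournament, and then go to each team's page to get its Twitter. I run into trouble once I reach the team page, because I'm not sure how to return the Twitter name to the previous callback so that I can put all the Twitter names for that tournament into one list.
In my last callback, parse_twitter, I tried returning the result as a dictionary and then adding it to the item from parse_schedule, but had no luck.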
def parse(self, response):
    # Get list of tournaments
    tournaments = Selector(response).xpath('//td/a')
    del tournaments[0]
    # Go through each tournament
    for tourney in tournaments:
        item = FrisbeeItem()
        item['tournament_name'] = tourney.xpath('./text()').extract()[0]
        item['tournament_url'] = tourney.xpath('./@href').extract()[0]
        # Make the URL to the teams in the tournament
        tournament_schedule = item['tournament_url'] + '/schedule/Men/CollegeMen/'
        # Request to tournament page
        yield scrapy.Request(url=tournament_schedule, callback=self.parse_schedule, meta={'item': item})

def parse_schedule(self, response):
    item = response.meta.get('item')
    # Get the list of teams
    tourney_teams = Selector(response).xpath('//div[@class = "pool"]//td/a')
    # For each team in the tournament, get name and URL to team page
    for team in tourney_teams:
        team_name = team.xpath('./text()').extract()[0]
        team_url = 'https://play.usaultimate.org/' + team.xpath('./@href').extract()[0]
        # Request to team page
        yield scrapy.Request(url=team_url, callback=self.parse_twitter, meta={'item': item, 'team_name': team_name})

def parse_twitter(self, response):
    item = response.meta.get('item')
    team_name = response.meta.get('team_name')
    result = {}
    # Get the list containing the twitter
    team_twitter = Selector(response).xpath('//dl[@id="CT_Main_0_dlTwitter"]//a/text()').extract()
    # If a twitter is not listed, put empty string
    if len(team_twitter) == 0:
        result = {'name': team_name, 'twitter': ''}
    else:
        result = {'name': team_name, 'twitter': team_twitter[0]}
    item['tournament_teams'] = result
    yield item
I'd like the output to be in a format close to this:
{'tournament_name': X,
 'teams': [{'team_name': team1, 'twitter_name': twitter1},
           {'team_name': team2, 'twitter_name': twitter2},
           {'team_name': team3, 'twitter_name': twitter3},
           ...]
}
{'tournament_name': Y,
 'teams': [{'team_name': team1, 'twitter_name': twitter1},
           {'team_name': team2, 'twitter_name': twitter2},
           {'team_name': team3, 'twitter_name': twitter3},
           ...]
}
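So basically there would be just one item per tournament, containing the name and Twitter of every team in that tournament.
Right now, with the code I listed, it emits one item per team page (one item for each team in each tournament).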
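
One pattern that may help here is to visit the team pages one after another instead of all at once: each callback records its team on the shared item, then either asks for the next team page or, when none are left, emits the finished item. The sketch below shows only that control flow in plain Python so it can be run without Scrapy. The names record_team, crawl_tournament, pending and fake_pages are made up for illustration and are not part of Scrapy or of the code above. In a real spider, the 'request' branch would presumably become a `yield scrapy.Request(url, callback=self.parse_twitter, meta=meta)` and the 'item' branch a `yield item`.

```python
def record_team(item, team_name, handles, pending):
    """Add one team's result to the item and decide the next step.

    Returns ('request', url, meta) while teams remain, else ('item', item).
    handles plays the role of the extracted Twitter list (may be empty).
    """
    twitter = handles[0] if handles else ''
    item.setdefault('teams', []).append(
        {'team_name': team_name, 'twitter_name': twitter}
    )
    if pending:
        # Hand the rest of the queue to the next callback via meta
        (next_name, next_url), rest = pending[0], pending[1:]
        meta = {'item': item, 'team_name': next_name, 'pending': rest}
        return ('request', next_url, meta)
    # No teams left: the tournament item is complete
    return ('item', item)


def crawl_tournament(tournament_name, teams, fake_pages):
    """Hypothetical stand-in for the crawl loop, for demonstration only.

    teams is a list of (team_name, team_url) pairs;
    fake_pages maps a team URL to the list of handles found on that page.
    """
    item = {'tournament_name': tournament_name}
    if not teams:
        return item
    (name, url), pending = teams[0], teams[1:]
    while True:
        step = record_team(item, name, fake_pages.get(url, []), pending)
        if step[0] == 'item':
            return step[1]
        _, url, meta = step
        item, name, pending = meta['item'], meta['team_name'], meta['pending']


teams = [('Team A', '/teams/a'), ('Team B', '/teams/b')]
fake_pages = {'/teams/a': ['@team_a']}  # Team B lists no Twitter
print(crawl_tournament('X', teams, fake_pages))
```

The trade-off is that the team pages are fetched sequentially per tournament, and if one request in the chain fails the item for that tournament is never emitted unless the failure is handled as well (for example with an errback on the request).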