Beautiful Soup can't find the first tag (XML)

Date: 2017-04-17 02:05:36

Tags: python xml beautifulsoup tags

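I'm using BeautifulSoup 4 (with the lxml parser) to parse XML files from the MLB API. The API generates a scoreboard for the current games on a given date, and I'm having trouble getting Beautiful Soup to identify a particular tag.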

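For example, I'm looking at today's games and trying to extract a team's score and name based on away_file_code or home_file_code. If we look at the Baltimore Orioles and Toronto Blue Jays, the game scoreboard XML looks like this: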

<games year="2017" month="04" day="16" modified_date="2017-04-17T01:42:57Z" next_day_date="2017-04-17">
<game id="2017/04/16/balmlb-tormlb-1" venue="Rogers Centre" game_pk="490271" time="1:07" time_date="2017/04/16 1:07" time_date_aw_lg="2017/04/16 1:07" time_date_hm_lg="2017/04/16 1:07" time_zone="ET" ampm="PM" first_pitch_et="" away_time="1:07" away_time_zone="ET" away_ampm="PM" home_time="1:07" home_time_zone="ET" home_ampm="PM" game_type="R" tiebreaker_sw="N" resume_date="" original_date="2017/04/16" time_zone_aw_lg="-4" time_zone_hm_lg="-4" time_aw_lg="1:07" aw_lg_ampm="PM" tz_aw_lg_gen="ET" time_hm_lg="1:07" hm_lg_ampm="PM" tz_hm_lg_gen="ET" venue_id="14" scheduled_innings="9" description="" away_name_abbrev="BAL" home_name_abbrev="TOR" away_code="bal" away_file_code="bal" away_team_id="110" away_team_city="Baltimore" away_team_name="Orioles" away_division="E" away_league_id="103" away_sport_code="mlb" home_code="tor" home_file_code="tor" home_team_id="141" home_team_city="Toronto" home_team_name="Blue Jays" home_division="E" home_league_id="103" home_sport_code="mlb" day="SUN" gameday_sw="P" double_header_sw="N" game_nbr="1" tbd_flag="N" away_games_back="-" home_games_back="6.5" away_games_back_wildcard="" home_games_back_wildcard="5.5" venue_w_chan_loc="CAXX0504" location="Toronto, Canada" gameday="2017_04_16_balmlb_tormlb_1" away_win="8" away_loss="3" home_win="2" home_loss="10" game_data_directory="/components/game/mlb/year_2017/month_04/day_16/gid_2017_04_16_balmlb_tormlb_1" league="AA">
<status status="Final" ind="F" reason="" inning="9" top_inning="N" b="0" s="0" o="3" inning_state="" note="" is_perfect_game="N" is_no_hitter="N"/>
<linescore>...</linescore>
<home_runs>...</home_runs>
<winning_pitcher id="605164" last="Bundy" first="Dylan" name_display_roster="Bundy" number="37" era="1.86" wins="2" losses="1"/>
<losing_pitcher id="457918" last="Happ" first="J.A." name_display_roster="Happ" number="33" era="4.50" wins="0" losses="3"/>
<save_pitcher id="" last="" first="" number="" name_display_roster="" era="0" wins="0" losses="0" saves="0" svo="0"/>
<links mlbtv="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'video'})" wrapup="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=wrap&c_id=mlb" home_audio="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'audio'})" away_audio="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'audio'})" home_preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" away_preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" tv_station="SNET-1"/>
<broadcast>...</broadcast>
<alerts text="Final score in Toronto: Baltimore 11, Toronto 4" brief_text="At TOR: Final - BAL 11, TOR 4" type="status"/>
<game_media>...</game_media>
<video_thumbnail>...</video_thumbnail>
<video_thumbnails>...</video_thumbnails>
</game>
<game>...</game> (etc...)

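Below is the code snippet I'm using to try to find the game (not games) tag and its attributes. The problem is that when I ask for game, it returns None. However, I can return any other tag without a problem; for example, status works fine.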

from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'xml')  # webpage is the XML file for today's games
# Supposed to find the <game> tags whose home_file_code matches the home team's abbreviation
tags = soup.find_all('game', {'home_file_code': 'tor'})
for games in tags:
    print(games.find('status')['status'])  # works without an issue
    print(games.find('game')['home_file_code'])  # raises the error below, because games.find('game') is None

TypeError: 'NoneType' object is not subscriptable

Also, if I print the list of children (print(list(games.children))), it returns everything except game.

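Is there something about XML that I'm missing that explains why it can't get the first tag? I'm confused, because this worked for me not long ago, and I'm not sure what I changed to cause the error.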

2 Answers:

Answer 0: (score: 0)

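It looks like I misunderstood the find function. find searches the descendants of a tag, not the tag itself, so calling games.find('game') on a game tag finds nothing. To read an attribute that lives on the tag itself, you index the tag with the attribute name. So, basically, I should be doing the following: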

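from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'xml')  # webpage is the XML file for today's games
tags = soup.find_all('game', {'home_file_code': 'tor'})
for games in tags:
    print(games.find('status')['status'])  # <status> is a child of <game>, so find() reaches it
    print(games['home_file_code'])  # read the attribute directly from the matched <game> tag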

Now print(games['home_file_code']) finds home_file_code as expected, because the attribute already lives on the tag we looked up.

I'm sure someone can give a more thorough answer, but that was the fundamental misunderstanding I had.
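
To make the difference concrete, here is a minimal, self-contained sketch. The SAMPLE string is a trimmed-down, hypothetical stand-in for the real scoreboard feed (the second game's team codes are made up), and it assumes lxml is installed so that the 'xml' feature is available.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the scoreboard XML in the question.
SAMPLE = """
<games year="2017" month="04" day="16">
  <game home_file_code="tor" away_file_code="bal">
    <status status="Final" inning="9"/>
  </game>
  <game home_file_code="nyy" away_file_code="stl">
    <status status="Final" inning="9"/>
  </game>
</games>
"""

soup = BeautifulSoup(SAMPLE, "xml")  # the "xml" feature needs lxml installed

# Select the <game> tag whose home_file_code attribute is "tor".
game = soup.find("game", {"home_file_code": "tor"})

# find() searches descendants only; <game> has no nested <game>, so this is None.
nested = game.find("game")

# Attributes of the matched tag are read from the tag itself.
home = game["home_file_code"]
away = game.get("away_file_code")  # .get() returns None instead of raising if absent

# Child tags are still reachable through find().
status = game.find("status")["status"]

# Direct child tags only; the <game> tag itself is not among its own children.
child_names = [tag.name for tag in game.find_all(True, recursive=False)]

print(nested, home, away, status, child_names)
```

This also matches what the question saw with games.children: the children of a game tag are the tags nested inside it, so game itself never appears in that list.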

Answer 1: (score: 0)

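I'm not the greatest programmer, but I'm pretty sure you're not finding the first tag because it isn't defined correctly. XML tags (if they contain any content) must have an opening and a closing part, like this: <games>year="2017" month="04" day="16"</games> rather than this: <games year="2017" month="04" day="16"> So you first need to fix the XML format and then go from there.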
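
One way to test this suggestion is to parse both forms with Python's standard-library xml.etree.ElementTree. The sketch below uses only the games element from the question, with an explicit closing tag added so that each string is a complete document. Both forms parse, but only the attribute form exposes year, month and day as attributes, which suggests the feed's XML is fine as written.

```python
import xml.etree.ElementTree as ET

# Attributes written inside the opening tag, as in the scoreboard feed.
with_attrs = ET.fromstring('<games year="2017" month="04" day="16"></games>')

# The form suggested above: the same characters placed between the tags.
as_text = ET.fromstring('<games>year="2017" month="04" day="16"</games>')

print(with_attrs.attrib)  # attributes parsed into a dict
print(as_text.attrib)     # empty: nothing here is an attribute
print(as_text.text)       # the characters are plain text content instead
```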