Trouble parsing with Beautiful Soup

Time: 2017-06-02 00:40:31

Tags: python web-scraping beautifulsoup

I'm having trouble parsing this XML, which doesn't seem to have any class references.

My code snippet:

sock = urllib2.urlopen(l)
link = sock.read()

soup = BeautifulSoup(link,"xml")

FirstNameHome=soup.find('home_probable_pitcher','first_name')

I want to find the first names of the home and away probable pitchers:

(There are only two instances, so I'm not sure whether I should use findAll.)

Here is the source, as printed via soup.prettify:
 LookupError: unknown encoding: <?xml version="1.0" encoding="UTF-8"?><!--Copyright 2017 MLB Advanced Media, L.P.  Use of any content on this page acknowledges agreement to the terms posted here http://gdx.mlb.com/components/copyright.txt-->
<game id="2017/06/02/nyamlb-tormlb-1" venue="Rogers Centre" game_pk="490921"
      time="7:07"
      time_date="2017/06/02 7:07"
      time_date_aw_lg="2017/06/02 7:07"
      time_date_hm_lg="2017/06/02 7:07"
      time_zone="ET"
      ampm="PM"
      first_pitch_et=""
      away_time="7:07"
      away_time_zone="ET"
      away_ampm="PM"
      home_time="7:07"
      home_time_zone="ET"
      home_ampm="PM"
      game_type="R"
      tiebreaker_sw="N"
      original_date="2017/06/02"
      time_zone_aw_lg="-4"
      time_zone_hm_lg="-4"
      time_aw_lg="7:07"
      aw_lg_ampm="PM"
      tz_aw_lg_gen="ET"
      time_hm_lg="7:07"
      hm_lg_ampm="PM"
      tz_hm_lg_gen="ET"
      venue_id="14"
      scheduled_innings="9"
      away_name_abbrev="NYY"
      home_name_abbrev="TOR"
      away_code="nya"
      away_file_code="nyy"
      away_team_id="147"
      away_team_city="NY Yankees"
      away_team_name="Yankees"
      away_division="E"
      away_league_id="103"
      away_sport_code="mlb"
      home_code="tor"
      home_file_code="tor"
      home_team_id="141"
      home_team_city="Toronto"
      home_team_name="Blue Jays"
      home_division="E"
      home_league_id="103"
      home_sport_code="mlb"
      day="FRI"
      gameday_sw="P"
      double_header_sw="N"
      game_nbr="1"
      tbd_flag="N"
      venue_w_chan_loc="CAXX0504"
      location="Toronto, Canada"
      gameday_link="2017_06_02_nyamlb_tormlb_1"
      away_win="30"
      away_loss="20"
      home_win="26"
      home_loss="27"
      game_data_directory="/components/game/mlb/year_2017/month_06/day_02/gid_2017_06_02_nyamlb_tormlb_1"
      league="AA"
      inning_state=""
      note=""
      status="Preview"
      ind="S"
      tv_station="SNET-1, MLBN (out-of-market only)">
   <home_probable_pitcher id="434538" first_name="Francisco" first="Francisco" last_name="Liriano"
                          last="Liriano"
                          name_display_roster="Liriano"
                          number="45"
                          throwinghand="LHP"
                          wins="2"
                          losses="2"
                          era="6.35"
                          s_wins="2"
                          s_losses="2"
                          s_era="6.35"
                          stats_season="2017"
                          stats_type="R"/>
   <away_probable_pitcher id="501381" first_name="Michael" first="Michael" last_name="Pineda"
                          last="Pineda"
                          name_display_roster="Pineda"
                          number="35"
                          throwinghand="RHP"
                          wins="6"
                          losses="2"
                          era="3.32"
                          s_wins="6"
                          s_losses="2"
                          s_era="3.32"
                          stats_season="2017"
                          stats_type="R"/>
   <game_media>
      <media type="game" calendar_event_id="14-490921-2017-06-02"
             start="2017-06-02T19:07:00-0400"
             title="NYY @ TOR"
             has_mlbtv="true"
             free="NO"
             enhanced="N"
             media_state="media_off"
             thumbnail="http://mediadownloads.mlb.com/mlbam/preview/nyator_490921_th_7_preview.jpg"/>
   </game_media>
</game>

1 answer:

Answer 0 (score: 3)

If we write

# for Python 3
# import urllib.request

import urllib2

from bs4 import BeautifulSoup

l = 'http://gd2.mlb.com/components/game/mlb/year_2017/month_06/day_03/gid_2017_06_03_arimlb_miamlb_1/linescore.xml'

sock = urllib2.urlopen(l)
# for Python 3
# sock = urllib.request.urlopen(l)
link = sock.read()

soup = BeautifulSoup(link, "xml")

FirstNameHome = soup.find('home_probable_pitcher').attrs['first_name']
print(FirstNameHome)

it gives

Edinson
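
The same lookup extends to the away pitcher. Below is a minimal sketch, assuming Python 3 and a trimmed copy of the XML from the question held in a string instead of being downloaded. The names xml_text and pitcher_first_names are made up for illustration. It uses the built-in "html.parser" only so that it runs without lxml; with lxml installed, "xml" as above should behave the same way for this document. Each element occurs once, so find is enough and findAll is not needed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the XML from the question (most attributes omitted).
xml_text = """<game venue="Rogers Centre" home_team_name="Blue Jays" away_team_name="Yankees">
   <home_probable_pitcher id="434538" first_name="Francisco" last_name="Liriano"/>
   <away_probable_pitcher id="501381" first_name="Michael" last_name="Pineda"/>
</game>"""


def pitcher_first_names(markup):
    """Return {'home': ..., 'away': ...}; a value is None if the tag or attribute is missing."""
    # "html.parser" is an assumption made here to avoid the lxml dependency.
    soup = BeautifulSoup(markup, "html.parser")
    names = {}
    for side in ("home", "away"):
        tag = soup.find(side + "_probable_pitcher")
        # Tag.get returns None instead of raising KeyError when the attribute is absent.
        names[side] = tag.get("first_name") if tag is not None else None
    return names


names = pitcher_first_names(xml_text)
print(names["home"], names["away"])
```

The first argument to find is the tag name. A second positional argument is treated as an attribute filter rather than as the name of an attribute to return, which is why soup.find('home_probable_pitcher','first_name') in the question finds nothing.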

print(soup.prettify(encoding='utf-8'))

gives

<?xml version="1.0" encoding="utf-8"?>
<!--Copyright 2017 MLB Advanced Media, L.P.  Use of any content on this page acknowledges agreement to the terms posted here http://gdx.mlb.com/components/copyright.txt-->
<game ampm="PM" aw_lg_ampm="PM" away_ampm="PM" away_code="ari" away_division="W" away_file_code="ari" away_league_id="104" away_loss="22" away_name_abbrev="ARI" away_sport_code="mlb" away_team_city="Arizona" away_team_id="109" away_team_name="D-backs" away_time="1:10" away_time_zone="MST" away_win="34" day="SAT" double_header_sw="N" first_pitch_et="" game_data_directory="/components/game/mlb/year_2017/month_06/day_03/gid_2017_06_03_arimlb_miamlb_1" game_nbr="1" game_pk="490927" game_type="R" gameday_link="2017_06_03_arimlb_miamlb_1" gameday_sw="P" hm_lg_ampm="PM" home_ampm="PM" home_code="mia" home_division="E" home_file_code="mia" home_league_id="104" home_loss="31" home_name_abbrev="MIA" home_sport_code="mlb" home_team_city="Miami" home_team_id="146" home_team_name="Marlins" home_time="4:10" home_time_zone="ET" home_win="21" id="2017/06/03/arimlb-miamlb-1" ind="S" inning_state="" league="NN" location="Miami, FL" note="" original_date="2017/06/03" scheduled_innings="9" status="Preview" tbd_flag="N" tiebreaker_sw="N" time="4:10" time_aw_lg="4:10" time_date="2017/06/03 4:10" time_date_aw_lg="2017/06/03 4:10" time_date_hm_lg="2017/06/03 4:10" time_hm_lg="4:10" time_zone="ET" time_zone_aw_lg="-4" time_zone_hm_lg="-4" tv_station="FS-F, MLBN (out-of-market only)" tz_aw_lg_gen="ET" tz_hm_lg_gen="ET" venue="Marlins Park" venue_id="4169" venue_w_chan_loc="USFL0316">
 <home_probable_pitcher era="4.44" first="Edinson" first_name="Edinson" id="450172" last="Volquez" last_name="Volquez" losses="7" name_display_roster="Volquez" number="36" s_era="4.44" s_losses="7" s_wins="1" stats_season="2017" stats_type="R" throwinghand="RHP" wins="1"/>
 <away_probable_pitcher era="3.47" first="Randall" first_name="Randall" id="517414" last="Delgado" last_name="Delgado" losses="0" name_display_roster="Delgado" number="48" s_era="3.47" s_losses="0" s_wins="1" stats_season="2017" stats_type="R" throwinghand="RHP" wins="1"/>
 <game_media>
  <media calendar_event_id="14-490927-2017-06-03" enhanced="N" free="NO" has_mlbtv="true" media_state="media_off" start="2017-06-03T16:10:00-0400" thumbnail="http://mediadownloads.mlb.com/mlbam/preview/arimia_490927_th_7_preview.jpg" title="ARI @ MIA" type="game"/>
 </game_media>
</game>

Edit

I can reproduce your error only when I pass the link object (or str(soup)) to the prettify method:

soup.prettify(link)

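That is not what you need: prettify accepts encoding (e.g. 'utf-8') and formatter (default 'minimal') as arguments, not the raw content. The first positional argument is read as the encoding name, which is why the whole document shows up in the "unknown encoding" message. So just write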

pretty = soup.prettify()

which gives

>>> type(pretty)
<type 'unicode'>

or specify an encoding

>>> pretty = soup.prettify(encoding='utf-8')

which gives

>>> type(pretty)
<type 'str'>
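
The session above is from Python 2, where text is unicode and encoded output is str. Under Python 3 the same two calls return str and bytes respectively. Here is a small sketch, assuming Python 3, an inline snippet in place of the downloaded file, and the built-in "html.parser" so that it runs without lxml:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded linescore.xml (an assumption for this sketch).
soup = BeautifulSoup('<game><media type="game"/></game>', "html.parser")

as_text = soup.prettify()                   # no encoding given: text
as_bytes = soup.prettify(encoding="utf-8")  # encoding given: encoded bytes

print(type(as_text).__name__)   # str
print(type(as_bytes).__name__)  # bytes
```

Passing the encoding by keyword, as here, makes it harder to hand the document itself to prettify by mistake.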