I can't parse this XML, which doesn't seem to have any class references.
My code snippet:
sock = urllib2.urlopen(l)
link = sock.read()
soup = BeautifulSoup(link,"xml")
FirstNameHome=soup.find('home_probable_pitcher','first_name')
I want to get the home and away names:
(There are only two instances, so I'm not sure whether I should use findAll.)
Here is what I get when I use soup.prettify:
LookupError: unknown encoding: <?xml version="1.0" encoding="UTF-8"?><!--Copyright 2017 MLB Advanced Media, L.P. Use of any content on this page acknowledges agreement to the terms posted here http://gdx.mlb.com/components/copyright.txt-->
<game id="2017/06/02/nyamlb-tormlb-1" venue="Rogers Centre" game_pk="490921"
time="7:07"
time_date="2017/06/02 7:07"
time_date_aw_lg="2017/06/02 7:07"
time_date_hm_lg="2017/06/02 7:07"
time_zone="ET"
ampm="PM"
first_pitch_et=""
away_time="7:07"
away_time_zone="ET"
away_ampm="PM"
home_time="7:07"
home_time_zone="ET"
home_ampm="PM"
game_type="R"
tiebreaker_sw="N"
original_date="2017/06/02"
time_zone_aw_lg="-4"
time_zone_hm_lg="-4"
time_aw_lg="7:07"
aw_lg_ampm="PM"
tz_aw_lg_gen="ET"
time_hm_lg="7:07"
hm_lg_ampm="PM"
tz_hm_lg_gen="ET"
venue_id="14"
scheduled_innings="9"
away_name_abbrev="NYY"
home_name_abbrev="TOR"
away_code="nya"
away_file_code="nyy"
away_team_id="147"
away_team_city="NY Yankees"
away_team_name="Yankees"
away_division="E"
away_league_id="103"
away_sport_code="mlb"
home_code="tor"
home_file_code="tor"
home_team_id="141"
home_team_city="Toronto"
home_team_name="Blue Jays"
home_division="E"
home_league_id="103"
home_sport_code="mlb"
day="FRI"
gameday_sw="P"
double_header_sw="N"
game_nbr="1"
tbd_flag="N"
venue_w_chan_loc="CAXX0504"
location="Toronto, Canada"
gameday_link="2017_06_02_nyamlb_tormlb_1"
away_win="30"
away_loss="20"
home_win="26"
home_loss="27"
game_data_directory="/components/game/mlb/year_2017/month_06/day_02/gid_2017_06_02_nyamlb_tormlb_1"
league="AA"
inning_state=""
note=""
status="Preview"
ind="S"
tv_station="SNET-1, MLBN (out-of-market only)">
<home_probable_pitcher id="434538" first_name="Francisco" first="Francisco" last_name="Liriano"
last="Liriano"
name_display_roster="Liriano"
number="45"
throwinghand="LHP"
wins="2"
losses="2"
era="6.35"
s_wins="2"
s_losses="2"
s_era="6.35"
stats_season="2017"
stats_type="R"/>
<away_probable_pitcher id="501381" first_name="Michael" first="Michael" last_name="Pineda"
last="Pineda"
name_display_roster="Pineda"
number="35"
throwinghand="RHP"
wins="6"
losses="2"
era="3.32"
s_wins="6"
s_losses="2"
s_era="3.32"
stats_season="2017"
stats_type="R"/>
<game_media>
<media type="game" calendar_event_id="14-490921-2017-06-02"
start="2017-06-02T19:07:00-0400"
title="NYY @ TOR"
has_mlbtv="true"
free="NO"
enhanced="N"
media_state="media_off"
thumbnail="http://mediadownloads.mlb.com/mlbam/preview/nyator_490921_th_7_preview.jpg"/>
</game_media>
</game>
Answer 0 (score: 3)
If we write
# for Python 3
# import urllib.request
import urllib2
from bs4 import BeautifulSoup
l = 'http://gd2.mlb.com/components/game/mlb/year_2017/month_06/day_03/gid_2017_06_03_arimlb_miamlb_1/linescore.xml'
sock = urllib2.urlopen(l)
# for Python 3
# sock = urllib.request.urlopen(l)
link = sock.read()
soup = BeautifulSoup(link, "xml")
FirstNameHome = soup.find('home_probable_pitcher').attrs['first_name']
print(FirstNameHome)
it gives
Edinson
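Since the home and away pitchers live in two differently named tags, one lookup per tag is enough and findAll is not needed. With BeautifulSoup the away side would be read the same way as the home side, e.g. soup.find('away_probable_pitcher')['first_name']. The sketch below shows the same idea without network access or third-party packages, swapping in the standard library's xml.etree.ElementTree for BeautifulSoup. SAMPLE_XML is a trimmed copy of the XML from the question, and the helper name probable_pitcher_first_names is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (most attributes omitted).
# Bytes are used because sock.read() also returns raw bytes.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<game id="2017/06/02/nyamlb-tormlb-1" venue="Rogers Centre">
  <home_probable_pitcher id="434538" first_name="Francisco" last_name="Liriano"/>
  <away_probable_pitcher id="501381" first_name="Michael" last_name="Pineda"/>
</game>
"""


def probable_pitcher_first_names(xml_bytes):
    """Return (home, away) first names from a linescore-style document.

    Hypothetical helper: assumes both pitcher tags are direct children
    of the root <game> element, as in the sample above.
    """
    root = ET.fromstring(xml_bytes)
    home = root.find("home_probable_pitcher")
    away = root.find("away_probable_pitcher")
    if home is None or away is None:
        raise ValueError("probable pitcher tags not found")
    # first_name is an XML attribute, not a CSS class, so read it with .get()
    return home.get("first_name"), away.get("first_name")


home_first, away_first = probable_pitcher_first_names(SAMPLE_XML)
print(home_first, away_first)
```

The key point carries over to BeautifulSoup: first_name is an attribute of the tag, so it has to be read from the found element rather than passed as the second argument to find.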
Also,
print(soup.prettify(encoding='utf-8'))
gives
<?xml version="1.0" encoding="utf-8"?>
<!--Copyright 2017 MLB Advanced Media, L.P. Use of any content on this page acknowledges agreement to the terms posted here http://gdx.mlb.com/components/copyright.txt-->
<game ampm="PM" aw_lg_ampm="PM" away_ampm="PM" away_code="ari" away_division="W" away_file_code="ari" away_league_id="104" away_loss="22" away_name_abbrev="ARI" away_sport_code="mlb" away_team_city="Arizona" away_team_id="109" away_team_name="D-backs" away_time="1:10" away_time_zone="MST" away_win="34" day="SAT" double_header_sw="N" first_pitch_et="" game_data_directory="/components/game/mlb/year_2017/month_06/day_03/gid_2017_06_03_arimlb_miamlb_1" game_nbr="1" game_pk="490927" game_type="R" gameday_link="2017_06_03_arimlb_miamlb_1" gameday_sw="P" hm_lg_ampm="PM" home_ampm="PM" home_code="mia" home_division="E" home_file_code="mia" home_league_id="104" home_loss="31" home_name_abbrev="MIA" home_sport_code="mlb" home_team_city="Miami" home_team_id="146" home_team_name="Marlins" home_time="4:10" home_time_zone="ET" home_win="21" id="2017/06/03/arimlb-miamlb-1" ind="S" inning_state="" league="NN" location="Miami, FL" note="" original_date="2017/06/03" scheduled_innings="9" status="Preview" tbd_flag="N" tiebreaker_sw="N" time="4:10" time_aw_lg="4:10" time_date="2017/06/03 4:10" time_date_aw_lg="2017/06/03 4:10" time_date_hm_lg="2017/06/03 4:10" time_hm_lg="4:10" time_zone="ET" time_zone_aw_lg="-4" time_zone_hm_lg="-4" tv_station="FS-F, MLBN (out-of-market only)" tz_aw_lg_gen="ET" tz_hm_lg_gen="ET" venue="Marlins Park" venue_id="4169" venue_w_chan_loc="USFL0316">
<home_probable_pitcher era="4.44" first="Edinson" first_name="Edinson" id="450172" last="Volquez" last_name="Volquez" losses="7" name_display_roster="Volquez" number="36" s_era="4.44" s_losses="7" s_wins="1" stats_season="2017" stats_type="R" throwinghand="RHP" wins="1"/>
<away_probable_pitcher era="3.47" first="Randall" first_name="Randall" id="517414" last="Delgado" last_name="Delgado" losses="0" name_display_roster="Delgado" number="48" s_era="3.47" s_losses="0" s_wins="1" stats_season="2017" stats_type="R" throwinghand="RHP" wins="1"/>
<game_media>
<media calendar_event_id="14-490927-2017-06-03" enhanced="N" free="NO" has_mlbtv="true" media_state="media_off" start="2017-06-03T16:10:00-0400" thumbnail="http://mediadownloads.mlb.com/mlbam/preview/arimia_490927_th_7_preview.jpg" title="ARI @ MIA" type="game"/>
</game_media>
</game>
I can reproduce your error only when I pass the link object (or str(soup)) to the prettify method:
soup.prettify(link)
Well, that is not what you need: the prettify parameters are encoding (e.g. 'utf-8') and formatter (default 'minimal'), not the raw content, so just write
pretty = soup.prettify()
which gives
>>> type(pretty)
<type 'unicode'>
or specify the encoding
>>> pretty = soup.prettify(encoding='utf-8')
which gives
>>> type(pretty)
<type 'str'>
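The LookupError in the question can be illustrated without BeautifulSoup. The first positional parameter of prettify is encoding, so passing the document text there presumably hands that text to Python's codec lookup as if it were an encoding name. The sketch below reproduces only that last step with the standard library; bogus_encoding is an illustrative stand-in for the downloaded content, and the exact call path inside BeautifulSoup is an assumption.

```python
import codecs

# Stand-in for the document text that was passed where an encoding
# name (such as 'utf-8') was expected.
bogus_encoding = '<?xml version="1.0" encoding="UTF-8"?>'

try:
    codecs.lookup(bogus_encoding)
    error_message = None
except LookupError as exc:
    # Python cannot find a codec registered under this "name".
    error_message = str(exc)

print(error_message)

# A real encoding name resolves without error.
utf8_name = codecs.lookup("utf-8").name
print(utf8_name)
```

The types shown above are from Python 2. Under Python 3, soup.prettify() returns str and soup.prettify(encoding='utf-8') returns bytes.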