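I am trying to convert the XML file produced by the Retrosheet boxscore tool into a data frame that I can insert into a SQL table. I am most of the way there, but I can't figure out how to grab the attributes of an intermediate XML node. Below is an example; hopefully I pasted it correctly. What I want to grab is game_id (from boxscore), id (from player), and the complete batting section.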
<boxscores>
<boxscore game_id="CHA191204110" date="1912/04/11" site="CHI10"
visitor="SLA" visitor_city="St.Louis" visitor_name="Browns" home="CHA"
home_city="Chicago" home_name="White Sox" start_time="0:00PM"
day_night="day" temperature="0" wind_direction="unknown" wind_speed="-1"
field_condition="unknown" precip="unknown" sky="unknown" time_of_game="110"
attendance="30000" umpire_hp="evanb901" umpire_1b="eganr101" umpire_2b=""
umpire_3b="" >
<linescore away_runs="2" away_hits="7" away_errors="1" home_runs="6"
home_hits="10" home_errors="1">
<inning_line_score away="0" home="0" inning="1"/>
<inning_line_score away="0" home="0" inning="2"/>
<inning_line_score away="0" home="1" inning="3"/>
<inning_line_score away="0" home="0" inning="4"/>
<inning_line_score away="2" home="0" inning="5"/>
<inning_line_score away="0" home="1" inning="6"/>
<inning_line_score away="0" home="1" inning="7"/>
<inning_line_score away="0" home="3" inning="8"/>
<inning_line_score away="0" home="x" inning="9"/>
</linescore>
<players team="SLA" lob="5" dp="0" tp="0" risp_ab="0" risp_h="0">
<player id="shotb101" lname="Shotton" fname="Burt" slot="1" seq="1" pos="8">
<batting ab="4" r="0" h="0" d="0" t="0" hr="0" bi="0" bi2out="-1" bb="0" ibb="-1" so="3" gdp="-1" hp="0" sh="0" sf="-1" sb="0" cs="-1" />
<fielding pos="8" outs="24" po="1" a="0" e="0" dp="0" tp="0" bip="-1" bf="-1" />
</player>
<player id="austj101" lname="Austin" fname="Jimmy" slot="2" seq="1" pos="5">
<batting ab="4" r="0" h="1" d="0" t="0" hr="0" bi="0" bi2out="-1" bb="0" ibb="-1" so="1" gdp="-1" hp="0" sh="0" sf="-1" sb="0" cs="-1" />
<fielding pos="5" outs="24" po="0" a="3" e="0" dp="0" tp="0" bip="-1" bf="-1" />
</player>
<player id="stovg101" lname="Stovall" fname="George" slot="3" seq="1" pos="3" >
<batting ab="4" r="0" h="1" d="0" t="0" hr="0" bi="0" bi2out="-1" bb="0" ibb="-1" so="0" gdp="-1" hp="0" sh="0" sf="-1" sb="0" cs="-1" />
<fielding pos="3" outs="24" po="11" a="0" e="0" dp="0" tp="0" bip="-1" bf="-1" />
</player>
</players>
</boxscore>
</boxscores>
Here is the code I am using:
library(xml2)
library(dplyr)

box <- read_xml("Q:\\Sabermetrics\\Retrosheet\\download.folder\\unzipped\\1912.xml")
atbat <- xml_find_all(box, "//boxscore")
bind_rows(lapply(atbat, function(x) {
  # all batting nodes for this game
  player <- try(xml_find_all(x, "./players/player/batting"), silent = FALSE)
  if (inherits(player, "try-error") || length(player) == 0) return(NULL)
  # one row of batting attributes per player
  bind_rows(lapply(player, function(y) {
    data.frame(t(xml_attrs(y)), stringsAsFactors = FALSE)
  })) -> player_dat
  # game_id is an attribute of the boxscore node itself
  game_id <- try(xml_attr(x, "game_id"))
  if (inherits(game_id, "try-error") || length(game_id) == 0) return(NULL)
  player_dat$game_id <- game_id
  player_dat
})) -> player
I would like to end up with something like this:
game_id       player_id  ab  r  h  d  ....
CHA191204110  shotb101   4   0  0  0  ....
CHA191204110  austj101   4   0  1  0  ....
CHA191204110  stovg101   4   0  1  0  ....
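I tried copying the game_id code to grab the 'id' attribute from player, but it doesn't work. I tried the paths ./players/player[@id] and ./players/player/@id, and neither worked. I also tried just @id and still got NA.
I'm not sure what I'm doing wrong, and at this point I'm just throwing things at the wall to see what sticks...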
Answer 0 (score: 0)
Does this help?
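library(XML)

xml <- xmlParse('Q:\\Sabermetrics\\Retrosheet\\download.folder\\unzipped\\1912.xml')
# nested list; attributes of each node end up in its .attrs element
lxml <- xmlToList(xml)
# one wide row: the boxscore attributes followed by everything under players
df <- cbind(t(lxml$boxscore$.attrs), t(data.frame(unlist(lxml$boxscore$players))))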
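You can extract additional information from the XML by passing more arguments to cbind().
I assume you are iterating over multiple XML files, so in principle you could wrap something like this in lapply() (which, unlike sapply(), always returns a list), convert each result with as.data.frame(), and then collect everything into one df by doing:
library(plyr); do.call(rbind.fill, your_df_list)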
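If you would rather stay with xml2, as in the question, here is a rough sketch of one way to get the long format shown there. The likely reason for the NA is that id sits on the player node, while the code only ever calls xml_attr() on the boxscore node or on the batting nodes. Iterating over the player nodes and reading the batting child from each one avoids that. The inline sample_xml below is a trimmed stand-in for the real file, the name batting_df is made up for illustration, and the player[batting] predicate assumes you only want players that actually have a batting child.

```r
library(xml2)

# Trimmed stand-in for the real 1912.xml (assumption: same nesting as in the question)
sample_xml <- '<boxscores>
  <boxscore game_id="CHA191204110">
    <players team="SLA">
      <player id="shotb101"><batting ab="4" r="0" h="0" d="0"/></player>
      <player id="austj101"><batting ab="4" r="0" h="1" d="0"/></player>
      <player id="stovg101"><batting ab="4" r="0" h="1" d="0"/></player>
    </players>
  </boxscore>
</boxscores>'

box <- read_xml(sample_xml)

# Select player nodes (not batting nodes), keeping only players with a batting child
players <- xml_find_all(box, "//boxscore/players/player[batting]")

rows <- lapply(seq_along(players), function(i) {
  p    <- players[[i]]
  bat  <- xml_find_first(p, "./batting")          # child node holding the stats
  game <- xml_find_first(p, "ancestor::boxscore") # enclosing game
  data.frame(
    game_id   = xml_attr(game, "game_id"),
    player_id = xml_attr(p, "id"),
    as.list(xml_attrs(bat)),                      # one column per batting attribute
    stringsAsFactors = FALSE
  )
})

# rbind() assumes every batting node carries the same attributes;
# dplyr::bind_rows(rows) is more forgiving if they differ
batting_df <- do.call(rbind, rows)
print(batting_df)
```

All columns come back as character, so you will probably want to convert the stat columns with as.integer() before loading them into SQL.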