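I am trying to scrape a series of HTML tables (from different pages that share the same column names), but some pages show "No records". I want to skip those pages or assign NULL to the corresponding data frame.
Example: Dataframe 1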
url="http://stats.espncricinfo.com/ci/engine/player/28081.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=1;template=results;type=batting;view=match"
Batting=readHTMLTable(url)
Batting$"Match by match list"
Batting<-Batting$"Match by match list"
Dataframe 2
url="http://stats.espncricinfo.com/ci/engine/player/625383.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=2;template=results;type=batting;view=match"
Batting=readHTMLTable(url)
Batting$"Match by match list"
Batting<-Batting$"Match by match list"
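There are several such data frames. Some contain records in tabular form and some have no records.
When I bind one that has no records, the final data frame ends up with an error: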
final_DF<-rbind(Dataframe1,Dataframe2)
How can I solve this?
PS: For each URL query I add a set of columns that my data frame needs (for example, 5 columns added with cbind).
Answer 0: (score: 0)
You can do the following:
library(rvest)
library(tidyverse)
urls <- c(
  "http://stats.espncricinfo.com/ci/engine/player/28081.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=1;template=results;type=batting;view=match",
  "http://stats.espncricinfo.com/ci/engine/player/625383.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=2;template=results;type=batting;view=match"
)
# One single-row tibble of extra columns per URL, in the same order as urls
extra_cols <- list(
  tibble("Team"="IND","Player"="MS.Dhoni","won"=1,"lost"=0,"D"=1,"D/N"=0,"innings"=1,"Format"="ODI"),
  tibble("Team"="IND","Player"="MS.Dhoni","won"=1,"lost"=0,"D"=1,"D/N"=0,"innings"=1,"Format"="ODI")
)
# Read each page and select the results table via a CSS selector
doc <- map(urls, read_html) %>%
  map(html_node, ".engineTable:nth-child(5)")
# TRUE where the table exists; pages without records yield an "xml_missing" node
keep <- map_lgl(doc, ~ !inherits(.x, "xml_missing"))
# Parse the remaining tables, attach the extra columns, and row-bind everything
map(doc[keep], html_table, fill = TRUE) %>%
  map2_df(extra_cols[keep], cbind)
The key part is the keep vector: it is used to drop all list elements of class "xml_missing", i.e. the pages where no table was found.
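The filtering step can be tried offline with a minimal sketch. The two inline HTML strings and the selector "table.engineTable" are assumptions made up for illustration; they are not the real ESPNcricinfo markup.

```r
library(rvest)
library(purrr)

# Two hypothetical in-memory pages: one with a results table, one without
pages <- list(
  read_html('<html><body><table class="engineTable"><tr><th>Runs</th></tr><tr><td>10</td></tr></table></body></html>'),
  read_html('<html><body><p>No records available</p></body></html>')
)

# html_node() returns an object of class "xml_missing" when nothing matches
nodes <- map(pages, html_node, "table.engineTable")

# Logical vector marking the pages where a table was actually found
keep <- map_lgl(nodes, ~ !inherits(.x, "xml_missing"))

# Only the matched nodes are parsed into data frames
tables <- map(nodes[keep], html_table)
```

Here keep flags only the first page, so the second page never reaches html_table.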
Compared with your code, I use a CSS selector to tell html_node which table to extract. See http://selectorgadget.com/
In addition, the rbind is done internally by map2_df (last line).
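To see what that last step does in isolation, here is a small sketch with made-up data frames (the values and the names tables and extras are hypothetical). It assumes dplyr is installed, since map2_df relies on it for the row-binding.

```r
library(purrr)

# Hypothetical scraped tables, one per page that had records
tables <- list(
  data.frame(Runs = c(10, 25)),
  data.frame(Runs = 7)
)

# Hypothetical one-row sets of extra columns, in the same order as tables
extras <- list(
  data.frame(Team = "IND"),
  data.frame(Team = "AUS")
)

# cbind() recycles each one-row extra across its table,
# then map2_df() row-binds the per-page results into one data frame
combined <- map2_df(tables, extras, cbind)
```

Because the extras are indexed with the same keep vector as the tables in the answer, each table stays paired with its own extra columns.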
This results in the following (using %>% {head(.[,c("Bat1", "Runs", "Team")])}):
  Bat1 Runs Team
1    0    0  IND
2    3    3  IND
3  148  148  IND
4   56   56  IND
5   38   38  IND
6   20   20  IND