Web scraping - no records found

Time: 2017-02-07 14:31:41

Tags: html r xml dataframe web-scraping

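I am trying to scrape a series of HTML tables (from different pages that share the same column names), but some pages have no records. I want to skip those pages or assign NULL to the data frame.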

Example: Dataframe 1

url="http://stats.espncricinfo.com/ci/engine/player/28081.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=1;template=results;type=batting;view=match"

Batting=readHTMLTable(url)

Batting$"Match by match list"

Batting<-Batting$"Match by match list"

Dataframe 2

url="http://stats.espncricinfo.com/ci/engine/player/625383.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=2;template=results;type=batting;view=match"

Batting=readHTMLTable(url)

Batting$"Match by match list"

Batting<-Batting$"Match by match list"

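There are several data frames like these. Some contain records in tabular form and some contain no records.

When I bind one that has no records, building the final data frame fails with an error: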

final_DF<-rbind(Dataframe1,Dataframe2)

How can I solve this?

PS: For each URL query, I add a set of columns to the data frame according to my requirements (for example, 5 columns added with cbind).

1 answer:

Answer 0 (score: 0)

You can do the following:

require(rvest)
require(tidyverse)

urls <- c(
  "http://stats.espncricinfo.com/ci/engine/player/28081.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=1;template=results;type=batting;view=match",
  "http://stats.espncricinfo.com/ci/engine/player/625383.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=2;template=results;type=batting;view=match"
)

extra_cols <- list(
  tibble("Team"="IND","Player"="MS.Dhoni","won"=1,"lost"=0,"D"=1,"D/N"=0,"innings"=1,"Format"="ODI"),
  tibble("Team"="IND","Player"="MS.Dhoni","won"=1,"lost"=0,"D"=1,"D/N"=0,"innings"=1,"Format"="ODI")
)

doc <- map(urls, read_html) %>% 
  map(html_node, ".engineTable:nth-child(5)")

keep <- map_lgl(doc, ~class(.) != "xml_missing")

map(doc[keep], html_table, fill = TRUE) %>% 
  map2_df(extra_cols[keep], cbind)

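The key part is the keep vector: it flags the list elements that are not of class "xml_missing", so the empty ones (pages where no table was found) are dropped before html_table is called.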
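The filtering step can be tried offline. The following is a minimal sketch, assuming two made-up in-memory pages (they are not the real ESPNcricinfo markup) and a simplified ".engineTable" selector: html_node returns an object of class "xml_missing" when nothing matches, and that is what gets filtered out.

```r
library(rvest)
library(purrr)

# Two hypothetical pages for illustration: one with a table, one without.
pages <- list(
  read_html("<html><body><table class='engineTable'><tr><th>Runs</th></tr><tr><td>42</td></tr></table></body></html>"),
  read_html("<html><body><p>No records available to match this query</p></body></html>")
)

# html_node() yields an "xml_missing" object when the selector matches nothing.
nodes <- map(pages, html_node, ".engineTable")

# inherits() is used here as a slightly more defensive class check.
keep <- map_lgl(nodes, ~ !inherits(.x, "xml_missing"))

# Only the pages that actually contain a table are parsed.
tables <- map(nodes[keep], html_table)
```

Here keep marks the first page as usable and the second as empty, so only one table is parsed and no error is raised for the page without records.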

Compared with your code, I use a CSS selector to specify the html_node from which the table should be extracted. See http://selectorgadget.com/

In addition, the rbind is done internally by map2_df (last line).
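To illustrate how the last line combines the two steps, here is a small sketch with made-up data frames standing in for the scraped tables and the extra columns (the values and the "AUS" row are illustrative only). map2_df applies cbind pairwise and then row-binds the results; it needs dplyr installed, and recent purrr versions mark it as superseded, so treat this as one possible way rather than the only one.

```r
library(purrr)
library(dplyr)  # map2_df() uses dplyr::bind_rows() to stack the pieces

# Hypothetical stand-ins for the scraped tables.
tables <- list(
  data.frame(Runs = c(10, 20)),
  data.frame(Runs = 30)
)

# Hypothetical one-row extra columns, one entry per table.
extra <- list(
  data.frame(Team = "IND", Format = "ODI", stringsAsFactors = FALSE),
  data.frame(Team = "AUS", Format = "ODI", stringsAsFactors = FALSE)
)

# cbind() recycles each one-row data frame to the table's length,
# then map2_df() binds the resulting data frames by row.
combined <- map2_df(tables, extra, cbind)
```

Because the extra columns are attached per table before the row-binding, tables and extra must be filtered with the same keep vector so that they stay aligned.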

This results in the following (using %>% {head(.[,c("Bat1", "Runs", "Team")])}):

  Bat1 Runs Team
1    0    0  IND
2    3    3  IND
3  148  148  IND
4   56   56  IND
5   38   38  IND
6   20   20  IND