Question

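I am using R to scrape data from the following web page: http://www.baseball-reference.com/boxes/BAL/BAL201403310.shtml. One item I am particularly interested in is the Start Time Weather (about halfway down the page), but I cannot scrape it.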

Using SelectorGadget, I wrote:

library(rvest)

game <- read_html("http://www.baseball-reference.com/boxes/BAL/BAL201403310.shtml")

weather <- game %>%
    html_node(".section_wrapper+ .section_wrapper div:nth-child(5)") %>%
    html_text()

weather

[1] NA

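How can I modify my code to avoid the NA? The same thing happens on the pages for other games.

I hope you can help! I can't seem to find the right approach.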

Answer 1

You can use readLines to grep the "Start Time Weather" line out of the raw page source before parsing it, like this:

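library(rvest)

# http://www.baseball-reference.com/boxes/ARI/ARI201403220.shtml
# Read the raw page source, one element per line
lines <- readLines("http://www.baseball-reference.com/boxes/BAL/BAL201403310.shtml")

# Parse only the first line that contains the label, then extract its text
weather <- read_html(grep("Start Time Weather", lines, fixed = TRUE, value = TRUE)[1]) %>%
    html_node("div") %>%
    html_text()
# Strip the label, leaving just the weather description
gsub("Start Time Weather: ", "", weather)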
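If you need this for several games, the same idea can be wrapped in a small function. The following is only a sketch: extract_weather is a hypothetical helper name, and sample_lines is made-up markup standing in for the output of readLines(), so the real page may be structured differently. It returns NA when no line contains the label, which makes missing data easy to spot.

```r
library(rvest)

# Hypothetical helper: pull the "Start Time Weather" text out of raw HTML lines.
extract_weather <- function(lines) {
  # Keep only the lines that contain the literal label
  hits <- grep("Start Time Weather", lines, fixed = TRUE, value = TRUE)
  if (length(hits) == 0) return(NA_character_)
  # Parse the first matching line as an HTML fragment and take the div's text
  txt <- read_html(hits[1]) %>%
    html_node("div") %>%
    html_text()
  # Drop the label and any surrounding whitespace
  trimws(sub("Start Time Weather:", "", txt, fixed = TRUE))
}

# Made-up sample lines (assumption: the label and value sit in one <div>)
sample_lines <- c(
  "<div><strong>Venue</strong>: Example Park</div>",
  "<div><strong>Start Time Weather:</strong> 60 F, wind 10 mph, sunny.</div>"
)
extract_weather(sample_lines)
```

For a real page you would call extract_weather(readLines(url)) for each game URL.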
