R Web Scraping - Data Frame

Time: 2018-04-16 12:26:45

Tags: r web-scraping rvest

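I am trying to create a data frame from the variables below. However, after using the SelectorGadget tool to work out the CSS selectors needed to scrape this information, the resulting vectors come out with different lengths, even when I copy the selectors directly from the HTML source. If done correctly, the table should have 34 rows. Here is my code and the resulting error: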

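library(rvest)    # read_html(), html_nodes(), html_text(), %>%
library(stringr)  # str_replace()
library(tibble)   # data_frame()

womens_bb <- read_html("http://gomason.com/schedule.aspx?path=wbball")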

womens_opponents <- womens_bb %>%
html_nodes(".sidearm-schedule-game-opponent-name a") %>%
html_text()

womens_locations <- womens_bb %>%
html_nodes(".sidearm-schedule-game-location span:nth-child(1)") %>%
html_text()

womens_dates <- womens_bb %>%
html_nodes(".sidearm-schedule-game-opponent-date span:nth-child(1)") %>%
html_text() 

womens_times <- womens_bb %>%
html_nodes(".sidearm-schedule-game-opponent-date span:nth-child(2)") %>%
html_text()
# A bare as.numeric() followed here without a pipe, so it never touched womens_times

womens_scores <- womens_bb %>%
html_nodes("div.sidearm-schedule-game-result span:nth-child(3)") %>%
html_text()
# A bare as.numeric() followed here without a pipe, so it never touched womens_scores

womens_win_loss <- womens_bb %>%
html_nodes(".text-italic span:nth-child(2)") %>%
html_text() %>%
str_replace("\\,", "")

womens_df <- data_frame(
  date = womens_dates, time = womens_times, opponent = womens_opponents, location = womens_locations, score = womens_scores, win_loss = womens_win_loss)

Error: Columns `date`, `time`, `opponent`, `score`, `win_loss` must be length 1 or 37, not 36, 36, 34, 34, 35

How can I fix this?

1 answer:

Answer 0: (score: 1)

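I think the img tags are causing some problems. To avoid them, you can first collect the enclosing div tag of each game (36 when I ran the script) and then loop over those divs to extract the fields. Where a tag looks odd, apply a little cleanup: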

library(rvest)

womens_bb <- read_html("http://gomason.com/schedule.aspx?path=wbball")
# One node per game, so each field is looked up inside its own game block
divs <- womens_bb %>% html_nodes(".sidearm-schedule-game")

df <- NULL  # must exist before the loop; rows are appended with rbind()

for (div in divs){

  # The opponent name may or may not be wrapped in a link; keep the first match
  womens_opponents <- div %>%
    html_nodes(".sidearm-schedule-game-opponent-name, .sidearm-schedule-game-opponent-name a") %>%
    html_text()
  womens_opponents <- gsub("\\s{2,}","",womens_opponents[1])

  womens_locations <- div %>%
    html_nodes(".sidearm-schedule-game-location span:nth-child(1)") %>%
    html_text()
  womens_locations <- womens_locations[1]

  womens_dates <- div %>%
    html_nodes(".sidearm-schedule-game-opponent-date span:nth-child(1)") %>%
    html_text()

  womens_times <- div %>%
    html_nodes(".sidearm-schedule-game-opponent-date span:nth-child(2)") %>%
    html_text()

  # Games without a result yet get an empty score instead of a missing element
  womens_scores <- div %>%
    html_nodes("div.sidearm-schedule-game-result span:nth-child(3)") %>%
    html_text()
  if(length(womens_scores)==0) womens_scores <- ""

  womens_win_loss <- div %>%
    html_nodes(".text-italic span:nth-child(2)") %>%
    html_text()
  womens_win_loss <- gsub("\\,", "",womens_win_loss)

  res <- c(date = womens_dates, time = womens_times, opponent = womens_opponents, location = womens_locations, score = womens_scores, win_loss = womens_win_loss)
  print(length(res))  # a value other than 6 flags a game with a missing or duplicated field
  df <- rbind(df,res)
}
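
As a complement to the loop, here is a minimal sketch of the same idea on a toy page, so it runs without the network. The HTML and the class names `.game`, `.opp` and `.score` are made up for illustration and are not the real gomason.com markup. It relies on the documented difference between `html_nodes()`, which drops games where the selector finds nothing, and `html_node()`, which returns exactly one result per input node and `NA` where nothing matches.

```r
library(rvest)

# Hypothetical markup: the second game has no score yet
page <- read_html('
<div class="game"><span class="opp">Team A</span><span class="score">70-60</span></div>
<div class="game"><span class="opp">Team B</span></div>
<div class="game"><span class="opp">Team C</span><span class="score">55-58</span></div>
')

games <- html_nodes(page, ".game")

# Page-wide selection silently skips the game without a score,
# which is how the column lengths drift apart
n_scores_global <- length(html_text(html_nodes(page, ".score")))

# Per-game selection keeps one entry per game, NA where the element is missing
schedule <- data.frame(
  opponent = html_text(html_node(games, ".opp")),
  score    = html_text(html_node(games, ".score")),
  stringsAsFactors = FALSE
)

print(n_scores_global)
print(schedule)
```

With this pattern every column has one entry per game by construction, so no explicit loop or length check is needed. Whether it fits the real page depends on its markup, which is not verified here.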

Hope this helps,

Gottavianoni