Scraping information spread across multiple views

Date: 2018-01-05 14:57:08

Tags: r rvest rselenium

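I would like to scrape the ranking on the left-hand side of this page. The ranking is spread over 34 views and, as far as I can tell (I am a total newbie to scraping), it is rendered with JavaScript. All views share the same URL, so I cannot simply loop over URLs.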

As far as I can see, each view has a node #elferspielerhistorie_subcont_j td, with j starting at 0.

I can scrape the entries of the first view with:

library(rvest)
library(tidyverse)

elfer_url <- "http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/elfmeter-schuetzen-geschichte.html"

# first page
elfmeter <- read_html(elfer_url)
Schuetzen <- elfmeter %>% html_nodes("#elferspielerhistorie_subcont_0 td") %>% html_text()

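My "strategy" was then to use RSelenium to click the link to the next page and repeat the extraction there, building the node name with paste. However, the loop returns empty entries for the remaining 33 views (full code for completeness):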

library(rvest)
library(tidyverse)
library(RSelenium)

elfer_url <- "http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/elfmeter-schuetzen-geschichte.html"

rD <- rsDriver(port = 4444L, browser = "firefox")
remDr <- rD$client
remDr$navigate(elfer_url)

# first page
elfmeter <- read_html(elfer_url)
Schuetzen <- elfmeter %>% html_nodes("#elferspielerhistorie_subcont_0 td") %>% html_text() %>% matrix(ncol=10, byrow=T) %>% data.frame()

clicknext <- remDr$findElements("xpath","//*[@id='ctl00_PlaceHolderContent_elfer_blaettern_elferhistorie_PagerForward']")

j <- 1
while (j<=34){
  clicknext[[1]]$clickElement()     # sends me to the right view
  #elfmeter <- read_html(elfer_url) # switching this on or off does not change things
  current.node <- paste0("#elferspielerhistorie_subcont_",j," td") # should be the node
  weitere_Schuetzen <- elfmeter %>% html_node(current.node) %>% html_text() %>% matrix(ncol=10, byrow=T) %>% data.frame() # returns empty result
  Schuetzen <- rbind(Schuetzen,weitere_Schuetzen)

  j <- j+1
}

1 answer:

Answer 0 (score: 2)

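Since the views are generated dynamically, you have to fetch the page source from the browser on every iteration. The ID of the next button may also change, so it is safer to locate the button anew on each iteration.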
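The effect can be illustrated without a browser. The sketch below is only an illustration under stated assumptions: static_html and rendered_html are made-up strings (not the real kicker.de markup) standing in for what read_html(elfer_url) downloads and for what remDr$getPageSource() might return after one click, and count_cells is a hypothetical helper.

```r
library(rvest)

# Hypothetical stand-ins, NOT the real markup of the page:
# the downloaded page is assumed to contain only view 0 ...
static_html <- paste0(
  '<div id="elferspielerhistorie_subcont_0">',
  '<table><tr><td>A</td></tr></table></div>'
)
# ... while the browser's DOM after one click is assumed to also contain view 1
rendered_html <- paste0(
  static_html,
  '<div id="elferspielerhistorie_subcont_1">',
  '<table><tr><td>B</td></tr></table></div>'
)

# Hypothetical helper: number of <td> cells found for view j in an HTML string
count_cells <- function(html, j) {
  read_html(html) %>%
    html_nodes(paste0("#elferspielerhistorie_subcont_", j, " td")) %>%
    length()
}

count_cells(static_html, 1)    # view 1 is looked up in the downloaded copy
count_cells(rendered_html, 1)  # view 1 is looked up in the rendered source
```

If the real page behaves like these stand-ins, re-reading elfer_url inside the loop can never see the view that Selenium has just clicked to, which would explain the empty results in the question.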

The following code should work. Note that I also read in empty rows, which are discarded after the loop:

library(rvest)
library(tidyverse)
library(RSelenium)

elfer_url <- "http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/elfmeter-schuetzen-geschichte.html"

rD <- rsDriver(port = 4447L, browser = "firefox")
remDr <- rD$client
remDr$navigate(elfer_url)

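# Parse the table of view x from the browser's current page source
# (not from a fresh download of the URL)
getTable <- function(x) {
  remDr$getPageSource()[[1]] %>% 
    read_html() %>% 
    html_nodes(paste0("#elferspielerhistorie_subcont_", x, " table")) %>% 
    html_table(fill = TRUE) %>% 
    .[[1]] %>% 
    data.frame()
}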

# first page
data <- getTable(0)

for (j in 1:33) {
  # locate the button anew on every iteration, in case it changes after a click
  next_button <- remDr$findElements("css selector", "a[id=\"ctl00_PlaceHolderContent_elfer_blaettern_elferhistorie_PagerForward\"]") %>% .[[1]]
  remDr$executeScript(script = "arguments[0].scrollIntoView(true);", args = list(next_button))
  next_button$clickElement()
  # sometimes the loop is too fast and the table cannot be fetched yet, so pause here
  Sys.sleep(1)
  data <- rbind(data, getTable(j))
}
rD$server$stop()

# drop the empty rows; unlike -which(), %in% does not remove every row when nothing matches
data <- data[!(data$Spieler %in% ""), ]
dim(data)

> [1] 935  10
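
The fixed Sys.sleep(1) in the loop is a guess at how long the page needs. One possible alternative is to poll until a condition holds. The sketch below is not part of the original answer: wait_until is a hypothetical helper written for illustration, and its timeout and interval defaults are arbitrary.

```r
# Hypothetical helper: call ready() repeatedly until it returns TRUE
# or until `timeout` seconds have passed. Errors raised by ready() count as "not yet".
wait_until <- function(ready, timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(ready(), error = function(e) FALSE)
    if (isTRUE(ok)) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Self-contained usage example: the condition becomes TRUE on the third call
state <- new.env()
state$calls <- 0
wait_until(function() {
  state$calls <- state$calls + 1
  state$calls >= 3
}, timeout = 5, interval = 0.01)
```

In the scraping loop, the condition could be something like function() nrow(getTable(j)) > 0 in place of Sys.sleep(1), assuming the table for view j only becomes readable once the click has taken effect. That assumption has not been checked against the page.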