Rvest scraping and looping

Time: 2018-12-08 02:34:13

Tags: r for-loop web-scraping rvest

I have reviewed several answers to similar SO questions on this topic, but none of them seem to work.

loop across multiple urls in r with rvest

Harvest (rvest) multiple HTML pages from a list of urls

I have a list of URLs, and I want to get the table from each URL and append it to a master data frame.

## get all urls into one list
page<- (0:2)
urls <- list()
for (i in 1:length(page)) {
  url<- paste0("https://www.mlssoccer.com/stats/season?page=",page[i])
  urls[[i]] <- url
}


### loop over the urls and get the table from each page
table<- data.frame()
for (j in urls) {
  tbl<- urls[j] %>% 
    read_html() %>% 
    html_node("table") %>%
    html_table()
  table[[j]] <- tbl
}

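The first part works as expected and gives me the list of URLs I want to scrape. The loop over the URLs then fails with the following error: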

 Error in UseMethod("read_xml") : 
  no applicable method for 'read_xml' applied to an object of class "list"

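Any suggestions on how to correct this error and loop the 3 tables into a single DF? I appreciate any tips or pointers.

2 answers:

Answer 0 (score: 2)

Here is your problem: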

for (j in urls) {
  tbl<- urls[j] %>% 

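With j in urls, the value of j is not an integer but the URL itself. As a result, urls[j] indexes the list by a character string and returns a list rather than a URL string, which is why read_html() complains about an object of class "list".

Try: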
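
The difference can be checked without downloading anything. This is a minimal sketch with a made-up list: `example_urls` and `bad` are placeholder names, not part of the original code.

```r
# Placeholder URLs; nothing is fetched here.
example_urls <- list("https://example.com/a", "https://example.com/b")

# Iterating over the list itself: j is each element (a character string).
for (j in example_urls) {
  print(class(j))            # "character"
}

# Indexing an unnamed list by a string returns a one-element list, not a URL.
bad <- example_urls["https://example.com/a"]
class(bad)                   # "list"

# Iterating over positions and extracting with [[ ]] gives the URL string.
for (j in seq_along(example_urls)) {
  print(example_urls[[j]])
}
```

Passing `bad` to read_html() would trigger the same "no applicable method" error, because it is a list rather than a character string.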

for (j in 1:length(urls)) {
  tbl<- urls[[j]] %>% 
    read_html() %>% 
    html_node("table") %>%
    html_table()
  table[[j]] <- tbl
}

You can also use seq_along():

for (j in seq_along(urls))
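
One reason seq_along() is often preferred over 1:length() is how the two behave on an empty list. A small sketch (`empty` and `runs` are illustrative names only):

```r
empty <- list()

1:length(empty)      # counts down: 1 0, so a loop body would run twice
seq_along(empty)     # integer(0), so a loop body would not run at all

# Count how many times the loop body executes for an empty list.
runs <- 0
for (j in seq_along(empty)) runs <- runs + 1
runs                 # 0
```

With three URLs both forms give 1, 2, 3, so the difference only matters when the list can be empty.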

Answer 1 (score: 1)

Try this:

library(tidyverse)
library(rvest)

page<- (0:2)
urls <- list()
for (i in 1:length(page)) {
  url<- paste0("https://www.mlssoccer.com/stats/season?page=",page[i])
  urls[[i]] <- url
}

### loop over the urls and get the table from each page
tbl <- list()
for (j in seq_along(urls)) {
  tbl[[j]] <- urls[[j]] %>%   # store the table from the j-th URL as element j of the tbl list
    read_html() %>% 
    html_node("table") %>%
    html_table()
  # no manual counter is needed: the for loop advances j by itself
}

#convert list to data frame
tbl <- do.call(rbind, tbl)
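
The table[[j]] <- tbl at the end of the for loop in the original code is unnecessary, because the table read from each URL is already assigned as an element of the tbl list: tbl[[j]] <- urls[[j]] %>% ...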
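
The same idea can also be written without an explicit loop. The following is only a sketch under assumptions: `fetch_table` and `scrape_all` are hypothetical helper names, and it assumes each page still exposes a table with identical columns, as in the question.

```r
# Build all page URLs at once; paste0() is vectorised over the page numbers.
urls <- paste0("https://www.mlssoccer.com/stats/season?page=", 0:2)

# Hypothetical helper: read one page and return its first table.
fetch_table <- function(url) {
  page <- rvest::read_html(url)
  rvest::html_table(rvest::html_node(page, "table"))
}

# Hypothetical helper: apply `fetch` to each URL and stack the results.
# rbind() requires every table to have the same columns.
scrape_all <- function(urls, fetch = fetch_table) {
  tables <- lapply(urls, fetch)
  do.call(rbind, tables)
}

# all_tables <- scrape_all(urls)   # needs network access and the rvest package
```

Passing the fetch function as an argument makes it possible to try the binding step with a stand-in function before hitting the real site.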