Scraping in R with a for loop: iterating over the letter range A to Z

Date: 2019-02-07 22:37:19

Tags: r rvest

I'm trying to scrape this year's SXSW speakers: https://schedule.sxsw.com/2019/speakers/alpha/A

The link ends with an A, but the listing runs through Z (e.g., put B, C, etc. at the end of the link instead).

Here is my attempt:

library(RCurl)
library(httr)
library(rvest)
library(tidyverse)

sxsw <- 'https://schedule.sxsw.com/2019/speakers/alpha/A'
page <- read_html(sxsw)

for (i in length(LETTERS)) {

    sxsw <- paste0('https://schedule.sxsw.com/2019/speakers/alpha/',  LETTERS[i])

    names <- page %>% 
     html_nodes(".px1 a") %>% 
     html_text()

}

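I'm just trying to append the whole range so that it returns all the speaker names. If I take the names vector out of the loop and run it on its own, all the A names come up. I think this is a quick fix, and I suspect it has to do with how I'm using LETTERS. Thanks.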

2 answers:

Answer 0: (score: 0)

This should do the trick...

library(tidyverse)
library(rvest)

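# Build one URL per letter, scrape each page, then flatten to one row per speaker
tibble(
  url = paste0('https://schedule.sxsw.com/2019/speakers/alpha/', LETTERS)
) %>%
  mutate(
    names = map(url, read_html),               # download and parse each page
    names = map(names, html_nodes, ".px1 a"),  # select the speaker links
    names = map(names, html_text)              # extract the link text
  ) %>%
  unnest(names)                                # name the list column explicitly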
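
To see what the html_nodes / html_text step does without hitting the network, here is a small sketch run against an inline HTML string. The markup and the two speaker names are made up for illustration; the real SXSW page structure is only assumed to match the ".px1 a" selector used above.

```r
library(rvest)

# Hypothetical markup standing in for one speaker listing page
html <- '<div class="px1"><a href="/s/1">Ada Example</a></div>
         <div class="px1"><a href="/s/2">Alan Sample</a></div>
         <div class="other"><a href="/x">Not a speaker</a></div>'

page <- read_html(html)

# Keep only <a> elements nested inside elements with class "px1"
speaker_names <- page %>%
  html_nodes(".px1 a") %>%
  html_text()

print(speaker_names)
```

The link inside the "other" div is skipped because the selector only matches anchors that sit under a .px1 element.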


Answer 1: (score: 0)

Here is a version using lapply. I'd suggest avoiding explicit loops in R.

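library(rvest)  # rvest also provides the %>% pipe

base_url <- "https://schedule.sxsw.com/2019/speakers/alpha/"

# LETTERS is the built-in vector "A".."Z", so there is no need to upper-case letters.
# The result is a list with one character vector of speaker names per letter.
sxsw <- lapply(LETTERS, function(x) {
  read_html(paste0(base_url, x)) %>%
    html_nodes(".px1 a") %>%
    html_text()
})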
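
For reference, the loop in the question most likely misbehaves because length(LETTERS) is the single number 26, so the body runs once with i = 26 instead of once per letter. The page is also read before the loop, so the same "A" page is parsed regardless of i. The sketch below shows only the indexing difference and does not touch the network; the variable names visited and urls are illustrative.

```r
# length(LETTERS) is 26, so this loop runs exactly once, with i = 26
visited <- character(0)
for (i in length(LETTERS)) {
  visited <- c(visited, LETTERS[i])
}

# seq_along(LETTERS) is 1, 2, ..., 26, so every letter is visited
urls <- character(length(LETTERS))
for (i in seq_along(LETTERS)) {
  urls[i] <- paste0("https://schedule.sxsw.com/2019/speakers/alpha/", LETTERS[i])
}

print(visited)
print(head(urls, 3))
```

If a for loop is kept, the read_html call would also need to move inside the loop and use the URL built for the current letter, with the results collected into a list rather than overwritten on each pass.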