Looping over URLs in R

Asked: 2016-02-04 22:24:11

Tags: r loops

I am trying to extract several pieces of data from 500+ URLs, all of which share the same structure: www.domain.com/something-else_uniqueID

The code I have tried is:

url <- c("www.domain.com/something-else_uniqueID",
         "www.domain.com/something-else_uniqueID2",
         "www.domain.com/something-else_uniqueID3")

lapply(url, function(x) {

data.frame(url=url, 
         category=category <- read_html(url) %>%
           html_nodes(xpath = '//*[@id="content-anchor"]/div[1]/div[2]/div[1]/span[2]/a') %>%
           html_text(),

         sub_category=sub_category <- read_html(url) %>%
           html_nodes(xpath = '//*[@id="content-anchor"]/div[1]/div[2]/div[1]/span[3]/a') %>%
           html_text(),

         section=section <- read_html(url) %>%
           html_nodes(xpath = '//*[@id="content-anchor"]/div[1]/div[2]/div[1]/span[4]/a') %>%
           html_text())

}) -> my_effort

write.csv(my_effort, "mydata.csv")
  1. RStudio returns, in red: Error: expecting a single value
  2. Since there are so many URLs, is there a more efficient way to build the vector than c()?
  3. Thanks very much for any help.

1 answer:

Answer 0 (score: 1)

The problem is that you use url inside the function, when you should use x, the element currently being iterated over. Because url is the whole vector, read_html() and data.frame() receive multiple values at once, which is what triggers the "expecting a single value" error.
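To see the difference: lapply() passes one element of the vector to the function as x on each call, while url inside the function still refers to the entire vector. A minimal sketch, with placeholder strings instead of real URLs:

```r
url <- c("a", "b", "c")

res <- lapply(url, function(x) {
  # x is a single element per call; url is always the full vector
  paste(x, "of", length(url))
})

res[[2]]  # "b of 3"
```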

Try:

url <- c("www.domain.com/something-else_uniqueID",
         "www.domain.com/something-else_uniqueID2",
         "www.domain.com/something-else_uniqueID3")

library(rvest)  # provides read_html(), html_nodes(), html_text(), and the %>% pipe

Reduce(function(...) merge(..., all = TRUE),
    lapply(url, function(x) {
        page <- read_html(x)  # fetch each page once instead of three times
        data.frame(url = x,
            category = page %>%
                    html_nodes(xpath = '//*[@id="content-anchor"]/div[1]/div[2]/div[1]/span[2]/a') %>%
                    html_text(),

            sub_category = page %>%
                    html_nodes(xpath = '//*[@id="content-anchor"]/div[1]/div[2]/div[1]/span[3]/a') %>%
                    html_text(),

            section = page %>%
                    html_nodes(xpath = '//*[@id="content-anchor"]/div[1]/div[2]/div[1]/span[4]/a') %>%
                    html_text())

    })) -> my_effort

write.csv(my_effort, "mydata.csv")
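On question 2: when the unique IDs follow a pattern, the URL vector can be generated rather than typed out with c(). A sketch assuming (hypothetically) that the IDs are the integers 1 through 500; if the real IDs are stored one per line in a file, readLines() slots in the same way:

```r
# Hypothetical: sequential integer IDs appended to a common prefix
ids <- 1:500
url <- paste0("www.domain.com/something-else_uniqueID", ids)

# Alternatively, if the IDs live in a text file, one per line:
# ids <- readLines("ids.txt")

length(url)  # 500
```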