Scraping Google search results with rvest

Date: 2018-01-17 17:48:48

Tags: r for-loop url web-scraping rvest

I have a data frame like this:

library(rvest)
library(urltools)
library(tm)

names <- c("bloomberg", "wilson athletics", "houston rockets", "gap inc")
website1 <- "NA"
website2 <- "NA"

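# stringsAsFactors = FALSE keeps the columns as character, so URLs can be assigned later
df <- data.frame(names, website1, website2, stringsAsFactors = FALSE)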

The following code uses each name as a search term and retrieves the first 2 URLs from the Google search results:

for (name in df$names[1:4]) {
  print(paste0("finding the url for:", name))
  Sys.sleep(20)  # pause between requests

  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page1 <- read_html(url)
  results <- page1 %>% html_nodes("cite") %>% html_text()
  result1 <- as.character(results[1])
  df[df$names == name, ]$website1 <- result1
  result2 <- as.character(results[2])
  df[df$names == name, ]$website2 <- result2
}

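My question is:

How can I generalize the above to the first 10 URLs without explicitly typing df$website1 ... df$website10 in the loop above? I know I can create the variables with:

sites <- paste("website", 1:10, sep = ".")
na_vector <- rep(NA, nrow(df))
for (s in sites) { df[[s]] <- na_vector }

But I am not sure how to incorporate this efficiently into the loop above. Any suggestions or comments would be greatly appreciated. Thanks.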
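
One way the pre-created columns could be filled inside the loop is to assign a whole row at once instead of one column at a time. The following is only a minimal base R sketch under stated assumptions: `fake_results` is made-up data standing in for the `html_text()` output so that nothing is fetched, and `n_sites` is a name introduced here for illustration.

```r
# Hypothetical stand-in for the scraped results: one character vector per name.
fake_results <- list(
  "bloomberg" = c("https://www.bloomberg.com/", "Reuters", "example.org"),
  "gap inc"   = c("www.gapinc.com/")
)

n_sites <- 10  # assumed number of result columns to keep
sites <- paste("website", seq_len(n_sites), sep = ".")

df <- data.frame(names = names(fake_results), stringsAsFactors = FALSE)
df[sites] <- NA_character_  # create website.1 ... website.10 in one step

for (i in seq_len(nrow(df))) {
  results <- fake_results[[df$names[i]]]
  # Indexing past the end of a vector yields NA, so short result sets are padded.
  df[i, sites] <- as.list(results[seq_len(n_sites)])
}
```

In a real run, `results` would come from the `read_html()` pipeline shown in the loop above rather than from `fake_results`.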

1 Answer:

Answer 0 (score: 1)

You can clean up your function a bit and use more tidyverse functions to iterate over a list of names, pull out what you want, and then trim it down.

library(tidyverse)
library(rvest)

names <- c("bloomberg", "wilson athletics", "houston rockets", "gap inc")

df <- data.frame(names = names)

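scrape <- function(name) {
  print(paste0("finding the url for:", name))
  Sys.sleep(2)  # pause between requests

  # Build the search URL; URLencode() escapes the spaces in the name
  url <- URLencode(paste0("https://www.google.com/search?q=", name))

  # Return the text of every <cite> node as a character vector
  read_html(url) %>% 
    html_nodes("cite") %>% 
    html_text()
}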

result <- mutate(df, result = map(names, scrape)) 

result %>% 
  unnest(result) %>%      # unnest the list column result (newer tidyr warns if the column is not named)
  group_by(names) %>%     # group by names
  slice(1:2) %>%          # keep the first two results per name
  mutate(website = sprintf("website%s", row_number())) %>% # add website ids; also works if a name has fewer than 2 results
  spread(website, result) # put website1 and website2 into columns

# A tibble: 4 x 3
# Groups:   names [4]
  names            website1                   website2                                           
* <fct>            <chr>                      <chr>                                              
1 bloomberg        https://www.bloomberg.com/ Reuters                                            
2 gap inc          www.gapinc.com/            https://en.wikipedia.org/wiki/Gap_Inc.             
3 houston rockets  NBA.com                    USA TODAY                                          
4 wilson athletics www.wilson.com/en-us       https://en.wikipedia.org/wiki/Wilson_Sporting_Goods
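
To go from the first two results to the first 10, as the question asks, the same pipeline can be parameterised. The following is a hedged sketch rather than part of the original answer: it assumes dplyr >= 1.0 and tidyr >= 1.0 (for `slice_head()` and `pivot_wider()`, used here in place of `spread()`), `n_sites` is a name introduced for illustration, and the `result` tibble is made-up data standing in for the output of `map(names, scrape)` so the example runs without contacting Google.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the list column that map(names, scrape) would produce.
result <- tibble(
  names  = c("bloomberg", "gap inc"),
  result = list(
    c("https://www.bloomberg.com/", "Reuters", "example.org"),
    c("www.gapinc.com/")
  )
)

n_sites <- 10  # assumed number of results to keep per name

wide <- result %>%
  unnest(result) %>%                                      # one row per (name, result)
  group_by(names) %>%
  slice_head(n = n_sites) %>%                             # keep at most n_sites rows per name
  mutate(website = paste0("website", row_number())) %>%   # website1, website2, ...
  ungroup() %>%
  pivot_wider(names_from = website, values_from = result) # one column per website id
```

Note that this only creates as many website columns as the longest result set actually contains, and names with fewer results get NA in the remaining columns.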