I have a data frame like this:
library(rvest)
library(urltools)
library(tm)
names <- c("bloomberg", "wilson athletics", "houston rockets", "gap inc")
# Placeholder columns, filled in by the loop below
website1 <- NA_character_
website2 <- NA_character_
df <- data.frame(names, website1, website2, stringsAsFactors = FALSE)
The following code uses each name as a search term and retrieves the first 2 URLs from the Google search results:
for (name in df$names[1:4]) {
  print(paste0("finding the url for:", name))
  Sys.sleep(20)  # pause between requests
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page1 <- read_html(url)
  # The displayed URLs sit in <cite> elements on the results page
  results <- page1 %>% html_nodes("cite") %>% html_text()
  result1 <- as.character(results[1])
  df[df$names == name, ]$website1 <- result1
  result2 <- as.character(results[2])
  df[df$names == name, ]$website2 <- result2
}
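My question is:
How can I generalize the above to the first 10 URLs without explicitly typing df$website1 ... df$website10 in the loop above? I know I can create the variables with the following: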
sites <- paste("website", 1:10, sep = ".")
na_vector <- rep(NA, nrow(df))
for (s in sites) {
  df[[s]] <- na_vector
}
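But I'm not sure how to incorporate that efficiently into the loop above. Any suggestions or comments would be greatly appreciated. Thanks.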
Answer 0 (score: 1)
You can clean up your function a bit and use more tidyverse functions to iterate over a list of names, pull out what you want, and then trim it down.
library(tidyverse)
library(rvest)
names <- c("bloomberg", "wilson athletics", "houston rockets", "gap inc")
df <- data.frame(names = names)
scrape <- function(name) {
  print(paste0("finding the url for:", name))
  Sys.sleep(2)
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  read_html(url) %>%
    html_nodes("cite") %>%
    html_text()
}
result <- mutate(df, result = map(names, scrape))
result %>%
  unnest(result) %>%                                  # unnest the list column result
  group_by(names) %>%                                 # group by names
  slice(1:2) %>%                                      # get the first two results per name
  mutate(website = sprintf("website%s", 1:2)) %>%     # add in website ids
  spread(website, result)                             # put website1 and website2 into columns
# A tibble: 4 x 3
# Groups: names [4]
  names            website1                   website2
* <fct>            <chr>                      <chr>
1 bloomberg        https://www.bloomberg.com/ Reuters
2 gap inc          www.gapinc.com/            https://en.wikipedia.org/wiki/Gap_Inc.
3 houston rockets  NBA.com                    USA TODAY
4 wilson athletics www.wilson.com/en-us       https://en.wikipedia.org/wiki/Wilson_Sporting_Goods
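To go from the first 2 results to the first 10 (or any n) without typing each column name, one option is to pad or truncate every result vector to a fixed length and then bind the vectors into columns. The following is only a minimal base R sketch of that idea, not part of the code above: `first_n` and `fake_results` are hypothetical names, and `fake_results` is made-up data standing in for what the scraping step would return, so the example runs without any network access.

```r
# Hypothetical helper: keep the first n entries of x, padding with NA if x is shorter.
first_n <- function(x, n = 10) {
  x <- as.character(x)
  length(x) <- n  # assigning a length truncates or pads with NA
  x
}

# Made-up stand-in for scraped results (assumption: one character vector per name).
fake_results <- list(
  "bloomberg" = c("a.example", "b.example", "c.example", "d.example"),
  "gap inc"   = c("e.example")
)

n <- 3

# vapply returns an n x length(fake_results) matrix; transpose to one row per name.
mat <- t(vapply(fake_results, first_n, character(n), n = n))
colnames(mat) <- paste0("website", seq_len(n))

wide <- data.frame(names = rownames(mat), mat,
                   row.names = NULL, stringsAsFactors = FALSE)
print(wide)
```

With real data, the list would presumably come from something like the `scrape` function applied to each name, and `n` would be set to 10. Names with fewer than n results end up with NA in the remaining columns instead of causing an error.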