I have a data frame that looks like this:
country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", "http://en.wikipedia.org/wiki/Canada",
"http://en.wikipedia.org/wiki/Japan", "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)
  country                                        url
1  Canada http://en.wikipedia.org/wiki/United_States
2      US        http://en.wikipedia.org/wiki/Canada
3   Japan         http://en.wikipedia.org/wiki/Japan
4   China         http://en.wikipedia.org/wiki/China
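Using rvest, I want to scrape the table of contents (TOC) for each URL and bind the results into a single output.
This code extracts the TOC for a single URL: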
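library(rvest)
# read_html() parses one page; url[1] picks a single URL from the vector
toc <- read_html(url[1]) %>%
  html_nodes(".toctext") %>%  # nodes holding the TOC entry labels
  html_text()                 # keep only their text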
Desired output:
country  toc
US       Etymology
         History
         Native American and European contact
         Settlements
         ...
Canada   Etymology
         History
         Aboriginal peoples
         European colonization
         ...etc
Answer 0 (score: 5)
This gets them into one data frame (one row per TOC entry). The tedious but straightforward "print/output" code is left to the OP:
library(rvest)
library(dplyr)
country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States",
"http://en.wikipedia.org/wiki/Canada",
"http://en.wikipedia.org/wiki/Japan",
"http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)
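# Fetch every page and build one row per TOC entry
toc_entries <- bind_rows(lapply(url, function(x) {
  # x is the current URL, so each page contributes its own entries
  toc <- read_html(x) %>%
    html_nodes(".toctext") %>%
    html_text()
  # rep() keeps the data frame valid even when a page has no matches
  data.frame(url = rep(x, length(toc)), toc_entry = toc, stringsAsFactors = FALSE)
}))
# Attach the country column by matching on url
df <- toc_entries %>% left_join(df, by = "url")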
df[sample(nrow(df), 10),]
## Source: local data frame [10 x 3]
##
## url toc_entry country
## 1 http://en.wikipedia.org/wiki/Japan Government finance Japan
## 2 http://en.wikipedia.org/wiki/Canada Cold War and civil rights era US
## 3 http://en.wikipedia.org/wiki/United_States Food Canada
## 4 http://en.wikipedia.org/wiki/Japan Sports Japan
## 5 http://en.wikipedia.org/wiki/Canada Religion US
## 6 http://en.wikipedia.org/wiki/China Cold War and civil rights era China
## 7 http://en.wikipedia.org/wiki/Japan Literature, philosophy, and the arts Japan
## 8 http://en.wikipedia.org/wiki/United_States Population Canada
## 9 http://en.wikipedia.org/wiki/Japan Settlements Japan
## 10 http://en.wikipedia.org/wiki/Canada Military US
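For the "print/output" step left to the OP, something along these lines might do. It is only a rough sketch: `toc_sample` is a hand-made stand-in for the joined data frame (a few entries copied from the question's desired output, not scraped), and `format_toc()` is a hypothetical helper name, not part of rvest or dplyr. It assumes the rows are already grouped by country and blanks the country on every row except the first of each group.

```r
# Hand-made stand-in for the joined result (NOT scraped data); the entries
# are copied from the question's desired output.
toc_sample <- data.frame(
  country   = c("US", "US", "US", "Canada", "Canada"),
  toc_entry = c("Etymology", "History", "Settlements",
                "Etymology", "History"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: show each country only on the first row of its run.
# Assumes d has at least one row and is already ordered by country.
format_toc <- function(d) {
  country <- as.character(d$country)  # also handles factor columns
  n <- length(country)
  # TRUE where the country matches the one on the previous row
  repeated <- c(FALSE, country[-1] == country[-n])
  data.frame(
    country = ifelse(repeated, "", country),
    toc     = as.character(d$toc_entry),
    stringsAsFactors = FALSE
  )
}

formatted <- format_toc(toc_sample)
print(formatted, row.names = FALSE)
```

With the real data you would pass the joined `df` (sorted by country) instead of `toc_sample`; the helper only relies on the `country` and `toc_entry` columns.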