我有一个数据框pubs
,其中有两列:url
,html.node
。我想编写一个循环,以读取每个url来检索html内容,并提取html.node
列所指示的信息,并将其累积在数据帧或列表中。
所有URL都不同,所有html节点都不同。
到目前为止,我的代码是:
score <- vector()
k <- 1
for (r in 1:nrow(pubs)){
art.url <- pubs[r, 1] # column 1 contains URL
art.node <- pubs[r, 2] # column 2 contains html nodes as charcters
art.contents <- read_html(art.url)
score <- art.contents %>% html_nodes(art.node) %>% html_text()
k<-k+1
print(score)
}
感谢您的帮助。
答案 0 :(得分:2)
首先,请确保要刮取的每个站点都允许您刮取数据,如果您违反某些规则,可能会引起法律问题。
(注意,由于您未提供数据,我只使用了http://toscrape.com/,这是一个用于抓取的沙盒网站)
之后,您可以继续进行下去,希望对您有所帮助:
# first, your data I think they're similar to this
pubs <- data.frame(site = c("http://quotes.toscrape.com/",
"http://quotes.toscrape.com/"),
html.node = c(".text",".author"), stringsAsFactors = F)
然后您需要循环:
library(rvest)
# an empty list, to fill with the scraped data
empty_list <- list()
# here you are going to fill the list with the scraped data
for (i in 1:nrow(pubs)){
art.url <- pubs[i, 1] # choose the site as you did
art.node <- pubs[i, 2] # choose the node as you did
# scrape it!
empty_list[[i]] <- read_html(art.url) %>% html_nodes(art.node) %>% html_text()
}
现在结果是一个列表,但是带有:
names(empty_list) <- pubs$site
您将在列表的每个元素中添加站点名称,结果为:
$`http://quotes.toscrape.com/`
[1] "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"
[2] "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"
[3] "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
[4] "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"
[5] "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"
[6] "“Try not to become a man of success. Rather become a man of value.”"
[7] "“It is better to be hated for what you are than to be loved for what you are not.”"
[8] "“I have not failed. I've just found 10,000 ways that won't work.”"
[9] "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"
[10] "“A day without sunshine is like, you know, night.”"
$`http://quotes.toscrape.com/`
[1] "Albert Einstein" "J.K. Rowling" "Albert Einstein" "Jane Austen" "Marilyn Monroe" "Albert Einstein" "André Gide"
[8] "Thomas A. Edison" "Eleanor Roosevelt" "Steve Martin"
显然,它应该在不同的站点和不同的节点上工作。
答案 1 :(得分:1)
您还可以使用map
包中的purrr
而不是循环:
expand.grid(c("http://quotes.toscrape.com/", "http://quotes.toscrape.com/tag/inspirational/"), # vector of urls
c(".text",".author"), # vector of nodes
stringsAsFactors = FALSE) %>% # assuming that the same nodes are relevant for all urls, otherwise you would have to do something like join
as_tibble() %>%
set_names(c("url", "node")) %>%
mutate(out = map2(url, node, ~ read_html(.x) %>% html_nodes(.y) %>% html_text())) %>%
unnest()