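I am trying to web-scrape the pages of Obama's speeches in order to create things like word clouds.
When I run the code for 1, 5, or 10 individual pages (speeches) without a loop, it works fine. But with the loop I created (in the code below), the resulting object contains nothing (NULL).
Can anyone help me?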
library(wordcloud)
library(tm)
library(XML)
library(RCurl)
site <- "http://obamaspeeches.com/"
url <- readLines(site)
h <- htmlTreeParse(file = url, asText = TRUE, useInternalNodes = TRUE,
                   encoding = "utf-8")
# getting the phrases that will form the web addresses for the speeches
teste <- data.frame(h[42:269, ])
teste2 <- teste[grep("href=", teste$h.42.269...), ]
teste2 <- as.data.frame(teste2)
teste3 <- gsub("^.*href=", "", teste2[, "teste2"])
teste3 <- as.data.frame(teste3)
teste4 <- gsub("^/", "", teste3[, "teste3"])
teste4 <- as.data.frame(teste4)
teste5 <- gsub(">.*$", "", teste4[, "teste4"])
teste5 <- as.data.frame(teste5)
# loop to read pages
l <- vector(mode = "list", length = nrow(teste5))
i <- 1
for (i in nrow(teste5)) {
  site <- paste("http://obamaspeeches.com/", teste5[i, ], sep = "")
  url <- readLines(site)
  l[[i]] <- url
  i <- i + 1
}
str(l)
Answer 0 (score: 1)
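The rvest package makes this fairly simple by handling both the fetching and the parsing, though it may require some knowledge of CSS or XPath selectors. This is much better than using regular expressions on HTML, which is discouraged in favor of a proper HTML parser (like rvest!).
If you are trying to scrape a bunch of subpages, you can build a vector of URLs and then lapply across it to fetch and parse each page. The advantage of this approach (over a for loop) is that it returns a list with one item per iteration, which will be much easier to deal with afterwards. If you want to go full Hadleyverse, you can use purrr::map instead, which lets you turn the whole thing into one big sequential chain.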
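The difference between the two looping styles can be seen on a toy vector. The sketch below uses made-up page names (`pages` and the `toupper` stand-in for the fetch step are assumptions for illustration, not part of the original code). It also suggests why a loop written as `for (i in nrow(teste5))` leaves most of the list as NULL: `nrow()` returns a single number, so the body runs only once, for the last index.

```r
# Hypothetical stand-ins for the scraped page names
pages <- c("a.htm", "b.htm", "c.htm")

# for (i in length(pages)) visits only i = 3, so slots 1 and 2 stay NULL
out_for <- vector(mode = "list", length = length(pages))
for (i in length(pages)) {
  out_for[[i]] <- toupper(pages[i])
}

# seq_along() yields 1, 2, 3, so every slot is filled
out_seq <- vector(mode = "list", length = length(pages))
for (i in seq_along(pages)) {
  out_seq[[i]] <- toupper(pages[i])
}

# lapply() needs no index or preallocation and returns one item per input
out_lapply <- lapply(pages, toupper)

str(out_for)
str(out_lapply)
```

With a data frame such as `teste5`, the equivalent index sequence would be `seq_len(nrow(teste5))`.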
library(rvest)
baseurl <- 'http://obamaspeeches.com/'
# For this website, get the HTML,
links <- baseurl %>% read_html() %>%
    # select <a> nodes that are descendants of <table> nodes that are aligned left,
    html_nodes(xpath = '//table[@align="left"]//a') %>%
    # and get the href (link) attribute of each node.
    html_attr('href')
# Loop across the links vector, applying a function that
speeches <- lapply(links, function(url){
    # pastes the URL to the base URL,
    paste0(baseurl, url) %>%
    # fetches the HTML for that page,
    read_html() %>%
    # selects <table> nodes with a width of 610,
    html_nodes(xpath = '//table[@width="610"]') %>%
    # gets the text, trimming whitespace on the ends,
    html_text(trim = TRUE) %>%
    # and breaks the text back into lines, trimming excess whitespace for each.
    textConnection() %>% readLines() %>% trimws()
})
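For the purrr::map variant mentioned above, a minimal sketch of the chain might look like the following. It runs offline: the `links` values and the `parse_speech` helper are hypothetical placeholders standing in for the scraped links and for the read_html / html_nodes / html_text steps, so only the shape of the pipeline is shown.

```r
library(purrr)

baseurl <- "http://obamaspeeches.com/"

# Hypothetical relative links standing in for the scraped `links` vector
links <- c("speech-one.htm", "speech-two.htm")

# Hypothetical parser: a placeholder for the real fetch-and-parse steps
# (read_html(), html_nodes(), html_text()) so the sketch needs no network
parse_speech <- function(url) {
  paste("text of", url)
}

speeches <- links %>%
  # build the full URL for each link,
  map_chr(~ paste0(baseurl, .x)) %>%
  # name each element by its own URL,
  set_names() %>%
  # and apply the parser, returning a named list with one item per page.
  map(parse_speech)

str(speeches)
```

Naming the elements by URL is a design choice rather than a requirement; it makes it easier to tell afterwards which speech each list item came from.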