Web scraping in R (with a loop)

Date: 2016-06-23 21:55:53

Tags: r parsing web-scraping html-parsing

I'm trying to scrape Obama's speeches page to create things like word clouds. The code works fine when I run it for 1, 5, or 10 different pages (speeches) without a loop, but when I run the loop below, the resulting object contains nothing (NULL).

Can someone help me?

library(wordcloud)
library(tm)
library(XML)
library(RCurl)

site <- "http://obamaspeeches.com/"
url <- readLines(site)

h <- htmlTreeParse(file = url, asText = TRUE, useInternalNodes = TRUE, 
    encoding = "utf-8")

# getting the phrases that will form the web addresses for the speeches
teste <- data.frame(h[42:269, ])
teste2 <- teste[grep("href=", teste$h.42.269...), ]
teste2 <- as.data.frame(teste2)
teste3 <- gsub("^.*href=", "", teste2[, "teste2"])
teste3 <- as.data.frame(teste3)
teste4 <- gsub("^/", "", teste3[, "teste3"])
teste4 <- as.data.frame(teste4)
teste5 <- gsub(">.*$", "", teste4[, "teste4"])
teste5 <- as.data.frame(teste5)

# loop to read pages

l <- vector(mode = "list", length = nrow(teste5))
i <- 1
for (i in nrow(teste5)) {
    site <- paste("http://obamaspeeches.com/", teste5[i, ], sep = "")
    url <- readLines(site)
    l[[i]] <- url
    i <- i + 1
}

str(l)

1 answer:

Answer 0 (score: 1)

The rvest package makes this sort of scraping and parsing fairly simple, though it may require learning a bit about CSS or XPath selectors. It is also a much better approach than running regular expressions over HTML, which is discouraged in favor of proper HTML parsers (like rvest!).
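
As an aside, the immediate reason the posted loop comes back (almost) empty is that for (i in nrow(teste5)) iterates over the single value nrow(teste5) rather than the sequence 1:nrow(teste5), so only the last list element ever gets filled (and the manual i <- i + 1 has no effect on a for loop's index). A minimal fix that keeps the original readLines approach would be:

l <- vector(mode = "list", length = nrow(teste5))
for (i in seq_len(nrow(teste5))) {
    # build each speech URL and store that page's lines in the ith slot
    site <- paste0("http://obamaspeeches.com/", teste5[i, ])
    l[[i]] <- readLines(site)
}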

If you're trying to scrape a set of subpages, you can create a vector of their URLs and then lapply across it, scraping and parsing each page. The advantage of this approach over a for loop is that it returns a list with an item for each iteration, which is much easier to deal with afterwards. If you want to go full Hadleyverse, you can use purrr::map instead (sketched after the code below), which lets you turn the whole thing into one big sequential chain.

library(rvest)

baseurl <- 'http://obamaspeeches.com/' 

         # For this website, get the HTML,
links <- baseurl %>% read_html() %>% 
    # select <a> nodes that are children of <table> nodes that are aligned left,
    html_nodes(xpath = '//table[@align="left"]//a') %>% 
    # and get the href (link) attribute of that node.
    html_attr('href')

            # Loop across the links vector, applying a function that
speeches <- lapply(links, function(url){
    # pastes the URL onto the base URL,
    paste0(baseurl, url) %>% 
    # fetches the HTML for that page,
    read_html() %>% 
    # selects <table> nodes with a width of 610,
    html_nodes(xpath = '//table[@width="610"]') %>% 
    # get the text, trimming whitespace on the ends,
    html_text(trim = TRUE) %>% 
    # and break the text back into lines, trimming excess whitespace for each.
    textConnection() %>% readLines() %>% trimws()
})
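
For reference, a minimal sketch of the purrr::map version mentioned above (assuming the purrr package is installed; the selectors are unchanged from the lapply version):

library(purrr)

# Same scrape as the lapply version, written with purrr's formula shorthand
speeches <- map(links, ~ paste0(baseurl, .x) %>% 
    read_html() %>% 
    html_nodes(xpath = '//table[@width="610"]') %>% 
    html_text(trim = TRUE) %>% 
    textConnection() %>% 
    readLines() %>% 
    trimws())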
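
And to get from speeches to the word cloud the question is after, one common pattern with the tm and wordcloud packages (already loaded in the question) looks like this; the cleaning steps here are illustrative assumptions, not part of the original answer:

library(tm)
library(wordcloud)

# Collapse each speech into a single document and build a corpus
corpus <- Corpus(VectorSource(sapply(speeches, paste, collapse = " ")))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Count term frequencies and plot the most common words
tdm <- TermDocumentMatrix(corpus)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freqs), freqs, max.words = 100, random.order = FALSE)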