Question

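I have a data.frame with two variables, one of which contains only URLs. I would like to scrape all of these URLs, extract the relevant text from each one, and add that text to the data frame as a new variable so that it is ready for text analysis.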

FAO_CountryName     FAO_CountryURL
Algeria             http://www.fao.org/giews/countrybrief/country.jsp?code=DZA
Egypt               http://www.fao.org/giews/countrybrief/country.jsp?code=EGY

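In other words, I want to find a way to treat these URLs as proper HTML pages that I can scrape with the read_html() command.

The idea is to end up with something like this at the end of the process: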

FAO_CountryName     FAO_CountryURL                                                 FAOText
Algeria             http://www.fao.org/giews/countrybrief/country.jsp?code=DZA     Algeria is an interesting country
Egypt               http://www.fao.org/giews/countrybrief/country.jsp?code=EGY     Egypt is interesting as well but in a different way

Answer 1

We first define a function that gets the information we want from a specific URL:

library(rvest)
# Read the page at URL x and return the text of its first '.Normal' node
scrapeFAO <- function(x) {
    as.character(x) %>% 
        read_html() %>% 
        html_nodes('.Normal') %>% 
        .[1] %>% 
        html_text()
}

scrapeFAO("http://www.fao.org/giews/countrybrief/country.jsp?code=DZA")
# [1] "Reference Date: 24-November-2016"

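This is only an example; we can collect any element we need from that page. Read more about the rvest package to see what else is possible here.

Then we apply this function to every row and cbind the results to the initial data frame: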
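As a rough sketch of that idea, the variant below keeps every '.Normal' node instead of only the first one and pastes them into a single string per page. The function name scrapeFAOAll and the inline sample HTML are assumptions made for illustration; the markup of the real FAO page may differ.

```r
library(rvest)

# Hypothetical variant of scrapeFAO: keep every '.Normal' node, not only the first
scrapeFAOAll <- function(x) {
    page <- read_html(as.character(x))
    paragraphs <- html_text(html_nodes(page, '.Normal'), trim = TRUE)
    # Collapse the paragraphs into one string so there is a single value per URL
    paste(paragraphs, collapse = " ")
}

# Offline check on a literal HTML string (made-up content, not the real FAO page)
sample_page <- '<html><body><p class="Normal">Reference Date: 24-November-2016</p><p class="Normal">Cereal production was above average.</p></body></html>'
scrapeFAOAll(sample_page)
```

Passing a URL instead of the sample string should work the same way, since read_html() accepts both.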

final <- cbind(mydf, FAOText = apply(mydf[2], 1, scrapeFAO))

Note that there may be more efficient ways to do this.

Hope this helps.
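One possible alternative, sketched here under assumptions rather than offered as the definitive approach: vapply() guarantees one character value per URL, and tryCatch() turns a failed request into NA instead of aborting the whole run. The helper name scrapeAll and its scraper argument are hypothetical and not part of rvest.

```r
# Hypothetical helper: apply a scraping function to each URL, returning NA on failure
scrapeAll <- function(urls, scraper) {
    vapply(as.character(urls), function(u) {
        out <- tryCatch(scraper(u), error = function(e) NA_character_)
        # Guard against results that are not exactly one value (e.g. no matching node)
        if (length(out) == 1) as.character(out) else NA_character_
    }, character(1), USE.NAMES = FALSE)
}

# Intended use with the objects defined above (requires network access):
# mydf$FAOText <- scrapeAll(mydf$FAO_CountryURL, scrapeFAO)
```

Assigning the result to mydf$FAOText also gives the new column the name used in the question.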
