Question

I created this data frame with rvest:

## 00. Load rvest (also provides %>%) and pick the webpage
library(rvest)
FAO_Countries <- read_html("http://www.fao.org/countryprofiles/en/")


## 01. Get the urls I am interested in
FAO_Countries_urls <- FAO_Countries %>%
  html_nodes(".linkcountry") %>%
  html_attr("href")


## 02. Get the link text (the country names) I am interested in
FAO_Countries_links <- FAO_Countries %>%
  html_nodes(".linkcountry") %>%
  html_text()


## 03. Base url used to build the complete urls
url <- "http://www.fao.org"


## 04. Create a dataframe
FAO_Countries_data <- data.frame(FAO_Countries_links = FAO_Countries_links,
                                 FAO_Countries_urls = paste0(url, FAO_Countries_urls),
                                 stringsAsFactors = FALSE)

OK. So far, this is what I get:

> head(FAO_Countries_data, n=4)
  FAO_Countries_links  FAO_Countries_urls
1 Afghanistan          http://www.fao.org/countryprofiles/en/?iso3=AFG
2 Albania              http://www.fao.org/countryprofiles/en/?iso3=ALB
3 Algeria              http://www.fao.org/countryprofiles/en/?iso3=DZA
4 Andorra              http://www.fao.org/countryprofiles/en/?iso3=AND

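And so on. Each URL points to a page that shares common elements with the others; in fact, every country is presented in the same way. I am interested in the "Food security" section. To collect the information, I would create some new variables; the most urgent one is the "Food security" text captured from each URL. In the end, the data frame should look like this: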

> head(FAO_Countries_data, n=4)
  FAO_Countries_links  FAO_Countries_urls                     Food_Sec
1 Afghanistan          http://www.fao.org/countryprofiles...  BLABLA

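I think I should be able to take the URLs stored in the data frame variable "FAO_Countries_urls", treat them as real URLs, and then scrape each one. Or perhaps there is another way. Thanks in advance for any suggestions.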

P.S.: Yes, I am using 'Selectorgadget'.
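
The idea described above (read every URL in "FAO_Countries_urls" and keep one piece of text per page) could be sketched roughly as follows. This is only a sketch under assumptions: the selector "#foodsecurity" is a made-up placeholder, not taken from the FAO pages, and the helper names extract_section and scrape_section are hypothetical. The real selector would have to come from inspecting a country page, for example with Selectorgadget.

```r
library(rvest)

# ASSUMPTION: placeholder CSS selector, not verified against the FAO site.
# Replace it with the selector that matches the "Food security" block.
food_sec_css <- "#foodsecurity"

# Hypothetical helper: return the text matched by `css` in an
# already-parsed page, or NA when nothing matches.
extract_section <- function(page, css) {
  nodes <- html_nodes(page, css)
  if (length(nodes) == 0) return(NA_character_)
  paste(html_text(nodes, trim = TRUE), collapse = " ")
}

# Hypothetical helper: download one URL and extract the section.
# A failed request yields NA instead of aborting the whole loop.
scrape_section <- function(url, css) {
  tryCatch(
    extract_section(read_html(url), css),
    error = function(e) NA_character_
  )
}

# Offline check on a literal HTML string (no network needed)
demo_page <- read_html(
  '<html><body><div id="foodsecurity"> Some text </div></body></html>'
)
demo_text <- extract_section(demo_page, food_sec_css)

# Real use (one request per country, needs network access):
# FAO_Countries_data$Food_Sec <- vapply(
#   FAO_Countries_data$FAO_Countries_urls,
#   scrape_section, character(1), css = food_sec_css,
#   USE.NAMES = FALSE
# )
```

Because the character column is passed straight to read_html() one element at a time, no conversion to "real" URLs is needed. Adding a short Sys.sleep() between requests may be a good idea when looping over a few hundred pages.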

rvest - scraping links from a web page

0 answers: