R web scraping - links and URLs

Asked: 2016-12-07 11:31:41

Tags: r url web-scraping

I have a problem similar to "Scraping a web page, links on a page, and forming a table with R". I would have posted this as a comment on that thread, but I don't have enough reputation yet.

I have the following code:

## Load rvest (provides read_html(), html_nodes(), html_attr(), html_text() and %>%)
library(rvest)

## Import web page
FAO_Countries <- read_html("http://www.fao.org/countryprofiles/en/")

## Extract the URLs (href attributes) I am interested in, selector found with 'selectorgadget'
FAO_Countries_urls <- FAO_Countries %>%
  html_nodes(".linkcountry") %>%
  html_attr("href")

## Extract the link text (country names) with the same 'selectorgadget' selector
FAO_Countries_links <- FAO_Countries %>%
  html_nodes(".linkcountry") %>%
  html_text()

## Create a data frame from the two previous objects
FAO_Countries_data <- data.frame(FAO_Countries_links = FAO_Countries_links,
                                 FAO_Countries_urls = FAO_Countries_urls,
                                 stringsAsFactors = FALSE)

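At this point, I want to fetch text from each of the URLs I obtained and add it as a new column on the right, and then do the same for other pieces of content I need. However, when I run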

FAO_Countries_data_text <- FAO_Countries_data$FAO_Countries_urls %>%
html_nodes("#foodSecurity-1") %>%
html_text()

I get the following error message:

Error in UseMethod("xml_find_all") : 
no applicable method for 'xml_find_all' applied to an object of class "character"

In other words, I cannot follow the links stored in my newly created data frame.
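The error message suggests that html_nodes() received a character vector (the URL strings) where it expects a parsed document. A minimal offline sketch of the difference follows. It assumes the hrefs are relative to the page they were scraped from, and it uses an inline HTML snippet as a stand-in for a downloaded country page; the real page structure may differ.

```r
library(rvest)
library(xml2)

# A relative href like the ones stored in FAO_Countries_urls
rel_url <- "/countryprofiles/index/en/?iso3=AFG"

# Resolve it against the page it was scraped from to get a full URL
abs_url <- url_absolute(rel_url, base = "http://www.fao.org/countryprofiles/en/")

# html_nodes() needs a parsed document, so the page has to go through
# read_html() first. The inline snippet below is an assumed stand-in for
# what read_html(abs_url) would return.
page <- read_html('<html><body><div id="foodSecurity-1">Family farming</div></body></html>')

txt <- page %>%
  html_nodes("#foodSecurity-1") %>%
  html_text()
```

With network access, replacing the inline snippet with read_html(abs_url) would be the corresponding step for a real country page.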

Right now, my data frame looks like this:

> head(FAO_Countries_data, n=3)
  FAO_Countries_links                  FAO_Countries_urls
1         Afghanistan /countryprofiles/index/en/?iso3=AFG
2             Albania /countryprofiles/index/en/?iso3=ALB
3             Algeria /countryprofiles/index/en/?iso3=DZA

I would like to extend this data frame by adding columns that contain information found at each of the URLs. For example:

  FAO_Countries_links                  FAO_Countries_urls  Food_security
1         Afghanistan /countryprofiles/index/en/?iso3=AFG Family farming
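One possible way to build such a column is to visit each URL in turn and keep the text of the matching node. The sketch below is only an illustration under stated assumptions: get_text() is a hypothetical helper, not part of rvest, and two temporary local HTML files stand in for the country pages so that the example runs without network access.

```r
library(rvest)

# Hypothetical helper: read one page and return the text of the first node
# matching `css`, or NA if the page cannot be read or nothing matches.
get_text <- function(url, css) {
  tryCatch({
    nodes <- read_html(url) %>% html_nodes(css)
    if (length(nodes) == 0) NA_character_ else html_text(nodes[[1]], trim = TRUE)
  }, error = function(e) NA_character_)
}

# Offline stand-ins for two country pages (assumed content, not real FAO data)
pages <- c(tempfile(fileext = ".html"), tempfile(fileext = ".html"))
writeLines('<html><body><div id="foodSecurity-1"> Family farming </div></body></html>', pages[1])
writeLines('<html><body><p>no such section</p></body></html>', pages[2])

demo <- data.frame(FAO_Countries_links = c("Afghanistan", "Albania"),
                   FAO_Countries_urls = pages,
                   stringsAsFactors = FALSE)

# One request per row; the result becomes the new right-hand column
demo$Food_security <- vapply(demo$FAO_Countries_urls, get_text, character(1),
                             css = "#foodSecurity-1", USE.NAMES = FALSE)
```

Applied to the real data frame, the relative hrefs would first need to be turned into full URLs (for example with xml2::url_absolute()), and a short pause between requests may be advisable.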

0 answers:

No answers yet.