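I have a problem similar to "Scraping a web page, links on a page, and forming a table with R". I would have posted this as a comment on that thread, but I don't have enough reputation yet.
I have the following code: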
library(rvest)

## Import web page
FAO_Countries <- read_html("http://www.fao.org/countryprofiles/en/")
## Extract the URLs I am interested in (selector found with 'selectorgadget')
FAO_Countries_urls <- FAO_Countries %>%
  html_nodes(".linkcountry") %>%
  html_attr("href")
## Extract the link text (country names) with the same selector
FAO_Countries_links <- FAO_Countries %>%
  html_nodes(".linkcountry") %>%
  html_text()
## Create a data frame from the two previous objects
FAO_Countries_data <- data.frame(FAO_Countries_links = FAO_Countries_links,
                                 FAO_Countries_urls = FAO_Countries_urls,
                                 stringsAsFactors = FALSE)
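At this point, I want to fetch text from each of the URLs I obtained and add it as a new column on the right, and then do the same for other content I need. However, when I run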
FAO_Countries_data_text <- FAO_Countries_data$FAO_Countries_urls %>%
html_nodes("#foodSecurity-1") %>%
html_text()
I get the following error message:
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character"
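In other words, I cannot scrape the pages behind the links stored in my newly created data frame.
Right now, my data frame looks like this: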
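The error message suggests that html_nodes() expects a parsed document rather than a character vector of URLs. Here is a minimal sketch of the difference, using an inline HTML string as a stand-in for a real country page (the markup is made up for illustration and is not the actual FAO HTML):

```r
library(rvest)

# Stand-in for one country page; this markup is an assumption for illustration
page_source <- '<html><body><div id="foodSecurity-1">Family farming</div></body></html>'

# A character string is not a parsed document, which is why
# html_nodes() has no applicable method for it
class(page_source)

# read_html() parses a string (or a URL) into an xml_document
page <- read_html(page_source)
class(page)

# With a parsed document, the selector can be applied
food_security <- page %>%
  html_nodes("#foodSecurity-1") %>%
  html_text()
food_security
```

If that reading is right, each URL in the data frame would need to go through read_html() before html_nodes() can be used on it.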
> head(FAO_Countries_data, n=3)
FAO_Countries_links FAO_Countries_urls
1 Afghanistan /countryprofiles/index/en/?iso3=AFG
2 Albania /countryprofiles/index/en/?iso3=ALB
3 Algeria /countryprofiles/index/en/?iso3=DZA
I would like to extend this data frame by adding columns that contain information found on each of those URLs. For example:
FAO_Countries_links FAO_Countries_urls Food_security
1 Afghanistan /countryprofiles/index/en/?iso3=AFG Family farming
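One way the extra column might be built is sketched below. It rests on several assumptions: that the relative hrefs resolve against http://www.fao.org, that "#foodSecurity-1" is present in the static HTML of each country page (not injected by JavaScript), and that the first matching node is the one wanted. The names extract_text, add_food_security and base_url are hypothetical helpers introduced only for this sketch.

```r
library(rvest)

# Hypothetical helper: parse one page (a URL or an HTML string) and return the
# trimmed text of the first node matching `css`; NA if the page cannot be read
# or nothing matches
extract_text <- function(source, css) {
  page <- tryCatch(read_html(source), error = function(e) NULL)
  if (is.null(page)) return(NA_character_)
  nodes <- html_nodes(page, css)
  if (length(nodes) == 0) return(NA_character_)
  trimws(html_text(nodes[[1]]))
}

# Assumption: the relative links are rooted at the site's base URL
base_url <- "http://www.fao.org"

# Hypothetical helper: turn the relative URLs into absolute ones, fetch each
# page, and store the extracted text in a new Food_security column
add_food_security <- function(data, fetch = extract_text) {
  full_urls <- xml2::url_absolute(data$FAO_Countries_urls, base_url)
  data$Food_security <- vapply(full_urls, fetch, character(1),
                               css = "#foodSecurity-1", USE.NAMES = FALSE)
  data
}
```

Calling add_food_security(FAO_Countries_data) would then request every country page in turn, so it may be worth trying it on head(FAO_Countries_data, n = 3) first. The fetch argument is there only so the page-reading step can be swapped out, for example when experimenting without network access.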