I have installed these packages: rvest stringr tidyr data.table plyr xml2 selectr tibble purrr datapasta jsonlite
I scraped the FAO website for research and ended up with this:
FAO_AreaName <-"MEX"
news_url <- paste0("http://www.fao.org/countryprofiles/common/allnews/en/?iso3=",FAO_AreaName,"&allnews=no&limit=2")
news<- fromJSON(news_url)
title <- news[3]
date <- news[6]
FAO_AreaName_1 <- news[5]
content_MEX <- news[5]
MEX <- cbind(FAO_AreaName, FAO_AreaName_1, date,title, content_MEX)
FAO_AreaName <-"FSM"
news_url <- paste0("http://www.fao.org/countryprofiles/common/allnews/en/?iso3=",FAO_AreaName,"&allnews=no&limit=2")
news<- fromJSON(news_url)
title <- news[3]
date <- news[6]
FAO_AreaName_1 <- news[5]
content_FSM <- news[5]
FSM <- cbind(FAO_AreaName, FAO_AreaName_1, date,title, content_FSM)
When I merge the two datasets, I get this:
MERGE <- merge(MEX, FSM, by="FAO_AreaName", all=T)
str(MERGE)
'data.frame': 4 obs. of 7 variables:
$ FAO_AreaName : Factor w/ 2 levels "MEX","FSM": 1 1 2 2
$ date_format.x: chr "27/11/2017" "16/11/2017" NA NA
$ title.x : chr "México es sede de reunión regional sobre la iniciativa de Crecimiento Azul de la FAO " "Lograr el hambre cero pasa por reducir la pérdida y desperdicio de alimentos" NA NA
$ bodytext.x : chr " \r\n\r\nLa Comisión Nacional de Acuacultura y Pesca de México es anfitriona de la principal reunión sobre la actividad en Amér"| __truncated__ " \r\n\r\nSe realiza Foro sobre el desperdicio y pérdida de alimentos en México: retos y soluciones, organizado en el Senado de"| __truncated__ NA NA
$ date_format.y: chr NA NA "11/11/2017" "11/11/2017"
$ title.y : chr NA NA "Pacific leaders alarmed over climate change’s negative impact on food systems and food security" "Pacific leaders alarmed over climate change’s negative impact on food systems and food security"
$ bodytext.y : chr NA NA "11 November 2017, Rome – Climate change poses an alarming threat to food systems and food security in the Pacific islands, warn"| __truncated__ "11 November 2017, Rome – Climate change poses an alarming threat to food systems and food security in the Pacific islands, warn"| __truncated__
Of course, I don't want duplicated variables such as the .x and .y columns.
Answer 0 (score: 0)
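You are very close, even though using cbind creates a list (not a data frame or data table!) as the intermediate data type. We will use rbind and change how you build the intermediate data structures to get you to the right solution.
The steps are as follows:
Your code needs one small change. Instead of using cbind to produce a list, this solution uses data.frame to produce a data frame. Note that the stringsAsFactors argument is set to FALSE so that your strings are of type chr rather than Factor.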
library(pacman)
p_load(rvest,
stringr,
tidyr,
data.table,
plyr,
xml2,
selectr,
tibble,
purrr,
datapasta,
jsonlite)
# code omitted for brevity, see original post above
MEX <- data.frame(FAO_AreaName, FAO_AreaName_1, date,title, content_MEX,
stringsAsFactors = F)
# code omitted for brevity, see original post above
FSM <- data.frame(FAO_AreaName, FAO_AreaName_1, date,title, content_FSM,
stringsAsFactors = F)
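Since the two blocks differ only in the country code, the construction could be wrapped in a small helper. The following is only a sketch: `news_to_df` and `toy_news` are hypothetical names, the values are invented, and the column names `title`, `date_format` and `bodytext` are assumed from the str() output rather than confirmed against the live JSON.

```r
# Hypothetical helper: turn one parsed news table into a tidy data frame.
# Assumes `news` has columns title, date_format and bodytext.
news_to_df <- function(iso3, news) {
  data.frame(FAO_AreaName = iso3,
             date_format  = news$date_format,
             title        = news$title,
             bodytext     = news$bodytext,
             stringsAsFactors = FALSE)
}

# Invented stand-in for the result of fromJSON(news_url)
toy_news <- data.frame(title       = c("First headline", "Second headline"),
                       date_format = c("27/11/2017", "16/11/2017"),
                       bodytext    = c("Body one", "Body two"),
                       stringsAsFactors = FALSE)

MEX_toy <- news_to_df("MEX", toy_news)
str(MEX_toy)
```

Selecting columns by name instead of by position (news[3], news[5], news[6]) also avoids picking the same column twice by accident.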
Combining the two data frames is then simple:
> df <- rbind(MEX, FSM)
> dim(df)
[1] 4 5
> str(df, nchar.max = 30)
'data.frame': 4 obs. of 5 variables:
$ FAO_AreaName: chr "MEX" "MEX" "FSM" "FSM"
$ bodytext : chr " \r\n\r\nLa C"| __truncated__ " \r\n\r\nSe r"| __truncated__ "11 November 2"| __truncated__ "11 November 2"| __truncated__
$ date_format : chr "27/11/2017" "16/11/2017" "11/11/2017" "11/11/2017"
$ title : chr "México es sed"| __truncated__ "Lograr el ham"| __truncated__ "Pacific leade"| __truncated__ "Pacific leade"| __truncated__
$ bodytext.1 : chr " \r\n\r\nLa C"| __truncated__ " \r\n\r\nSe r"| __truncated__ "11 November 2"| __truncated__ "11 November 2"| __truncated__
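To see why rbind avoids the .x and .y columns, it may help to compare the two calls on toy data. This is a minimal sketch with invented values, not the real FAO output: merge joins on the key and places the remaining columns side by side, while rbind stacks rows under the same column names.

```r
# Toy stand-ins for the two scraped data sets (values are invented)
MEX_toy <- data.frame(FAO_AreaName = "MEX",
                      date_format  = c("27/11/2017", "16/11/2017"),
                      title        = c("Title A", "Title B"),
                      stringsAsFactors = FALSE)
FSM_toy <- data.frame(FAO_AreaName = "FSM",
                      date_format  = c("11/11/2017", "11/11/2017"),
                      title        = c("Title C", "Title C"),
                      stringsAsFactors = FALSE)

# merge() is a join: non-key columns with equal names get .x / .y suffixes
wide <- merge(MEX_toy, FSM_toy, by = "FAO_AreaName", all = TRUE)
names(wide)

# rbind() appends rows and keeps a single column per variable
long <- rbind(MEX_toy, FSM_toy)
names(long)
```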
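Suggestions for improvement
rbind is one way to combine the data frames; dplyr offers many others, including bind_rows and various kinds of joins.
The bodytext column (FAO_AreaName_1) contains characters such as \r\n\r\nLa C; these backslash escape sequences can be removed.
Notes: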
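Note 1: Some people will run into problems with fromJSON because they are on an office network and hit some kind of timeout error. If that is the case, use this workaround: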
# workaround in office
download.file(news_url, destfile = "scrapedpage.html", quiet=TRUE)
news<- fromJSON("scrapedpage.html")
Note 2: Because of the way you extract data from the JSON data structure, your data carries names. For example:
> names(date)
[1] "date_format"
If you want to override these names, you need to modify your code slightly:
> MEX <- data.frame(FAO_AreaName = FAO_AreaName, FAO_AreaName_1, date,title, content_MEX,
+ stringsAsFactors = F)
> names(MEX)
[1] "FAO_AreaName" "bodytext" "date_format" "title" "bodytext.1"
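As a closing sketch, the renaming from Note 2 and the clean-up of line-break characters in the body text could be done together after the data frame is built. The data below is invented, and the new column names (country, published, headline, body) are purely illustrative choices, not names from the FAO data.

```r
# Toy stand-ins for the pieces extracted from the JSON (invented values)
date    <- data.frame(date_format = c("27/11/2017", "16/11/2017"),
                      stringsAsFactors = FALSE)
title   <- data.frame(title = c("First headline", "Second headline"),
                      stringsAsFactors = FALSE)
content <- data.frame(bodytext = c(" \r\n\r\nFirst paragraph.\r\nSecond paragraph. ",
                                   " \r\n\r\nAnother story. "),
                      stringsAsFactors = FALSE)

MEX_toy <- data.frame(FAO_AreaName = "MEX", date, title, content,
                      stringsAsFactors = FALSE)

# Replace all inherited column names in one step (illustrative names)
names(MEX_toy) <- c("country", "published", "headline", "body")

# Collapse runs of carriage returns / line feeds into one space, then trim
MEX_toy$body <- trimws(gsub("[\r\n]+", " ", MEX_toy$body))
MEX_toy$body
```

Assigning to names() after construction is used here because it works the same way whether the pieces are vectors or one-column data frames.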