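I am building a complete time series of an indicator for a set of years and countries from several datasets of differing quality. Using reshape2, I have "melted" these datasets into a single data frame.
Example dataset: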
d <- structure(list(cntry = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L,
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("BE",
"DE", "GE"), class = "factor"), year = c(1960L, 1970L, 1980L,
1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L,
1970L, 1960L, 1970L, 1960L, 1970L, 1970L, 1980L), indicator = c(5.5,
1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3, 1.4, NA, 1.4, NA, NA,
2.3, 1.4, 1.4, NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "male", class = "factor"),
source = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Council",
"Eurostat", "OECD"), class = "factor")), .Names = c("cntry",
"year", "indicator", "sex", "source"), class = "data.frame", row.names = c(NA,
-19L))
d
# cntry year indicator sex source
# 1 BE 1960 5.5 male Eurostat
# 2 BE 1970 1.2 male Eurostat
# 3 BE 1980 1.5 male Eurostat
# 4 DE 1960 NA male Eurostat
# 5 DE 1970 1.4 male Eurostat
# 6 GE 1960 NA male Eurostat
# 7 GE 1970 NA male Eurostat
# 8 BE 1960 5.5 male OECD
# 9 BE 1970 1.2 male OECD
# 10 DE 1960 2.3 male OECD
# 11 DE 1970 1.4 male OECD
# 12 GE 1960 NA male OECD
# 13 GE 1970 1.4 male OECD
# 14 BE 1960 NA male Council
# 15 BE 1970 NA male Council
# 16 DE 1960 2.3 male Council
# 17 DE 1970 1.4 male Council
# 18 GE 1970 1.4 male Council
# 19 GE 1980 NA male Council
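I was hoping I could use cast() with fun.aggregate to convert this long dataset to wide format while, for each country/year combination, filling gaps with the value from the highest-quality dataset that has one (Eurostat > OECD > Council). Unfortunately, I don't really understand how to use a custom aggregation function like that.
In other words, I want to reshape the dataset from long to wide format while merging multiple values according to the value of a factor ("source"). Ideally it would work like this: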
full_data <- expand.grid(c('BE', 'GE', 'DE'), c('1960', '1970', '1980'))
full_data <- fill_missings(full_data, d, pref_order=c('Eurostat', 'OECD', 'Council'))
full_data
# BE 1960 5.5 male Eurostat
# BE 1970 1.2 male Eurostat
# BE 1980 1.5 male Eurostat
# DE 1960 2.3 male OECD
# DE 1970 1.4 male Eurostat
# DE 1980 NA NA NA
# GE 1960 NA male Council
# GE 1970 1.4 male OECD
# GE 1980 NA male Council
And optionally (or directly) in wide format:
# cntry sex 1960 1970 1980
# BE male 5.5 1.2 1.5
# DE male 2.3 1.4 NA
# GE male NA 1.4 NA
Answer 0 (score: 2)
Assuming the data is ordered the way you want, i.e. the source column is sorted with Eurostat first, then OECD, then Council, I'd use data.table this way:
require(data.table) # >= v1.9.0
setDT(d) # converts data.frame to data.table by reference
dcast.data.table(d, cntry + sex ~ year, value.var="indicator",
subset=.(!duplicated(d, by=c("cntry", "year", "indicator")) & !is.na(indicator)))
# cntry sex 1960 1970 1980
# 1: BE male 5.5 1.2 1.5
# 2: DE male 2.3 1.4 NA
# 3: GE male NA 1.4 NA
Answer 1 (score: 1)
I'm not sure this meets all of your expectations, but it sounds like you are looking for something like the following:
toMerge <- expand.grid(cntry = c("BE", "DE", "GE"),
year = c(1960, 1970, 1980),
source = c("Eurostat", "OECD", "Council"),
sex = "male")
d2 <- merge(d, toMerge, all = TRUE)
d2$source <- factor(d2$source, c("Council", "OECD", "Eurostat"), ordered=TRUE)
d2 <- d2[order(d2$source, decreasing=TRUE), ]
Rank <- with(d2, ave(indicator, d2[c("cntry", "year", "sex")],
FUN = function(x) rank(x, ties.method="first", na.last=TRUE)))
D <- d2[Rank == 1, ]
D
# cntry year sex source indicator
# 2 BE 1960 male Eurostat 5.5
# 5 BE 1970 male Eurostat 1.2
# 8 BE 1980 male Eurostat 1.5
# 14 DE 1970 male Eurostat 1.4
# 17 DE 1980 male Eurostat NA
# 20 GE 1960 male Eurostat NA
# 26 GE 1980 male Eurostat NA
# 12 DE 1960 male OECD 2.3
# 24 GE 1970 male OECD 1.4
library(reshape2)
dcast(D, cntry ~ year, value.var="indicator")
# cntry 1960 1970 1980
# 1 BE 5.5 1.2 1.5
# 2 DE 2.3 1.4 NA
# 3 GE NA 1.4 NA
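Note that rank() in the code above ranks the indicator values, so the source order only breaks ties. That is fine for the sample data, where the sources agree. If the sources could disagree, one option is to rank the sources themselves. The following is a minimal base R sketch of the fill_missings() helper imagined in the question; the function body, the toy values and the grid are assumptions made for illustration, not an existing API.

```r
# Toy data (hypothetical values, chosen so that two sources disagree on BE 1960)
d <- data.frame(
  cntry     = c("BE", "BE", "DE", "DE", "DE", "GE"),
  year      = c(1960, 1960, 1960, 1960, 1970, 1960),
  indicator = c(5.5, 9.9, NA, 2.3, 1.4, NA),
  source    = c("Eurostat", "OECD", "Eurostat", "OECD", "Council", "OECD"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each country/year in `grid`, keep the non-missing
# value that comes from the most preferred source in `pref_order`.
fill_missings <- function(grid, data, pref_order) {
  data <- data[!is.na(data$indicator), ]                  # drop missing values
  data <- data[order(match(data$source, pref_order)), ]   # best source first
  data <- data[!duplicated(data[c("cntry", "year")]), ]   # first row per group
  out <- merge(grid, data, by = c("cntry", "year"), all.x = TRUE)
  out <- out[order(out$cntry, out$year), ]
  rownames(out) <- NULL
  out
}

grid <- expand.grid(cntry = c("BE", "DE", "GE"), year = c(1960, 1970),
                    stringsAsFactors = FALSE)
full_data <- fill_missings(grid, d, pref_order = c("Eurostat", "OECD", "Council"))
full_data
# BE 1960 takes 5.5 from Eurostat (not OECD's 9.9); DE 1960 falls back to OECD;
# combinations with no usable value stay NA.
```

The long result can then be passed to dcast(), as in the last step of the answer above.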
Answer 2 (score: 1)
Perhaps the following would also work:
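library(reshape2)
# Melt so that the indicator values end up in a "value" column
x <- melt(d, id.vars = c("cntry", "year", "source", "sex"))
# Spread to one column per source
y <- dcast(x, cntry + year + sex ~ source, value.var = "value")
# Prefer Eurostat, then OECD, then Council
y$selected.value <- ifelse(is.na(y$Eurostat), yes = ifelse(is.na(y$OECD), yes = y$Council, no = y$OECD), no = y$Eurostat)
# Name the value column explicitly instead of letting dcast guess it
dcast(y, cntry + sex ~ year, value.var = "selected.value")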
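This uses nested ifelse statements to select the source. With this approach, the information about which source was selected is lost; if that is a problem, a similar ifelse statement can be added to create a variable that records the original source: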
y$selected.source <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes="Council",no="OECD"),no="Eurostat")
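With more than three sources the nested ifelse calls become hard to read. As a sketch of one way to generalize them, the selection can be written as a fold over the source columns in preference order. The wide table below uses made-up values, and pref_order and first_ok are names introduced here for illustration.

```r
# Toy wide table (hypothetical values): one column per source
y <- data.frame(
  cntry    = c("BE", "DE", "GE"),
  Eurostat = c(5.5, NA, NA),
  OECD     = c(9.9, 2.3, NA),
  Council  = c(NA, 2.4, NA)
)
pref_order <- c("Eurostat", "OECD", "Council")

# Fold over the columns in preference order: keep the running result and fall
# back to the next source only where the result is still NA.
y$selected.value <- Reduce(function(acc, nxt) ifelse(is.na(acc), nxt, acc),
                           y[pref_order])

# Position of the first non-missing source in each row (NA if all are missing)
first_ok <- apply(!is.na(y[pref_order]), 1, function(ok) which(ok)[1])
y$selected.source <- pref_order[first_ok]
y
```

Changing the priority then only requires reordering pref_order.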
Answer 3 (score: 0)
Here is another option:
library(reshape2)
d$source <- factor(d$source, levels=c('Eurostat', 'OECD', 'Council'))
d2 <- d[1:4]
d2[[3]] <- lapply(split(d, 1:nrow(d)), `[`, c(3, 5))
dcast(
d2, cntry + sex ~ year, value.var="indicator",
fun.aggregate=function(x) {
if(!length(x)) return(NA_real_)
xs <- do.call(rbind, x)
xs <- xs[complete.cases(xs), ]
if(nrow(xs)) xs[order(as.numeric(xs$source)), "indicator"][[1L]] else NA_real_
} )
Produces:
cntry sex 1960 1970 1980
1 BE male 105.5 101.2 101.5
2 DE male 2.3 101.4 NA
3 GE male NA 1.4 NA
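Note that I added 100 to the "Eurostat" values to distinguish them from the others, since in this sample set they appear to be equal.
Basically, we cheat by turning the indicator column into a list column whose items hold both the indicator and the source, and then use fun.aggregate to pick, from each group, the item value with the lowest source level (note that we reset the factor levels so that the most desirable source has the lowest level).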
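The same idea, picking the value whose source has the lowest factor level, can also be sketched in base R without a list column by aggregating over row indices with tapply(). This is an alternative illustration rather than the reshape2 approach above; the toy values and the pick_best() helper are assumptions.

```r
# Toy data (hypothetical values); factor levels encode the preference order
d <- data.frame(
  cntry     = c("BE", "BE", "DE", "DE", "DE", "GE"),
  year      = c(1960, 1960, 1960, 1960, 1970, 1960),
  indicator = c(5.5, 9.9, NA, 2.3, 1.4, NA),
  source    = factor(c("Eurostat", "OECD", "Eurostat", "OECD", "Council", "OECD"),
                     levels = c("Eurostat", "OECD", "Council"))
)

# Hypothetical helper: given the row numbers of one country/year group,
# return the non-missing indicator from the lowest source level.
pick_best <- function(rows) {
  rows <- rows[!is.na(d$indicator[rows])]
  if (!length(rows)) return(NA_real_)
  d$indicator[rows[which.min(as.integer(d$source[rows]))]]
}

# Aggregate row indices rather than values, so that pick_best() can look up
# both the indicator and its source.
wide <- tapply(seq_len(nrow(d)), list(cntry = d$cntry, year = d$year), pick_best)
wide
```

The result is a matrix with countries as rows and years as columns; combinations that do not occur in the data are left as NA.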