Conditionally filling missing values when reshaping a dataset from long to wide in R

Time: 2014-03-17 15:53:04

Tags: r reshape missing-data reshape2

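I'm building a complete time series of an indicator for a set of years and countries, based on several datasets of differing quality.

Using reshape2, I've "melted" these datasets into a single data frame.

Example dataset: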

d <- structure(list(cntry = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("BE", 
"DE", "GE"), class = "factor"), year = c(1960L, 1970L, 1980L, 
1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 
1970L, 1960L, 1970L, 1960L, 1970L, 1970L, 1980L), indicator = c(5.5, 
1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3, 1.4, NA, 1.4, NA, NA, 
2.3, 1.4, 1.4, NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "male", class = "factor"), 
    source = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Council", 
    "Eurostat", "OECD"), class = "factor")), .Names = c("cntry", 
"year", "indicator", "sex", "source"), class = "data.frame", row.names = c(NA, 
-19L))


d
#    cntry year indicator  sex   source
# 1     BE 1960       5.5 male Eurostat
# 2     BE 1970       1.2 male Eurostat
# 3     BE 1980       1.5 male Eurostat
# 4     DE 1960        NA male Eurostat
# 5     DE 1970       1.4 male Eurostat
# 6     GE 1960        NA male Eurostat
# 7     GE 1970        NA male Eurostat
# 8     BE 1960       5.5 male     OECD
# 9     BE 1970       1.2 male     OECD
# 10    DE 1960       2.3 male     OECD
# 11    DE 1970       1.4 male     OECD
# 12    GE 1960        NA male     OECD
# 13    GE 1970       1.4 male     OECD
# 14    BE 1960        NA male  Council
# 15    BE 1970        NA male  Council
# 16    DE 1960       2.3 male  Council
# 17    DE 1970       1.4 male  Council
# 18    GE 1970       1.4 male  Council
# 19    GE 1980        NA male  Council

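I was hoping I could cast() this long dataset to wide format using fun.aggregate, filling gaps by choosing, for each country-year combination, the value from the highest-quality dataset available (Eurostat > OECD > Council). Unfortunately, I don't really understand how to write a custom aggregation function like that.

In other words, I want to reshape the dataset from long to wide format while merging multiple values according to the value of a factor ("source"). Ideally it would work like this: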

full_data <- expand.grid(c('BE', 'GE', 'DE'), c('1960', '1970', '1980'))
full_data <- fill_missings(full_data, d, pref_order=c('Eurostat', 'OECD', 'Council'))
full_data
# BE 1960 5.5 male Eurostat
# BE 1970 1.2 male Eurostat
# BE 1980 1.5 male Eurostat
# DE 1960 2.3 male OECD
# DE 1970 1.4 male Eurostat
# DE 1980 NA  NA   NA
# GE 1960 NA  male Council 
# GE 1970 1.4 male OECD
# GE 1980 NA  male Council

And optionally (or directly) in wide format:

# cntry  sex 1960 1970 1980
#    BE male  5.5  1.2  1.5
#    DE male  2.3  1.4  NA
#    GE male   NA  1.4  NA

4 answers:

Answer 0: (score: 2)

Assuming the data is as you've shown, i.e. the source column is ordered first by Eurostat, then OECD, then Council, I'd do it this way using data.table:

require(data.table) # >= v1.9.0
setDT(d) # converts data.frame to data.table by reference
dcast.data.table(d, cntry + sex ~ year, value.var="indicator", 
 subset=.(!duplicated(d, by=c("cntry", "year", "indicator")) & !is.na(indicator)))

#    cntry  sex 1960 1970 1980
# 1:    BE male  5.5  1.2  1.5
# 2:    DE male  2.3  1.4   NA
# 3:    GE male   NA  1.4   NA
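
As far as I can tell, the subset above only leaves one value per cell because the sources in the sample agree wherever they overlap. Here is a minimal sketch that ranks rows by source explicitly instead. The toy data, `pref_order`, and the `pref` column are made up for illustration, and it assumes a data.table version in which `dcast()` works directly on a data.table.

```r
library(data.table)

# Toy data (made up for illustration): the sources disagree for BE 1960
dt <- data.table(
  cntry     = c("BE", "BE", "BE", "DE", "DE"),
  year      = c(1960L, 1960L, 1970L, 1960L, 1960L),
  sex       = "male",
  source    = c("OECD", "Eurostat", "OECD", "Eurostat", "Council"),
  indicator = c(9.9, 5.5, 1.2, NA, 2.3)
)

# Assumed preference order, most trusted source first
pref_order <- c("Eurostat", "OECD", "Council")

# Numeric rank of each row's source (1 = most preferred)
dt[, pref := match(source, pref_order)]

# Drop missing values, sort by preference, keep the first row per cell
best <- dt[!is.na(indicator)][order(pref), .SD[1L], by = .(cntry, sex, year)]

# Cast to wide format; cells with no observation come out as NA
wide <- dcast(best, cntry + sex ~ year, value.var = "indicator")
wide
```

Because the selection happens before the cast, each country-year cell holds at most one row, so no aggregation function is needed.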

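Answer 1: (score: 1)

I'm not sure whether this meets all your expectations, but it sounds like you're looking for something like the following: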

toMerge <- expand.grid(cntry = c("BE", "DE", "GE"), 
                       year = c(1960, 1970, 1980), 
                       source = c("Eurostat", "OECD", "Council"), 
                       sex = "male")
d2 <- merge(d, toMerge, all = TRUE)

d2$source <- factor(d2$source, c("Council", "OECD", "Eurostat"), ordered=TRUE)
d2 <- d2[order(d2$source, decreasing=TRUE), ]
Rank <- with(d2, ave(indicator, d2[c("cntry", "year", "sex")], 
                 FUN = function(x) rank(x, ties.method="first", na.last=TRUE)))
D <- d2[Rank == 1, ]
D
#    cntry year  sex   source indicator
# 2     BE 1960 male Eurostat       5.5
# 5     BE 1970 male Eurostat       1.2
# 8     BE 1980 male Eurostat       1.5
# 14    DE 1970 male Eurostat       1.4
# 17    DE 1980 male Eurostat        NA
# 20    GE 1960 male Eurostat        NA
# 26    GE 1980 male Eurostat        NA
# 12    DE 1960 male     OECD       2.3
# 24    GE 1970 male     OECD       1.4

library(reshape2)
dcast(D, cntry ~ year, value.var="indicator")
#   cntry 1960 1970 1980
# 1    BE  5.5  1.2  1.5
# 2    DE  2.3  1.4   NA
# 3    GE   NA  1.4   NA
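
As far as I can tell, the `rank()` call above ranks the indicator values and uses the source ordering only to break ties, so it relies on the sources agreeing. Below is a base R sketch of the `fill_missings()` helper the question wishes for, which selects by source preference directly. The function body and the `keys` argument are my own assumptions, not an existing API, and the data frame rebuilds the question's sample with character columns instead of factors.

```r
# The question's sample data, rebuilt compactly (characters, not factors)
d <- data.frame(
  cntry = c("BE", "BE", "BE", "DE", "DE", "GE", "GE", "BE", "BE", "DE",
            "DE", "GE", "GE", "BE", "BE", "DE", "DE", "GE", "GE"),
  year = c(1960, 1970, 1980, 1960, 1970, 1960, 1970, 1960, 1970, 1960,
           1970, 1960, 1970, 1960, 1970, 1960, 1970, 1970, 1980),
  indicator = c(5.5, 1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3,
                1.4, NA, 1.4, NA, NA, 2.3, 1.4, 1.4, NA),
  sex = "male",
  source = rep(c("Eurostat", "OECD", "Council"), times = c(7, 6, 6)),
  stringsAsFactors = FALSE
)

# Hypothetical helper, named after the function wished for in the question
fill_missings <- function(grid, data, pref_order, keys = c("cntry", "year")) {
  # Drop missing observations, then sort by source preference
  obs <- data[!is.na(data$indicator), ]
  obs <- obs[order(match(obs$source, pref_order)), ]
  # After sorting, the first row per key combination is the preferred one
  best <- obs[!duplicated(obs[keys]), ]
  # Left-join onto the full grid so unobserved combinations appear as NA
  out <- merge(grid, best, by = keys, all.x = TRUE)
  out[do.call(order, unname(as.list(out[keys]))), ]
}

full_data <- expand.grid(cntry = c("BE", "DE", "GE"),
                         year = c(1960, 1970, 1980),
                         stringsAsFactors = FALSE)
full_data <- fill_missings(full_data, d,
                           pref_order = c("Eurostat", "OECD", "Council"))
full_data
```

Combinations with no non-missing observation in any source (DE 1980, GE 1960, GE 1980) end up with NA in indicator, sex, and source. The long result can then be cast to wide format with `dcast()` as above.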

Answer 2: (score: 1)

Perhaps the following would also work:

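library(reshape2)
# Melt so that "indicator" becomes the value column
x <- melt(d, id.vars = c("cntry", "year", "source", "sex"))
# One column per source
y <- dcast(x, cntry + year + sex ~ source, value.var = "value")
# Take the first non-missing value in the order Eurostat, OECD, Council
y$selected.value <- ifelse(is.na(y$Eurostat), yes = ifelse(is.na(y$OECD), yes = y$Council, no = y$OECD), no = y$Eurostat)
# Cast the selected values to wide format, one column per year
dcast(y, cntry + sex ~ year, value.var = "selected.value")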

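This uses nested ifelse statements to select the source. With this approach you lose the information about which source was selected; if that's a problem, you can add a similar ifelse statement that creates a source-of-origin variable: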

y$selected.source <- ifelse(is.na(y$Eurostat),yes=ifelse(is.na(y$OECD),yes="Council",no="OECD"),no="Eurostat")
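
With more than three sources the nested ifelse calls get unwieldy. Here is a minimal sketch of the same idea for an arbitrary preference order, in base R. The toy wide table and the names `coalesce2`, `pref_order`, and `first_hit` are made up for illustration.

```r
# Toy wide table (made up for illustration): one column per source
y <- data.frame(
  cntry    = c("BE", "DE", "GE"),
  Eurostat = c(5.5, NA, NA),
  OECD     = c(5.5, 2.3, NA),
  Council  = c(NA, 2.3, NA)
)

# Assumed preference order, most trusted source first
pref_order <- c("Eurostat", "OECD", "Council")

# Element-wise "first non-missing": keep a where present, else fall back to b
coalesce2 <- function(a, b) ifelse(is.na(a), b, a)

# Fold the source columns together in preference order
y$selected.value <- Reduce(coalesce2, y[pref_order])

# Position of the first non-missing source per row (NA if all are missing)
first_hit <- apply(!is.na(y[pref_order]), 1, function(ok) which(ok)[1])
y$selected.source <- pref_order[first_hit]
y
```

Changing the priority then only means reordering `pref_order`.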

Answer 3: (score: 0)

Here's another option:

library(reshape2)
d$source <- factor(d$source, levels=c('Eurostat', 'OECD', 'Council'))
d2 <- d[1:4]
d2[[3]] <- lapply(split(d, 1:nrow(d)), `[`, c(3, 5))
dcast(
  d2, cntry + sex ~ year, value.var="indicator", 
  fun.aggregate=function(x) {
    if(!length(x)) return(NA_real_)
    xs <- do.call(rbind, x)
    xs <- xs[complete.cases(xs), ]
    if(nrow(xs)) xs[order(as.numeric(xs$source)), "indicator"][[1L]] else NA_real_
} )

Produces:

  cntry  sex  1960  1970  1980
1    BE male 105.5 101.2 101.5
2    DE male   2.3 101.4    NA
3    GE male    NA   1.4    NA

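Note that I added 100 to the "Eurostat" values to distinguish them from the others, since in this sample set they appear to be equal.

Basically, we cheat by turning the indicator column into a list column whose items contain both the indicator and the source, and then use fun.aggregate to pick, from each group, the item value with the lowest source level (note that we reset the factor levels so that the most desirable source has the lowest level).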