Merging data frames to eliminate missing observations

Time: 2013-04-05 00:25:32

Tags: r merge

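I have two data frames. One (df1) contains all the columns and rows of interest, but has missing observations. The other (df2) contains the values to use in place of those missing observations, and includes only the columns and rows for which df1 had at least one NA. I would like to merge the two data sets somehow to obtain desired.result.

This seems like a very simple problem to solve, but I am drawing a blank. I cannot get merge to work. Maybe I could write nested for-loops, but I have not done so yet. I have also tried aggregate a few times. I am a little afraid to post this question for fear my R card might be revoked. Sorry if this is a duplicate; I searched fairly intensively both here and with Google. Thank you for any advice. A solution in base R is preferable.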

df1 = read.table(text = "
  county year1 year2 year3
    aa     10    20   30
    bb      1    NA    3
    cc      5    10   NA
    dd    100    NA  200
", sep = "", header = TRUE)

df2 = read.table(text = "
  county year2 year3
    bb      2   NA
    cc     NA   15
    dd    150   NA
", sep = "", header = TRUE)

desired.result = read.table(text = "
  county year1 year2 year3
    aa     10    20   30
    bb      1     2    3
    cc      5    10   15
    dd    100   150  200
", sep = "", header = TRUE)

3 Answers:

Answer 0 (score: 9)

aggregate can do this:

aggregate(. ~ county,
          data=merge(df1, df2, all=TRUE), # Merged data, including NAs
          na.action=na.pass,              # Aggregate rows with missing values...
          FUN=sum, na.rm=TRUE)            # ...but instruct "sum" to ignore them.
##   county year2 year3 year1
## 1     aa    20    30    10
## 2     bb     2     3     1
## 3     cc    10    15     5
## 4     dd   150   200   100
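
Two details of the call above are easy to miss: the columns come back in the merged order (year2, year3, year1), and sum(..., na.rm=TRUE) returns 0 rather than NA for a cell that is missing in both data frames. The following is a minimal sketch of one way around both, not part of the original answer. The helper first_non_na is a hypothetical name introduced here, and the sketch assumes each county/year cell has at most one non-missing value across df1 and df2.

```r
# Same input data as in the question.
df1 <- read.table(text = "
  county year1 year2 year3
    aa     10    20   30
    bb      1    NA    3
    cc      5    10   NA
    dd    100    NA  200
", header = TRUE)

df2 <- read.table(text = "
  county year2 year3
    bb      2   NA
    cc     NA   15
    dd    150   NA
", header = TRUE)

# sum() with na.rm = TRUE turns an all-NA group into 0, not NA.
sum(c(NA_integer_, NA_integer_), na.rm = TRUE)   # 0

# Hypothetical helper: first non-missing value of x, or NA if there is none.
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x)) x[1] else NA
}

res <- aggregate(. ~ county,
                 data = merge(df1, df2, all = TRUE),
                 na.action = na.pass,    # keep rows that contain NAs
                 FUN = first_non_na)

# Restore the column order used in df1.
res <- res[, names(df1)]
res
```

With this helper, a cell that is NA in both inputs stays NA in the result instead of silently becoming 0.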

Answer 1 (score: 2)

Do this:

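# Outer merge on county; columns present in both frames get .x / .y suffixes
m <- merge(df1, df2, by="county", all=TRUE)
# Split the suffixed columns: .x comes from df1, .y from df2
dotx <- m[,grepl("\\.x$",names(m))]
doty <- m[,grepl("\\.y$",names(m))]
# Fill the gaps in the df1 columns from the matching df2 cells
dotx[is.na(dotx)] <- doty[is.na(dotx)]
# Drop the .x suffix to restore the original column names
names(dotx) <- sub("\\.x$", "", names(dotx))
# Reattach the unsuffixed columns (county, year1)
result <- cbind(m[,!grepl("\\.[xy]$",names(m))], dotx)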

Check:

> result
  county year1 year2 year3
1     aa    10    20    30
2     bb     1     2     3
3     cc     5    10    15
4     dd   100   150   200
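
The same fill can also be done without merge and its .x/.y suffixes, by looking up each df2 row in df1 with match() and overwriting only the missing cells. This is a sketch in base R rather than the method given above. It assumes county values are unique within each data frame and that every county in df2 also appears in df1.

```r
# Same input data as in the question.
df1 <- read.table(text = "
  county year1 year2 year3
    aa     10    20   30
    bb      1    NA    3
    cc      5    10   NA
    dd    100    NA  200
", header = TRUE)

df2 <- read.table(text = "
  county year2 year3
    bb      2   NA
    cc     NA   15
    dd    150   NA
", header = TRUE)

result <- df1

# Row of df1 that corresponds to each row of df2.
rows <- match(df2$county, df1$county)

# For every value column in df2, copy values only into cells that are
# NA in df1 and non-NA in df2.
for (col in setdiff(names(df2), "county")) {
  fill <- is.na(result[rows, col]) & !is.na(df2[[col]])
  result[rows[fill], col] <- df2[[col]][fill]
}

result
```

Because only the NA cells are overwritten, values already present in df1 are never replaced, and the row and column order of df1 is preserved.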

Answer 2 (score: 2)

Another option, using reshape2 and working in long format:

library(reshape2)
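## reshape to long format
df1.m <- melt(df1, id.vars = "county")
df2.m <- melt(df2, id.vars = "county")
## rows of df1.m whose county and variable both occur in df2.m
idx <- df1.m$county %in% df2.m$county &
       df1.m$variable %in% df2.m$variable
## replace NA values; this relies on df1.m[idx, ] and df2.m
## listing the same county/variable pairs in the same order
df1.m$value[idx] <- ifelse(is.na(df1.m$value[idx]),
                           df2.m$value,
                           df1.m$value[idx])
## get the wide format
dcast(data = df1.m, county ~ variable, value.var = "value")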

  county year1 year2 year3
1     aa    10    20    30
2     bb     1     2     3
3     cc     5    10    15
4     dd   100   150   200