Merging data.frame rows that share common column values

Time: 2013-11-06 17:56:53

Tags: r merge dataframe

Please tell me how to transform a data frame like this:

    tg  qr  loc a1  a2  a3  b1  b2  b3  c1  c2  c3
1   A   1   89  NA  NA  NA  1   2   3   1   2   3
2   A   1   61  1   2   3   NA  NA  NA  1   2   3
3   A   2   38  4   5   6   NA  NA  NA  NA  NA  NA
4   B   1   40  4   5   6   NA  NA  NA  NA  NA  NA
5   B   1   3   NA  NA  NA  NA  NA  NA  4   5   6

into this:

    tg  qr  loc a1  a2  a3  b1  b2  b3  c1  c2  c3
1   A   1   15  1   2   3   1   2   3   1   2   3
2   A   2   95  4   5   6   NA  NA  NA  NA  NA  NA
3   B   1   42  4   5   6   NA  NA  NA  4   5   6

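The function should:

  • merge all rows that have the same values in columns 'tg' and 'qr' (named target and query in the code below) into a single row
  • when merging, replace every NA with an existing value, never the other way round
  • handle the common case where a variable is present in both of the rows being merged; its values are then always equal, so it does not matter which row the value is taken from
  • ignore the 'loc' column: its values differ between rows but are irrelevant, and the column can even be dropped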

The code for these example data frames is:

df = rbind(c("A","1",floor(runif(1,1,100)),c(NA,NA,NA),c(1,2,3),c(1,2,3)),
           c("A","1",floor(runif(1,1,100)),c(1,2,3),c(NA,NA,NA),c(1,2,3)),
           c("A","2",floor(runif(1,1,100)),c(4,5,6),c(NA,NA,NA),c(NA,NA,NA)),
           c("B","1",floor(runif(1,1,100)),c(4,5,6),c(NA,NA,NA),c(NA,NA,NA)),
           c("B","1",floor(runif(1,1,100)),c(NA,NA,NA),c(NA,NA,NA),c(4,5,6)))
df = as.data.frame(df)
colnames(df) = c("target","query","loc",c("a1","a2","a3"),c("b1","b2","b3"),c("c1","c2","c3"))

df2 = rbind(c("A","1",floor(runif(1,1,100)),c(1,2,3),c(1,2,3),c(1,2,3)),
            c("A","2",floor(runif(1,1,100)),c(4,5,6),c(NA,NA,NA),c(NA,NA,NA)),
            c("B","1",floor(runif(1,1,100)),c(4,5,6),c(NA,NA,NA),c(4,5,6)))
df2 = as.data.frame(df2)
colnames(df2) = c("target","query","loc",c("a1","a2","a3"),c("b1","b2","b3"),c("c1","c2","c3"))

Thanks for your support.

2 answers:

Answer 0: (score: 2)

Use na.omit to take the first non-missing value of each column within every group:

library(data.table)
dt = data.table(df)

dt[, lapply(.SD, function(x) na.omit(x)[1]), by = list(target, query)]
#   target query loc a1 a2 a3 b1 b2 b3 c1 c2 c3
#1:      A     1  21  1  2  3  1  2  3  1  2  3
#2:      A     2  71  4  5  6 NA NA NA NA NA NA
#3:      B     1  25  4  5  6 NA NA NA  4  5  6
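
To show why na.omit(x)[1] does the job, here is a minimal base R sketch. The helper name first_non_na and the small data frame df_demo are hypothetical stand-ins, not part of the answer; df_demo uses typed columns and leaves out 'loc'. The grouping is done with aggregate() instead of data.table, purely for illustration.

```r
# Hypothetical helper wrapping the expression used in the answer.
first_non_na <- function(x) na.omit(x)[1]

first_non_na(c(NA, 2, 2))    # first non-missing value
first_non_na(c(NA, NA, NA))  # nothing is left after na.omit, so [1] yields NA

# Small typed stand-in for the question's data ('loc' omitted).
df_demo <- data.frame(
  target = c("A", "A", "A", "B", "B"),
  query  = c(1, 1, 2, 1, 1),
  a1 = c(NA, 1, 4, 4, NA),
  b1 = c(1, NA, NA, NA, NA),
  c1 = c(1, 1, NA, NA, 4)
)

# Apply the helper to every value column within each (target, query) group.
merged <- aggregate(df_demo[c("a1", "b1", "c1")],
                    by = df_demo[c("target", "query")],
                    FUN = first_non_na)
merged <- merged[order(merged$target, merged$query), ]
print(merged)
```

A group that has no value at all for a column keeps NA, because indexing an empty vector with [1] returns NA.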

Answer 1: (score: 1)

Something like this, maybe?

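library(data.table)
dt <- data.table(df)
# 'loc' is irrelevant to the merge (see the question), so drop it
dt[, loc := NULL]
# convert via as.character so that factor columns yield their values, not their level codes
dt <- dt[, lapply(.SD, function(x) as.numeric(as.character(x))), by = c("target","query")]
dt2 <- dt[, lapply(.SD, mean, na.rm = TRUE), by = c("target","query")]
# mean() of an all-NA group gives NaN; turn those back into NA
dt2[is.na(dt2)] <- NA

> dt2
   target query a1 a2 a3 b1 b2 b3 c1 c2 c3
1:      A     1  1  2  3  1  2  3  1  2  3
2:      A     2  4  5  6 NA NA NA NA NA NA
3:      B     1  4  5  6 NA NA NA  4  5  6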
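
Both answers rely on the question's assumption that, within a group, the non-missing values of a variable always agree. Below is a minimal sketch for checking that assumption before merging. The helper has_conflict and the data frame df_demo are hypothetical, and the data is made up so that one group deliberately disagrees.

```r
# Hypothetical helper: TRUE when a vector holds more than one distinct non-NA value.
has_conflict <- function(x) length(unique(x[!is.na(x)])) > 1

# Made-up data: group B/1 disagrees on c1 so that the check has something to find.
df_demo <- data.frame(
  target = c("A", "A", "B", "B"),
  query  = c(1, 1, 1, 1),
  a1 = c(NA, 1, 4, 4),
  c1 = c(1, 1, 4, 5)
)

# One row per (target, query) group, one logical flag per value column.
conflicts <- aggregate(df_demo[c("a1", "c1")],
                       by = df_demo[c("target", "query")],
                       FUN = has_conflict)
print(conflicts)
```

If any flag is TRUE, taking the first non-NA value or the mean would silently pick or blend conflicting values, so those groups are worth inspecting first.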