Aggregating based on "nearby" row values

Time: 2017-03-29 17:45:06

Tags: r dataframe aggregate na

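I have a very messy (web-scraped) data frame that unfortunately contains many duplicate and even triplicate entries. Most of the data frame looks like this: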

> df1<-data.frame(var1=c("a","a","b","b","c","c","d","d"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d"))
> df1
  var1    var2      var3
1    a right.a correct.a
2    a    <NA> correct.a
3    b right.b correct.b
4    b    <NA> correct.b
5    c right.c correct.c
6    c    <NA> correct.c
7    d right.d correct.d
8    d    <NA> correct.d

"var1" is the ID variable I need to aggregate by. My goal is a data frame that looks like this:

  var1    var2      var3
1    a right.a correct.a
2    b right.b correct.b
3    c right.c correct.c
4    d right.d correct.d

The main problem, however, is that not all of the data frame looks like this. Other parts look like this:

> df2<-data.frame(var1=c("e","e","e","f","f","g","g","g"),var2=c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),var3=c("correct.e","correct.e",NA,"correct.f",NA,"correct.g","correct.g",NA))
> df2
  var1    var2      var3
1    e    <NA> correct.e
2    e    <NA> correct.e
3    e right.e      <NA>
4    f    <NA> correct.f
5    f    <NA>      <NA>
6    g    <NA> correct.g
7    g right.g correct.g
8    g    <NA>      <NA>

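and other variations. In the end, each ID should have a single row containing the correct var2 and var3. This is where I get lost: my var1 is not unique. I do know, however, that duplicate IDs that "belong together" are grouped together in the data frame (as in my examples); for instance, there could be another "a" in rows 4102 and 4103.

What I think could work is to use aggregate with var1 as the ID, but additionally tell R to only look at var1 within the next 2 rows while aggregating. Any ideas on how to code this?

Thanks!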

2 answers:

Answer 0 (score: 2)

Here is an approach using data.table:

library(data.table)

setDT(df1)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d

setDT(df2)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    e right.e correct.e
2:    f      NA correct.f
3:    g right.g correct.g

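The idea behind var2[!is.na(var2)][1], for example, is to take the first non-missing value of var2. If all values are missing, NA is returned. This is done for both variables, grouped by var1.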
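
As a small illustration of that idiom in isolation, here is a minimal base R sketch. The helper name first_non_na is an assumption introduced for this example and does not appear in the original answer.

```r
# Hypothetical helper (name is illustrative only): first non-missing value of x
first_non_na <- function(x) x[!is.na(x)][1]

# A group with one valid value among NAs: the valid value is picked
first_non_na(c(NA, "right.e", NA))

# A group where everything is missing: subsetting leaves a zero-length
# vector, and indexing it with [1] yields NA
first_non_na(c(NA_character_, NA_character_))
```

Indexing a zero-length vector with [1] returning NA is what makes the all-missing case work without an explicit if/else.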

If you have more than two variables, you can switch to lapply. For example:

df1[, lapply(.SD, function(i) i[!is.na(i)][1]), by=var1]
   var1    var2      var3
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d
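
Since the question mentions aggregate, the same "first non-missing value per group" idea can also be sketched in base R. This is only an illustration under assumptions: the data below mirror df2 from the question with strings kept as character, and first_non_na is a hypothetical helper name.

```r
# Toy data mirroring df2 from the question (strings kept as character)
df <- data.frame(
  var1 = c("e", "e", "e", "f", "f", "g", "g", "g"),
  var2 = c(NA, NA, "right.e", NA, NA, NA, "right.g", NA),
  var3 = c("correct.e", "correct.e", NA, "correct.f", NA,
           "correct.g", "correct.g", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing value, or NA if all are missing
first_non_na <- function(x) x[!is.na(x)][1]

# The data.frame method of aggregate() passes NAs in the value columns
# through to FUN, so each column is collapsed per var1 group
res <- aggregate(df[c("var2", "var3")],
                 by = list(var1 = df$var1),
                 FUN = first_non_na)
res
```

The data.frame method is used here rather than the formula interface because the formula method's default na.action = na.omit would drop rows containing NA before aggregating.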

If a single var1 value can have several valid records, each indicated by a non-missing var2, you can get the desired result with a join.

Data from the comments:

df1<-data.frame(var1=c("a","a","b","b","c","c","d","d","a","a"),
                var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
                var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d","correct.a1","correct.a1"))

Then, with these data,

setDT(df1)[df1[, .(var2=var2[!is.na(var2)]), by=var1], on=.(var1, var2)]
   var1     var2       var3
1:    a  right.a  correct.a
2:    a right.a1 correct.a1
3:    b  right.b  correct.b
4:    c  right.c  correct.c
5:    d  right.d  correct.d

Here, all non-missing var2 observations for each var1 are joined back onto the original data set.
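
The question also notes that rows belonging together are adjacent, while the same ID may reappear much later in the data. One possible way to exploit that, sketched here under assumptions and not part of the original answer, is to group by consecutive runs of var1 using data.table's rleid(). The toy data and the column name run are made up for illustration.

```r
library(data.table)

# Toy data (assumption): "a" appears in two separate blocks
dt <- data.table(
  var1 = c("a", "a", "b", "b", "a", "a"),
  var2 = c("right.a", NA, NA, "right.b", NA, "right.a1"),
  var3 = c(NA, "correct.a", "correct.b", NA, "correct.a1", "correct.a1")
)

# rleid() gives each consecutive run of identical var1 values its own id,
# so the two "a" blocks end up in different groups
dt[, run := rleid(var1)]

# First non-missing value per column within each run
res <- dt[, lapply(.SD, function(x) x[!is.na(x)][1]), by = .(run, var1)]
res
```

Unlike the join above, this does not rely on var2 being non-missing to mark a record; it relies only on related rows being contiguous.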

Answer 1 (score: 1)

If var2 and var3 each have only one unique value per level of var1, then:

library(dplyr)

df = rbind(df1,df2)

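# funs() is deprecated in current dplyr; across() (dplyr >= 1.0.0) does the same job
df %>% group_by(var1) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))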
   var1    var2      var3
1     a right.a correct.a
2     b right.b correct.b
3     c right.c correct.c
4     d right.d correct.d
5     e right.e correct.e
6     f    <NA> correct.f
7     g right.g correct.g