I have a very messy data frame (web-scraped) that unfortunately contains many double and even triple entries. Most of the data frame looks like this:
> df1<-data.frame(var1=c("a","a","b","b","c","c","d","d"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d"))
> df1
  var1    var2      var3
1    a right.a correct.a
2    a    <NA> correct.a
3    b right.b correct.b
4    b    <NA> correct.b
5    c right.c correct.c
6    c    <NA> correct.c
7    d right.d correct.d
8    d    <NA> correct.d
"var1" is the ID variable I need to aggregate on. My goal is to end up with a data frame like this:
  var1    var2      var3
1    a right.a correct.a
2    b right.b correct.b
3    c right.c correct.c
4    d right.d correct.d
The main problem, however, is that not all of the data frame looks like this. In fact, other parts look like this:
> df2<-data.frame(var1=c("e","e","e","f","f","g","g","g"),var2=c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),var3=c("correct.e","correct.e",NA,"correct.f",NA,"correct.g","correct.g",NA))
> df2
  var1    var2      var3
1    e    <NA> correct.e
2    e    <NA> correct.e
3    e right.e      <NA>
4    f    <NA> correct.f
5    f    <NA>      <NA>
6    g    <NA> correct.g
7    g right.g correct.g
8    g    <NA>      <NA>
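and other variations. In the end, each ID should have a single row containing the correct var2 and var3. This is where I am lost: my var1 is not unique. However, I know that duplicated IDs that "belong together" are grouped in the data frame (as in my example); for example, there may be another "a" in rows 4102 and 4103.
What I think could work is to use aggregate with var1 as the ID, but additionally tell R that, while doing so, aggregate should only compare var1 within the next 2 rows. Any ideas how to code this?
Thanks!
Answer 0 (score: 2)
Here is a method using data.table: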
library(data.table)
setDT(df1)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d
and
setDT(df2)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    e right.e correct.e
2:    f      NA correct.f
3:    g right.g correct.g
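The idea in var2[!is.na(var2)][1], for example, is to take the first non-missing value of var2. If all values are missing, NA is returned. This is done for both variables within each var1 group.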
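To make the indexing idiom concrete, here is a minimal sketch in base R. The helper name first_non_na is hypothetical (it does not appear in the answer); it just wraps the same expression so it can be tried on plain vectors.

```r
# Hypothetical helper wrapping the idiom x[!is.na(x)][1]:
# drop the missing values, then take the first remaining element.
first_non_na <- function(x) x[!is.na(x)][1]

# A group with one usable value among missing ones
some_missing <- c(NA, NA, "right.e")
first_non_na(some_missing)

# A group where everything is missing: indexing an empty vector
# with [1] yields NA, so the group still produces a row
all_missing <- c(NA_character_, NA_character_)
first_non_na(all_missing)
```

Because an empty vector indexed with [1] gives NA, no special case is needed for groups that have no valid value at all.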
If you have more than two variables, you can switch to lapply. For example:
df1[, lapply(.SD, function(i) i[!is.na(i)][1]), by=var1]
   var1    var2      var3
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d
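In the case where a single var1 has several valid values, each indicated by a non-missing var2, you can get the desired result with a join.
Data from the comments: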
df1<-data.frame(var1=c("a","a","b","b","c","c","d","d","a","a"),
var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d","correct.a1","correct.a1"))
Then, with these data,
setDT(df1)[df1[, .(var2=var2[!is.na(var2)]), by=var1], on=.(var1, var2)]
   var1     var2       var3
1:    a  right.a  correct.a
2:    a right.a1 correct.a1
3:    b  right.b  correct.b
4:    c  right.c  correct.c
5:    d  right.d  correct.d
Here, all non-missing var2 observations for each var1 are joined back onto the original data set.
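The question also mentions that the same ID can reappear far away in the data (another "a" around row 4102) and that rows belonging together are adjacent. One possible way to use that, sketched here as an assumption rather than as part of the answer above, is to group by runs of consecutive identical var1 values with data.table's rleid(). The small data frame df and the helper first_non_na are made up for illustration.

```r
library(data.table)

# Illustrative data (not from the question): "a" appears in two
# separate blocks that should stay separate entities.
df <- data.frame(
  var1 = c("a", "a", "b", "b", "a", "a"),
  var2 = c("right.a", NA, "right.b", NA, "right.a1", NA),
  var3 = c("correct.a", "correct.a", "correct.b", NA, NA, "correct.a1"),
  stringsAsFactors = FALSE
)
dt <- as.data.table(df)

# Hypothetical helper: first non-missing value, NA if none exists
first_non_na <- function(x) x[!is.na(x)][1]

# rleid(var1) gives each run of consecutive equal IDs its own number,
# so the second block of "a" forms its own group.
res <- dt[, lapply(.SD, first_non_na), by = .(run = rleid(var1), var1)]
res[, run := NULL]  # drop the helper column once grouping is done
print(res)
```

Unlike the join, this does not need a non-missing var2 to mark each entity, but it does rely on rows of one entity really being contiguous.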
Answer 1 (score: 1)
If var2 and var3 have only one unique value for each level of var1, then:
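library(dplyr)
# Stack both example data frames, then keep the first non-missing value per column
df <- rbind(df1, df2)
df %>% group_by(var1) %>%
  # funs() is deprecated since dplyr 0.8; across() requires dplyr >= 1.0
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))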
  var1 var2    var3
1 a    right.a correct.a
2 b    right.b correct.b
3 c    right.c correct.c
4 d    right.d correct.d
5 e    right.e correct.e
6 f    <NA>    correct.f
7 g    right.g correct.g
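Since the question asks about aggregate specifically, the same first-non-missing idea can probably be expressed in base R as well. This is a sketch under the same assumption as above (one valid value per var1); the data frame df is a cut-down copy of the question's second example and first_non_na is a hypothetical helper name.

```r
# Cut-down version of the question's second example, as character columns
df <- data.frame(
  var1 = c("e", "e", "e", "f", "f", "g", "g", "g"),
  var2 = c(NA, NA, "right.e", NA, NA, NA, "right.g", NA),
  var3 = c("correct.e", "correct.e", NA, "correct.f", NA,
           "correct.g", "correct.g", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing value, NA if none exists
first_non_na <- function(x) x[!is.na(x)][1]

# The non-formula interface passes missing values through to FUN;
# only rows with a missing grouping value would be dropped.
res <- aggregate(df[c("var2", "var3")],
                 by = list(var1 = df$var1),
                 FUN = first_non_na)
print(res)
```

The non-formula interface is used on purpose: the formula interface of aggregate drops rows containing NA by default, which would discard most of this data before FUN ever sees it.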