R: How do I delete rows that differ from another row in only one (two, three) columns?

Time: 2018-07-05 11:00:08

Tags: r

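I have a data frame like the example below. Sometimes a row contains the same information about an object as another row, except for one (or a few) columns that contain "NA". I only want to keep the rows with as much information as possible, so I want to delete every row that contains "NA" but otherwise matches another row. The "NA" can be in column C or D, or in both (never in A or B). If there is no "more complete" row, the row containing "NA" must be kept.

I tried a for loop (see the example). It works: rows 1 and 6 are deleted. However, I would have to adapt it to check column C as well, and my real data has many more columns and therefore many more possible combinations, which makes this approach impractical and error-prone.

Is there another way to solve this easily? Thanks!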

df <- rbind(data.frame(A = "obj1", B = "1", C = "2", D = "NA"), 
            data.frame(A = "obj1", B = "1", C = "2", D = "3"),
            data.frame(A = "obj2", B = "1", C = "NA", D = "3"),
            data.frame(A = "obj2", B = "1", C = "2", D = "3"),
            data.frame(A = "obj2", B = "3", C = "2", D = "3"),
            data.frame(A = "obj2", B = "3", C = "2", D = "NA"),
            data.frame(A = "obj3", B = "2", C = "4", D = "6"),
            data.frame(A = "obj4", B = "2", C = "NA", D = "NA"))

toBeDeleted <- c(55)

for (i in 1:nrow(df)){
  thisRow <- df[i,]

  if (thisRow$D == "NA"){
    for (j in i:nrow(subset(df, A == thisRow$A))){
      anotherRow <- df[j,]
      if (anotherRow$A == thisRow$A & anotherRow$B == thisRow$B 
          & anotherRow$C == thisRow$C & anotherRow$D != thisRow$D){
        toBeDeleted <- c(toBeDeleted,i)
      }
    }
  }
}

df2 <- df[-toBeDeleted,]

1 answer:

Answer 0: (score: 1)

We can combine duplicated(df[1:2]), duplicated(df[1:2], fromLast = TRUE), and rowSums(is.na(df)) > 0 to exclude all rows that contain an NA and are duplicated:

df <- rbind(data.frame(A = "obj1", B = "1", C = "2", D = NA), 
            data.frame(A = "obj1", B = "1", C = "2", D = "3"),
            data.frame(A = "obj2", B = "1", C = NA, D = "3"),
            data.frame(A = "obj2", B = "1", C = "2", D = "3"),
            data.frame(A = "obj2", B = "3", C = "2", D = "3"),
            data.frame(A = "obj2", B = "3", C = "2", D = NA),
            data.frame(A = "obj3", B = "2", C = "4", D = "6"),
            data.frame(A = "obj4", B = "2", C = NA, D = NA))

df[!((duplicated(df[1:2]) | duplicated(df[1:2], fromLast = TRUE)) & rowSums(is.na(df)) > 0),]

     A B    C    D
2 obj1 1    2    3
4 obj2 1    2    3
5 obj2 3    2    3
7 obj3 2    4    6
8 obj4 2 <NA> <NA>
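
The same filter can be written in named steps, which makes each condition easier to inspect. This is a minimal sketch; the names key, has_twin, has_na, and result are hypothetical and not part of the original answer:

```r
# Same data as above, built in one call with real NA values.
df <- data.frame(
  A = c("obj1", "obj1", "obj2", "obj2", "obj2", "obj2", "obj3", "obj4"),
  B = c("1", "1", "1", "1", "3", "3", "2", "2"),
  C = c("2", "2", NA, "2", "2", "2", "4", NA),
  D = c(NA, "3", "3", "3", "3", NA, "6", NA),
  stringsAsFactors = FALSE
)

# Key columns that never contain NA (A and B).
key <- df[1:2]

# TRUE for every row whose key occurs more than once:
# duplicated() flags later copies, fromLast = TRUE flags earlier ones.
has_twin <- duplicated(key) | duplicated(key, fromLast = TRUE)

# TRUE for every row with at least one real NA.
has_na <- rowSums(is.na(df)) > 0

# Keep everything except rows that have both a twin and an NA.
result <- df[!(has_twin & has_na), ]
print(result)
```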

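This is a simple subset, so no loop is needed, and it is very fast even on large data. It works like this:

We index the data with df[] and use !() to exclude the rows that have duplicates on the first two columns, df[1:2], and have at least one NA value, rowSums(is.na(df)) > 0. For this you need real NA values in your data, not the character "NA" used in the sample data in the question. If you only have "NA" strings, compare against the string instead, for example rowSums(df == "NA") > 0.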
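
One thing to watch: the subset treats rows as duplicates when A and B match, so a row with an NA would also be dropped if its twin differs in a column where both have a value. If that can happen in the real data, a stricter check might look like the sketch below. The helper drop_less_informative and its na_string argument are hypothetical, not part of the original answer, and the approach uses explicit loops, so it trades speed for strictness. It drops a row only when another row agrees on every known value and fills at least one gap.

```r
# Hypothetical helper: drop rows that are strictly less informative
# than some other row. Strings equal to na_string count as missing.
drop_less_informative <- function(df, na_string = "NA") {
  # Compare everything as character so factors and strings behave alike.
  m <- do.call(cbind, lapply(df, as.character))
  m[!is.na(m) & m == na_string] <- NA

  n <- nrow(m)
  drop <- logical(n)

  # Only rows with at least one missing value are candidates.
  for (i in which(rowSums(is.na(m)) > 0)) {
    known <- !is.na(m[i, ])
    for (j in seq_len(n)[-i]) {
      # Row j must agree with row i on every value row i actually has ...
      same_known <- all(!is.na(m[j, known]) & m[j, known] == m[i, known])
      # ... and supply a value in at least one column where row i has none.
      fills_gap <- any(!is.na(m[j, !known]))
      if (same_known && fills_gap) {
        drop[i] <- TRUE
        break
      }
    }
  }
  df[!drop, , drop = FALSE]
}

# Usage with the question's data, where missing values are the string "NA".
df_chr <- data.frame(
  A = c("obj1", "obj1", "obj2", "obj2", "obj2", "obj2", "obj3", "obj4"),
  B = c("1", "1", "1", "1", "3", "3", "2", "2"),
  C = c("2", "2", "NA", "2", "2", "2", "4", "NA"),
  D = c("NA", "3", "3", "3", "3", "NA", "6", "NA"),
  stringsAsFactors = FALSE
)
kept <- drop_less_informative(df_chr)
print(kept)
```

On the sample data this keeps the same rows as the subset above. The two differ only when rows share A and B but disagree on another known value.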