Question

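I have a data frame like the one below, and I want to find its unique rows.

However, the data contains 'NA' for missing values, so an NA could take any value that appears in the other rows. For example, in row c6 the NA in column a2 could be 0, 1 or 2, and in row c8 the NA in column a3 could be 0 or 1.

On the other hand, in rows 1, 2 and 6 all values other than the NAs are the same, and because the NA could be 0 or 1, I want to remove such rows and keep row 2.

Also, in row c6, columns a1 and a3 (that is, excluding the NA columns) are the same as in rows c2 and c5, and the NAs in c6 could take the same values as in c2 and c5, so this row is not unique.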
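The matching rule described above, where an NA acts as a wildcard, can be sketched as a small predicate. The helper name `compatible` is hypothetical and not part of the question; it only illustrates what "the same except for NA" means for two rows.

```r
# Hypothetical helper: TRUE when two rows agree in every column
# that is observed (non-NA) in both of them.
compatible <- function(x, y) {
  both <- !is.na(x) & !is.na(y)   # columns observed in both rows
  all(x[both] == y[both])
}

c2 <- c(2, 1, 0, 0)
c6 <- c(2, NA, 0, NA)
c8 <- c(2, 0, NA, NA)

compatible(c6, c2)  # TRUE:  a1 and a3 agree, so c6 could be a copy of c2
compatible(c8, c2)  # FALSE: a2 differs (0 vs 1)
```

Note that this check is symmetric, so on its own it does not say which of two compatible rows should be kept.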

Data:

     a1   a2   a3   a4
c1    2    1    0   NA
c2    2    1    0    0
c3    2    1    1    0
c4    2    2    0   NA
c5    2    1    0    0
c6    2   NA    0   NA
c7    1   NA    0   NA
c8    2    0   NA   NA

I would like to get this output:

Output:

     a1   a2   a3   a4
c2    2    1    0    0
c3    2    1    1    0
c4    2    2    0   NA
c7    1   NA    0   NA
c8    2    0   NA   NA

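Also, @Sotos's solution helped me a lot, but in its last part, after the NAs are dropped to build a pattern for each row, it gives c8 and c6 the same pattern (20) and removes them. But c8 is actually unique. c7 is unique as well, yet the solution drops it too.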

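c1 <- c(2, 1, 0, NA)
c2 <- c(2, 1, 0, 0)
c3 <- c(2, 1, 1, 0)
c4 <- c(2, 2, 0, NA)
c5 <- c(2, 1, 0, 0)
c6 <- c(2, NA, 0, NA)
c7 <- c(1, NA, 0, NA)
c8 <- c(2, 0, NA, NA)

df <- as.data.frame(rbind(c1, c2, c3, c4, c5, c6, c7, c8))

library(stringr)

df <- unique(df)  # drop exact duplicate rows
# concatenate each row's non-NA values into one string
df$new <- apply(df, 1, function(i) paste(na.omit(i), collapse = ''))
# for each row string, count how many of the row strings contain it
df$new2 <- rowSums(sapply(df$new, function(i) str_detect(i, df$new)))
# keep the rows whose string is contained only in itself
new_df <- subset(df, df$new2 == 1)
new_df <- new_df[, !names(new_df) %in% c('new', 'new2')]
new_df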
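To illustrate why the string patterns collide, here is a small sketch using two rows from the data. Dropping the NAs also drops the column positions, so different rows can collapse to the same string. The helpers `key` and `key_pos` are hypothetical names, and the position-aware variant assumes every value prints as a single character.

```r
c6 <- c(2, NA, 0, NA)
c8 <- c(2, 0, NA, NA)

# Pattern as built in the code above: the NAs are dropped before pasting.
key <- function(x) paste(na.omit(x), collapse = "")
key(c6)  # "20"
key(c8)  # "20"  (same string, although the rows differ)

# Hypothetical position-aware variant: keep a "." placeholder per NA.
# Assumes single-character values, so string position equals column.
key_pos <- function(x) paste(ifelse(is.na(x), ".", x), collapse = "")
key_pos(c6)  # "2.0."
key_pos(c8)  # "20.."

# "." is also the regex wildcard, so a key can be matched against a
# complete row such as c2 ("2100"):
grepl(paste0("^", key_pos(c6), "$"), "2100")  # TRUE
grepl(paste0("^", key_pos(c8), "$"), "2100")  # FALSE
```

This only demonstrates the collision. The regex match handles NAs in the pattern row, not in the row being matched against.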

Answer 1

Well, this is not an R-style solution, and it may become slow when applied to larger data frames...

i <- 1
while (i <= nrow(df)) {  # visit every surviving row, not only the first ones
  to.remove <- integer(0)
  for (j in setdiff(seq_len(nrow(df)), i)) {  # setdiff keeps i and j from being identical
    col.idx <- intersect(which(!is.na(df[i, ])),  # indices of the columns that are
                         which(!is.na(df[j, ])))  # not NA in both rows
    # match the rows, using only the columns identified before
    if (all(df[i, col.idx] == df[j, col.idx]))
      to.remove <- union(to.remove, j)
  }
  if (length(to.remove) > 0)
    df <- df[-to.remove, ]
  # rows above i were already compared with it, so only later rows
  # are ever removed and the index i stays valid
  i <- i + 1
}
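The loop above keeps whichever matching row comes first, so on the question's data it appears to keep c1 rather than c2. As an alternative, here is a sketch that drops a row only when another row contains all of its observed values, which keeps the most complete row of each group. The helper `covered_by` and the column names `a1` to `a4` are assumptions made for illustration, not part of the original answer.

```r
df <- as.data.frame(rbind(
  c1 = c(2, 1, 0, NA), c2 = c(2, 1, 0, 0),   c3 = c(2, 1, 1, 0),
  c4 = c(2, 2, 0, NA), c5 = c(2, 1, 0, 0),   c6 = c(2, NA, 0, NA),
  c7 = c(1, NA, 0, NA), c8 = c(2, 0, NA, NA)
))
names(df) <- c("a1", "a2", "a3", "a4")  # assumed column names

m <- as.matrix(unique(df))  # exact duplicates (here c5) are dropped first

# Hypothetical helper: TRUE when every observed value of x is also
# observed in y and equal, i.e. y carries at least the information in x.
covered_by <- function(x, y) {
  obs <- !is.na(x)
  !anyNA(y[obs]) && all(x[obs] == y[obs])
}

# A row is redundant if some other row covers it.
redundant <- vapply(seq_len(nrow(m)), function(i) {
  others <- setdiff(seq_len(nrow(m)), i)
  any(vapply(others, function(j) covered_by(m[i, ], m[j, ]), logical(1)))
}, logical(1))

result <- df[rownames(m)[!redundant], ]
result  # rows c2, c3, c4, c7 and c8 remain
```

Because exact duplicates are removed first, two distinct rows cannot cover each other, so a group never loses all of its rows. The comparison is still quadratic in the number of rows, like the loop above.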
