Question

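I want to remove duplicate columns from a data frame, ignoring NAs. All columns of the data frame are numeric vectors of equal length. Here is an example: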

> df <- data.frame(a = c(1,2,NA,4,4), b= c(5,6,7,8,8), c= c(5,6,7,8,8), d = c(9,8,7,6,NA), e = c(NA,8,7,6,6))
> df
   a b c  d  e
1  1 5 5  9 NA
2  2 6 6  8  8
3 NA 7 7  7  7
4  4 8 8  6  6
5  4 8 8 NA  6

I would like to get this data frame:

> df_clear
   a b d
1  1 5 9
2  2 6 8
3 NA 7 7
4  4 8 6

I tried unique, but without success. Only the duplicates without NAs were removed.

> df_clear <- 
+   df %>%
+     unique %>%
+     t %>%
+     as.matrix %>%
+     unique %>%
+     t %>%
+     as.data.frame
> df_clear
   a b  d  e
1  1 5  9 NA
2  2 6  8  8
3 NA 7  7  7
4  4 8  6  6
5  4 8 NA  6

dplyr's distinct did not help either. With this approach I even lost the column names.

> df_clear <- 
+   df %>%
+     distinct %>%
+     t %>%
+     as.data.frame %>%
+     distinct %>%
+     t %>%
+     as.data.frame
> df_clear
   V1 V2 V3 V4
V1  1  5  9 NA
V2  2  6  8  8
V3 NA  7  7  7
V4  4  8  6  6
V5  4  8 NA  6
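
For reference, the behaviour can be reproduced in base R alone. duplicated() treats NA as an ordinary value that matches only another NA, so two columns that differ only where one of them is NA are not reported as duplicates. A minimal sketch, using the same df as above:

```r
df <- data.frame(a = c(1,2,NA,4,4), b = c(5,6,7,8,8), c = c(5,6,7,8,8),
                 d = c(9,8,7,6,NA), e = c(NA,8,7,6,6))

# Compare whole columns: only c is an exact copy of an earlier column (b).
dup_cols <- duplicated(as.list(df))
dup_cols
# [1] FALSE FALSE  TRUE FALSE FALSE

# d and e agree wherever both are non-NA, but they are not identical,
# so e survives an exact-match deduplication.
names(df)[!dup_cols]
# [1] "a" "b" "d" "e"
```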

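I wonder whether there is a function that does this job, or whether I should write it myself. The actual data frame has more than 1000 rows and columns.

Thank you very much for your help!

Edit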

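After reading the comments, I realized that my original question was underspecified. Here are some clarifications. For simplicity, I focus on rows only:
- Among duplicates, the row that remains should contain as few NAs as possible. For example, df1 should become df1_clear: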

> df1
   a b  d e
1  1 4  7 1
2  3 6 NA 3
3  2 5  8 2
4 NA 6  9 3
> df1_clear
  a b d e
1 1 4 7 1
2 2 5 8 2
3 3 6 9 3

- Duplicates are not necessarily consecutive.
- There may be more than one NA in a row.

Answer 1

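The following is a bit convoluted, but it does the job.
It calls the function f twice inside fun: first to remove duplicate rows from the original data frame, and then to remove them from its transpose.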

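fun <- function(DF){
  # f() drops rows that are duplicated once NAs have been filled in
  f <- function(DF1){
    df1 <- DF1
    # work on a copy: fill each NA with the last non-NA value above it
    df1[] <- lapply(df1, function(x){
      y <- zoo::na.locf(x)
      # na.locf() drops leading NAs; in that case fill from below instead
      if(length(y) < length(x)) y <- zoo::na.locf(x, fromLast = TRUE)
      y
    })
    # detect duplicates on the filled copy, but return the original rows
    DF1[!duplicated(df1), ]
  }
  df2 <- f(DF)                  # remove duplicated rows
  df2 <- as.data.frame(t(df2))  # transpose so that columns become rows
  df2 <- t(f(df2))              # remove duplicated columns, transpose back
  as.data.frame(df2)
}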

fun(df)
#   a b d
#1  1 5 9
#2  2 6 8
#3 NA 7 7
#4  4 8 6

Based on the above, the same can be done by calling f() in a dplyr pipe. The function f() below is just a copy and paste of the one above.

library(dplyr)


f <- function(DF1){
  df1 <- DF1
  df1[] <- lapply(df1, function(x){
    y <- zoo::na.locf(x)
    if(length(y) < length(x)) y <- zoo::na.locf(x, fromLast = TRUE)
    y
  })
  DF1[!duplicated(df1), ]
}


df %>%
  f() %>% t() %>% as.data.frame() %>%
  f() %>% t() %>% as.data.frame()

#   a b d
#1  1 5 9
#2  2 6 8
#3 NA 7 7
#4  4 8 6
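
The LOCF-based f() above fills an NA from a neighbouring row, so an NA-containing duplicate is only detected when that neighbour happens to hold the matching value. As an alternative, here is a minimal base-R sketch that treats NA as a wildcard, compares rows directly, and fills the NAs of the kept row from its duplicates, which is what the edit in the question asks for. The names compatible, merge_na_rows and merge_na_both are hypothetical helpers, not functions from any package. The matching is greedy, and NA-wildcard matching is not transitive, so the result can depend on row order. It also assumes that all columns are numeric, as stated in the question, and it has not been tuned for speed.

```r
# Hypothetical helper: TRUE when x and y agree wherever both are non-NA.
# Note: two rows with no overlapping non-NA positions count as compatible.
compatible <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  all(x[both] == y[both])
}

# Hypothetical helper: greedily merge each row into the first earlier row
# it is compatible with, filling that row's NAs from the later row.
merge_na_rows <- function(DF) {
  m <- as.matrix(DF)
  keep <- integer(0)                     # indices of the rows kept so far
  for (i in seq_len(nrow(m))) {
    hit <- FALSE
    for (k in keep) {
      if (compatible(m[k, ], m[i, ])) {
        fill <- is.na(m[k, ])            # positions still missing in row k
        m[k, fill] <- m[i, fill]
        hit <- TRUE
        break
      }
    }
    if (!hit) keep <- c(keep, i)
  }
  as.data.frame(m[keep, , drop = FALSE])
}

# Hypothetical wrapper: apply the row merge, then the same merge on columns.
merge_na_both <- function(DF) {
  rows_done <- merge_na_rows(DF)
  cols_done <- merge_na_rows(as.data.frame(t(rows_done)))
  as.data.frame(t(cols_done))
}

df1 <- data.frame(a = c(1,3,2,NA), b = c(4,6,5,6),
                  d = c(7,NA,8,9), e = c(1,3,2,3))
merge_na_rows(df1)   # three rows: (1,4,7,1), (3,6,9,3), (2,5,8,2)

df <- data.frame(a = c(1,2,NA,4,4), b = c(5,6,7,8,8), c = c(5,6,7,8,8),
                 d = c(9,8,7,6,NA), e = c(NA,8,7,6,6))
merge_na_both(df)    # four rows, columns a, b and d
```

Rows keep the order of their first occurrence, so the merged row (3, 6, 9, 3) comes second here rather than last as in df1_clear.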
