Question

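How can I remove duplicate columns with data.table, keeping only one of each?

I know there are other questions about duplicate columns, but they only check for duplicate column names, not duplicate content.

What I want is to find columns that have different names but identical content.

Regards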

Answer 1

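This is a common task in feature engineering. The following code block is one that the Kaggle community and I developed for this purpose: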

##### Removing identical features
features_pair <- combn(names(train), 2, simplify = FALSE) # list all column pairs
toRemove <- c() # init a vector to store duplicates
for (pair in features_pair) {
  # put the pair being tested into temp objects
  f1 <- pair[1]
  f2 <- pair[2]

  # skip pairs where either column is already marked for removal
  if (!(f1 %in% toRemove) && !(f2 %in% toRemove)) {
    # identical() is NA-safe, unlike all(x == y), but also requires matching types
    if (identical(train[[f1]], train[[f2]])) { # test for duplicates
      cat(f1, "and", f2, "are equal.\n")
      toRemove <- c(toRemove, f2) # build the list of duplicates
    }
  }
}

You can then drop whichever copy of each duplicate you prefer. By default, I drop the version stored in the temporary object f2, as follows:

train <- train[, setdiff(names(train), toRemove), with = FALSE] # keep every column not flagged as a duplicate
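
For comparison, here is a minimal sketch of a more compact variant that avoids the explicit pairwise loop. It is not part of the original answer: it relies on base R's duplicated() applied to the list of columns, and the toy table and the names dup_cols and deduped are hypothetical, chosen only for illustration.

```r
library(data.table)

# Hypothetical toy data: `b` repeats `a` and `d` repeats `c` under different names.
train <- data.table(
  a = c(1L, 2L, 3L),
  b = c(1L, 2L, 3L),
  c = c("x", NA, "z"),
  d = c("x", NA, "z"),
  e = c(3L, 2L, 1L)
)

# duplicated() on a list flags every element that matches an earlier one,
# so the first column of each group of identical columns is kept.
dup_cols <- names(train)[duplicated(as.list(train))]

# Keep only the columns that were not flagged.
deduped <- train[, setdiff(names(train), dup_cols), with = FALSE]

print(dup_cols)
print(names(deduped))
```

On this toy table, b and d should be flagged and a, c and e kept. As far as I know the comparison is exact, so an integer column and a double column holding the same numbers would not be treated as duplicates; convert types first if that matters for your data.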
