Question

Here is what I have done so far:

testdf=olddf;

for (i in colnames(testdf))
  if (length(unique(testdf[,i]))==1){
    testdf[,-(i)]
  }

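I cannot get the above code to work. Can someone advise what I am doing wrong? Essentially, I am trying to loop over the data frame so that each column is checked for whether it contains only a single unique value. For example, if the number of unique values in a column equals 1, that column must be deleted.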

Thanks

Answer 1

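You cannot use the - operator to index by character column names. One way around this is to use which() to turn the name into a numeric position, which can then be negated. This should work for your case.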

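# colnames(testdf) is evaluated once, before any column is removed
for (i in colnames(testdf)) {
  # A column with a single unique value is constant
  if (length(unique(testdf[,i])) == 1) {
    # which() converts the name to a position that can be negated;
    # drop = FALSE keeps a data frame even if only one column remains
    testdf <- testdf[, -which(colnames(testdf) == i), drop = FALSE]
  }
}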
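As a small illustration of why the negative character index fails, and of one alternative that stays with names, here is a minimal sketch. The data frame and its column names (id, const, val) are made up for the example; setdiff() is just one possible way to drop a column by name.

```r
# Hypothetical example data: "const" holds a single repeated value
testdf <- data.frame(id = 1:3, const = "same", val = c(2.5, 2.5, 7))

# Negating a character string is an error, which is what the question ran into
res <- tryCatch(testdf[, -"const"], error = function(e) conditionMessage(e))

# Alternative: keep every name except the one to drop
for (i in colnames(testdf)) {
  if (length(unique(testdf[[i]])) == 1) {
    testdf <- testdf[, setdiff(colnames(testdf), i), drop = FALSE]
  }
}

colnames(testdf)  # "id" "val"
```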

Answer 2

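In R, it is best to avoid for loops when you can. That is not to say they should be avoided altogether, but vectorized operations tend to be faster and more concise. In this case, sapply is your friend.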

df = data.frame(v1=sample(letters, 10), v2=sample(1:100, 10), v3=4, v4=sample(LETTERS, 10))
x = sapply(names(df), function(x) length(unique(df[[x]])) > 1)
df[, x]
#    v1 v2 v4
# 1   e 82  P
# 2   i 45  T
# 3   z 76  W
# 4   u 27  Y
# 5   n  2  Q
# 6   x 72  B
# 7   o 61  O
# 8   d 47  R
# 9   s 42  G
# 10  k 66  S
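For comparison, the same idea can be sketched with vapply, which checks that every result is a single logical, or with base R's Filter, which applies the predicate and subsets in one step. This is only an alternative phrasing of the approach above; the seed and the names keep, by_mask and by_filter are assumptions added for the example.

```r
set.seed(1)  # assumption: fixed seed only so the example is reproducible
df <- data.frame(v1 = sample(letters, 10), v2 = sample(1:100, 10),
                 v3 = 4, v4 = sample(LETTERS, 10))

# TRUE for columns with more than one distinct value
keep <- vapply(df, function(col) length(unique(col)) > 1, logical(1))
by_mask <- df[, keep, drop = FALSE]

# Filter treats the data frame as a list of columns and keeps those
# for which the predicate returns TRUE
by_filter <- Filter(function(col) length(unique(col)) > 1, df)

names(by_filter)  # "v1" "v2" "v4"
```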

Update (based on the conversation in the comments)

# This line of code identifies the columns that are both numeric
# and have values where max != min
good_cols = sapply(testdf, function(x) {
    is.numeric(x) && ((max(x) - min(x)) > 0)
})

# Subset the original data to just the good columns for modeling
model_df = testdf[, good_cols]

# Run the regression
lm(y ~ ., data = model_df)
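To see the filter end to end, here is a minimal sketch on made-up data. The columns y, x1, const and grp are hypothetical, and the na.rm = TRUE and isTRUE() guards are an addition, not part of the snippet above; they are there so that a column containing NA does not produce an NA in the logical mask.

```r
# Hypothetical data: a response, a varying predictor, a constant, and a
# character column
testdf <- data.frame(y = c(1.2, 2.3, 3.1, 4.8), x1 = c(1, 2, 3, 4),
                     const = 5, grp = c("a", "b", "a", "b"))

# Keep numeric columns whose maximum exceeds their minimum
good_cols <- vapply(testdf, function(x) {
  is.numeric(x) && isTRUE(max(x, na.rm = TRUE) > min(x, na.rm = TRUE))
}, logical(1))

model_df <- testdf[, good_cols, drop = FALSE]

# y survives the filter because it is numeric and not constant
fit <- lm(y ~ ., data = model_df)
names(coef(fit))  # "(Intercept)" "x1"
```

Note that this filter also drops non-numeric columns such as grp, so any factor predictors would have to be handled separately.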

Answer 3

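At a high level, to make this work you need to reassign the testdf variable. Right now you only select from it and discard the result. That is, replace testdf[, -(i)] with testdf <- testdf[, -(i)].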

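Additionally, you loop over names, and you cannot use the - operator inside the brackets to deselect a specific name. You can do so with numeric indices, but if you use indices and reassign testdf inside the for loop, you will delete columns as the loop goes and may end up referencing an index that no longer exists in testdf.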
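One way to sidestep the shifting-index problem is to collect the positions first and drop them in a single step after the loop. The sketch below uses made-up data, and the name drop_idx is an assumption for illustration.

```r
# Hypothetical data: columns a and c are constant
testdf <- data.frame(a = 1, b = 1:3, c = "x", d = c(2, 2, 5))

# First pass: record positions of constant columns without modifying testdf
drop_idx <- integer(0)
for (j in seq_along(testdf)) {
  if (length(unique(testdf[[j]])) == 1) {
    drop_idx <- c(drop_idx, j)
  }
}

# Second step: drop them all at once. The guard matters, because
# testdf[, -integer(0)] would select zero columns rather than all of them.
if (length(drop_idx) > 0) {
  testdf <- testdf[, -drop_idx, drop = FALSE]
}

names(testdf)  # "b" "d"
```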

I would suggest using the dplyr selection helpers (see ?select_if), which handle this for you. See the example below:

library(dplyr)
temp <- mtcars %>% filter(cyl == 6)
temp %>% select_if(~length(unique(.)) > 1)
#    mpg  disp  hp drat    wt  qsec vs am gear carb
# 1 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4
# 2 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4
# 3 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
# 4 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1
# 5 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4
# 6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4
# 7 19.7 145.0 175 3.62 2.770 15.50  0  1    5    6
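In dplyr 1.0.0 and later, the scoped verbs such as select_if are superseded, and the same selection is usually written with select() and where(). The following is a sketch of that equivalent, assuming a dplyr version that provides where(); the name result is made up for the example.

```r
library(dplyr)

temp <- mtcars %>% filter(cyl == 6)

# where() takes a predicate (here a purrr-style formula) applied to each column;
# n_distinct() counts the distinct values in the column
result <- temp %>% select(where(~ n_distinct(.x) > 1))

# cyl is constant after the filter, so it is the only column dropped
names(result)
```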

How do I remove columns with only 1 unique value in a loop?

3 answers:
