Question

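Suppose I have about 500 variables available, and I am trying to do variable selection for my model (the response is binary).

I plan to run some kind of correlation analysis on all the continuous variables first, and then deal with the categorical ones.

Since so many variables are involved, I can't do this by hand.

Is there a function I can use? Or maybe a package?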

Answer 1

I'm using the iris dataset that ships with R. Then

sapply(iris, is.factor)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
       FALSE        FALSE        FALSE        FALSE         TRUE

will tell you whether each of your columns is a factor. So with

iris[ , sapply(iris, is.factor), drop = FALSE]  # drop = FALSE keeps a data frame even if only one column matches

you select only the factor columns, and

iris[ ,!sapply(iris, is.factor)]

will give you the columns that are not factors. You can also use is.numeric, is.character, and other predicates of the same kind.
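Putting the two selections together, a minimal sketch of a helper that returns both pieces at once might look like this. The name `split_by_type` is made up for illustration, and treating character columns as categorical is an assumption that may not suit every dataset.

```r
# Sketch: split a data frame into categorical and continuous parts.
# `split_by_type` is a hypothetical helper, not part of base R.
split_by_type <- function(df) {
  # vapply returns a logical vector with one entry per column
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  is_num <- vapply(df, is.numeric, logical(1))
  list(
    categorical = df[, is_cat, drop = FALSE],  # drop = FALSE keeps a data frame
    continuous  = df[, is_num, drop = FALSE]
  )
}

parts <- split_by_type(iris)
names(parts$categorical)
names(parts$continuous)
```

Note that integer-coded categorical variables (for example 0/1 flags) would land in the continuous part with this approach, so they need to be converted to factors first or handled with a different rule.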

Answer 2

You can use str(df) to see which columns are factors and which are not (df is your data frame). For example, for the iris data in R:

str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Alternatively, you can use lapply(iris, class):

$Sepal.Length
[1] "numeric"

$Sepal.Width
[1] "numeric"

$Petal.Length
[1] "numeric"

$Petal.Width
[1] "numeric"

$Species
[1] "factor"
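With 500 columns, reading this output by eye is not practical, so one option is to group the column names by class programmatically. The following is only a sketch; `col_class` and `by_class` are illustrative names, and only the first element of each column's class vector is used.

```r
# Group column names by the first class of each column.
col_class <- vapply(iris, function(x) class(x)[1], character(1))
by_class  <- split(names(iris), col_class)

by_class$numeric   # names of the numeric columns
by_class$factor    # names of the factor columns

# Subset the data frame with the grouped names
continuous  <- iris[by_class$numeric]
categorical <- iris[by_class$factor]
```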

Answer 3

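Create a function that returns a logical value indicating whether the number of unique values is less than some fraction of the total number of values; I chose 5%: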

 discreteL <- function(x) length(unique(x)) < 0.05*length(x)

Now sapply it over the data.frame (negated, to get the continuous variables):

 > str( iris[ , !sapply(iris, discreteL)] )
'data.frame':   150 obs. of  4 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

You could instead pick a specific number of unique values, say 15, as the criterion.
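A fixed-count version of the same idea could be sketched as follows. The names `discreteN` and `max_levels` are hypothetical, and the cutoff of 15 is just the example value mentioned above.

```r
# Variant of discreteL with an absolute cutoff instead of a fraction.
# `discreteN` and `max_levels` are illustrative names, not an existing API.
discreteN <- function(x, max_levels = 15) length(unique(x)) < max_levels

is_disc <- vapply(iris, discreteN, logical(1))

discrete_part   <- iris[, is_disc, drop = FALSE]   # treated as categorical
continuous_part <- iris[, !is_disc, drop = FALSE]  # treated as continuous

names(discrete_part)
names(continuous_part)
```

Either rule is a heuristic: a numeric column with few distinct values gets classed as discrete, and a high-cardinality identifier column gets classed as continuous, so the result is worth a quick check.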

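I should make clear that statistical theory suggests this procedure is hazardous for the purpose outlined. Simply picking the variables most strongly correlated with a binary response is not a sound strategy, and many studies have demonstrated better approaches to variable selection. So my answer is really only about how to do the separation, not an endorsement of the overall plan you vaguely described.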

Is there an easy way to split categorical and continuous variables into two datasets in R?
