Reducing highly correlated variables using a selection criterion

Time: 2017-05-02 12:00:11

Tags: r correlation

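I have a large, inconsistent dataset containing a bunch of highly correlated variables. I want to reduce the number of variables whose pairwise correlation is above a threshold of 0.7. However, of each such pair, the variable that remains should be the one with the strongest correlation to a predefined variable. For example, take the following correlation matrix, with x as the selection variable: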

     x      y      z      m
 x   1      0.1    0.2    0.3
 y   0.1    1      0.9    0.11
 z   0.2    0.9    1      0.6
 m   0.3    0.11   0.6    1

It should be reduced to:

     x      z      m
 x   1      0.2    0.3
 z   0.2    1      0.6
 m   0.3    0.6    1

because the correlation between z and y exceeds the 0.7 threshold, and z is more strongly correlated with x than y is.
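To make the rule concrete, here is a minimal sketch that only inspects the example matrix: it lists each pair above the threshold and shows which member is less correlated with x. The names `mat`, `threshold`, `target`, and `pair_table` are illustrative assumptions, not part of the original question.

```r
# Example correlation matrix from the question
vars <- c("x", "y", "z", "m")
mat <- matrix(c(1,   0.1,  0.2, 0.3,
                0.1, 1,    0.9, 0.11,
                0.2, 0.9,  1,   0.6,
                0.3, 0.11, 0.6, 1),
              ncol = 4, dimnames = list(vars, vars))

threshold <- 0.7   # pairwise correlation cutoff
target <- "x"      # predefined selection variable

# Upper triangle only, so each offending pair is listed once
pairs <- which(abs(mat) > threshold & upper.tri(mat), arr.ind = TRUE)

# For each pair, compare both members' correlation with the target
pair_table <- data.frame(
  var1 = rownames(mat)[pairs[, "row"]],
  var2 = colnames(mat)[pairs[, "col"]],
  stringsAsFactors = FALSE
)
pair_table$cor1 <- abs(mat[target, pair_table$var1])
pair_table$cor2 <- abs(mat[target, pair_table$var2])

# The member with the weaker correlation to the target is the one to drop
pair_table$drop <- ifelse(pair_table$cor1 < pair_table$cor2,
                          pair_table$var1, pair_table$var2)
print(pair_table)
```

For the example, the only pair above 0.7 is (y, z), and y is the candidate to drop.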

2 answers:

Answer 0: (score: 0)

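If your selected column is x (the first column of the correlation matrix, here called test), find the entries in that column that fall below some cutoff (0.4 is only an example value):

low <- which(test[, 1] < 0.4)

and then use that index to select the remaining rows/columns:

test[-low, -low]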

Answer 1: (score: 0)

Kludgy, but it seems to work.

# Define matrix
mat <- matrix(c(1,0.1,0.2,0.3,0.1,1,0.9,0.11,0.2,0.9,1,0.6,0.3,0.11,0.60,1), ncol = 4)

# Add names
row.names(mat) <- colnames(mat) <- c("x", "y", "z", "m")

# Specify threshold
threshold <- 0.7

# Selected variable
i <- "x"

# Get column number of selected variable
i <- which(colnames(mat) == i)

# Find element above threshold
above.threshold <- matrix(which(abs(mat) > threshold & mat != 1, arr.ind = TRUE), ncol = 2)

# Remove duplicates
above.threshold <- above.threshold[above.threshold[,1]>above.threshold[,2],,drop = FALSE] 

# Variables to remove
var.rm <- apply(above.threshold, MARGIN = 1, function(foo) names(which.min(abs(mat[i, foo]))))

# New matrix
mat <- mat[!(rownames(mat) %in% var.rm), !(colnames(mat) %in% var.rm), drop = FALSE]

That yields:

    x   z   m
x 1.0 0.2 0.3
z 0.2 1.0 0.6
m 0.3 0.6 1.0

With the threshold set to 0.5, it yields:

    x   m
x 1.0 0.3
m 0.3 1.0
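For reuse, the steps of this answer could be wrapped in a function along the following lines. This is a sketch, not the answerer's code: the function name `reduce_correlated` and its arguments are hypothetical. It uses `lower.tri()` in place of the `mat != 1` test to skip the diagonal, compares absolute correlations with the target, and returns the matrix unchanged when no pair exceeds the threshold. It assumes a symmetric correlation matrix with identical row and column names.

```r
# Hypothetical helper wrapping the steps of the answer above
reduce_correlated <- function(mat, target, threshold = 0.7) {
  stopifnot(is.matrix(mat), target %in% colnames(mat))

  # Each offending pair once: the lower triangle excludes the diagonal
  pairs <- which(abs(mat) > threshold & lower.tri(mat), arr.ind = TRUE)
  if (nrow(pairs) == 0) return(mat)  # nothing to remove

  # From each pair, drop the member less correlated with the target
  var.rm <- apply(pairs, MARGIN = 1, function(idx) {
    names(which.min(abs(mat[target, idx])))
  })

  keep <- !(colnames(mat) %in% var.rm)
  mat[keep, keep, drop = FALSE]
}

# Example matrix from the question
vars <- c("x", "y", "z", "m")
mat <- matrix(c(1,   0.1,  0.2, 0.3,
                0.1, 1,    0.9, 0.11,
                0.2, 0.9,  1,   0.6,
                0.3, 0.11, 0.6, 1),
              ncol = 4, dimnames = list(vars, vars))

res_07 <- reduce_correlated(mat, "x", threshold = 0.7)  # keeps x, z, m
res_05 <- reduce_correlated(mat, "x", threshold = 0.5)  # keeps x, m
```

Like the original code, this evaluates all pairs in a single pass, so a variable can be removed because of a partner that is itself removed (y at threshold 0.5).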