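I have a large, inconsistent dataset with a bunch of highly correlated variables. I want to reduce the number of variables whose pairwise correlation is above a threshold of 0.7. However, I want the selected/remaining variables to be the ones that correlate most strongly with a predefined variable. For example, take the following correlation matrix, with x as the selected variable: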
    x     y     z     m
x   1     0.1   0.2   0.3
y   0.1   1     0.9   0.11
z   0.2   0.9   1     0.6
m   0.3   0.11  0.6   1
It should be reduced to:
    x     z     m
x   1     0.2   0.3
z   0.2   1     0.6
m   0.3   0.6   1
because z and y exceed the 0.7 threshold and z is more strongly correlated with x than y is.
Answer 0: (score: 0)
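If your chosen column is x, find the values that fall below some cutoff:
# 0.4 is an example cutoff
low <- which(df[, 1] < 0.4)
and then select the remaining rows/columns with:
df[-low, -low]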
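As a runnable illustration of this idea, here is a minimal sketch that applies it to the correlation matrix from the question. The cutoff of 0.15 is an assumed value, chosen only so that y is dropped, and the guard for an empty `low` is an addition that is not part of the answer. Note that this approach filters purely on correlation with x and does not check the 0.7 pairwise threshold.

```r
# Correlation matrix from the question (symmetric, so fill order does not matter)
vars <- c("x", "y", "z", "m")
df <- matrix(c(1,   0.1,  0.2, 0.3,
               0.1, 1,    0.9, 0.11,
               0.2, 0.9,  1,   0.6,
               0.3, 0.11, 0.6, 1),
             ncol = 4, dimnames = list(vars, vars))

# Assumed cutoff, picked only for illustration
cutoff <- 0.15

# Positions whose correlation with x (column 1) is below the cutoff
low <- which(df[, 1] < cutoff)

# Guard (an addition): negative indexing with an empty vector selects nothing,
# so return the matrix unchanged when no variable falls below the cutoff
reduced <- if (length(low) > 0) df[-low, -low, drop = FALSE] else df

print(reduced)
```

With this cutoff only y falls below it, so x, z and m remain.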
Answer 1: (score: 0)
Kludgy, but it seems to work.
# Define matrix
mat <- matrix(c(1,0.1,0.2,0.3,0.1,1,0.9,0.11,0.2,0.9,1,0.6,0.3,0.11,0.60,1), ncol = 4)
# Add names
row.names(mat) <- colnames(mat) <- c("x", "y", "z", "m")
# Specify threshold
threshold <- 0.7
# Selected variable
i <- "x"
# Get column number of selected variable
i <- which(colnames(mat) == i)
# Find off-diagonal elements above the threshold (in absolute value)
above.threshold <- matrix(which(abs(mat) > threshold & row(mat) != col(mat), arr.ind = TRUE), ncol = 2)
# Remove duplicates
above.threshold <- above.threshold[above.threshold[,1]>above.threshold[,2],,drop = FALSE]
# Variables to remove: from each pair, the one less strongly correlated with the selected variable
var.rm <- apply(above.threshold, MARGIN = 1, function(pair) names(which.min(abs(mat[i, pair]))))
# New matrix (drop = FALSE keeps the matrix structure even if only one variable remains)
mat <- mat[!(rownames(mat) %in% var.rm), !(colnames(mat) %in% var.rm), drop = FALSE]
This yields:
    x   z   m
x 1.0 0.2 0.3
z 0.2 1.0 0.6
m 0.3 0.6 1.0
If the threshold is set to 0.5 instead, it yields:
    x   m
x 1.0 0.3
m 0.3 1.0
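For reuse, the same steps could be wrapped in a function. The following is only a sketch: `prune_correlated` and its argument names are hypothetical and not part of the answer above. It assumes the input is a square correlation matrix with matching row and column names, and it compares absolute correlations with the target variable.

```r
# Hypothetical helper (illustrative name, not from the answer above)
prune_correlated <- function(mat, target, threshold = 0.7) {
  stopifnot(is.matrix(mat), target %in% colnames(mat))
  # Each offending pair once: lower-triangle cells above the threshold
  pairs <- which(abs(mat) > threshold & lower.tri(mat), arr.ind = TRUE)
  drop_vars <- character(0)
  for (k in seq_len(nrow(pairs))) {
    candidates <- colnames(mat)[pairs[k, ]]
    # Within the pair, drop the variable less correlated with the target
    strength <- abs(mat[target, candidates])
    drop_vars <- c(drop_vars, candidates[which.min(strength)])
  }
  keep <- !(colnames(mat) %in% drop_vars)
  mat[keep, keep, drop = FALSE]
}

# Correlation matrix from the question
vars <- c("x", "y", "z", "m")
mat <- matrix(c(1,   0.1,  0.2, 0.3,
                0.1, 1,    0.9, 0.11,
                0.2, 0.9,  1,   0.6,
                0.3, 0.11, 0.6, 1),
              ncol = 4, dimnames = list(vars, vars))

res_07 <- prune_correlated(mat, "x", threshold = 0.7)  # keeps x, z, m
res_05 <- prune_correlated(mat, "x", threshold = 0.5)  # keeps x, m
print(res_07)
print(res_05)
```

With real data, the matrix would presumably come from something like cor(data), where data is a numeric data frame. Note that every pair is evaluated against the original matrix, so a variable can be dropped because of a pair whose other member is also dropped, as happens with z at threshold 0.5.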