Question

I want to do something very simple: loop over a data frame and compute the maximum correlation coefficient between any pair of columns.

I want to do this in R.

My data frame was read in with fread().

Here is my code. I declared max=-1, a=0 and b=0 at the start.

for(i in 2:1933)
{
    for(j in i+1:1934)
    {
        if(is.numeric(data[[i]]) && is.numeric(data[[j]]))
        {
            if(isTRUE(sd(data[[i]], na.rm=TRUE) !=0) && isTRUE(sd(data[[j]], na.rm=TRUE) !=0))
            {
                c = cor(data[[i]], data[[j]], use="pairwise.complete.obs")
                if(isTRUE(c>=max))
                {
                    max = c
                    a = i
                    b = j
                }
            }
        }
    }
}

The error I get is

Error in .subset2(x, i, exact = exact) : subscript out of bounds

I do have 1934 columns, and I can't figure out the problem. Am I missing something fairly obvious?

Answer 1

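There is a much simpler way: cor(...) takes a matrix (nr x nc) and returns a new matrix (nc x nc) holding the correlation coefficient of each column against every other column. The rest is straightforward: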

library(data.table)   # to simulate fread(...)
set.seed(1)           # for reproducible example
dt <- as.data.table(matrix(1:50+rnorm(50,sd=5), ncol=5)) # create reproducible example


result <- cor(dt, use="pairwise.complete.obs")       # matrix of correlation coefficients
diag(result) <- NA                                   # set diagonals to NA
max(result, na.rm=TRUE)                              # maximum correlation coefficient
# [1] 0.7165304
which(result==max(result, na.rm=TRUE), arr.ind=TRUE) # location of max
#    row col
# V3   3   2
# V2   2   3

There are two locations because the correlation between columns 2 and 3 is the same as the correlation between columns 3 and 2.
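
As for the error in the question itself, a likely cause is operator precedence: in R, `:` binds more tightly than `+`, so `i+1:1934` is read as `i + (1:1934)` and `j` runs past the last column. A minimal sketch with small numbers, where `n` is only a stand-in for the 1934 columns:

```r
n <- 5                  # stand-in for the 1934 columns in the question
i <- 2

bad  <- i + 1:n         # parsed as i + (1:n)
good <- (i + 1):n       # what the inner loop was meant to iterate over

print(bad)              # 3 4 5 6 7, so 6 and 7 exceed n
print(good)             # 3 4 5
```

With that change the inner loop would read `for (j in (i + 1):1934)`. Because the outer loop already stops at 1933, `(i + 1):1934` never counts downward.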

Answer 2

Try this:

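drop_list <- NULL

# Assume the first column is an ID column
feature.names <- names(data)[2:length(names(data))]

# Collect columns that are non-numeric or have zero (or undefined) variance.
# is.numeric() is checked first so sd() is never called on a non-numeric column.
for (f in feature.names) {
  if (!is.numeric(data[[f]]) || !isTRUE(sd(data[[f]], na.rm = TRUE) > 0)) {
    drop_list <- c(drop_list, f)
  }
}

# Keep the remaining feature columns (the assumed ID column is left out too).
# as.data.frame() makes the column indexing work for a data.table from fread().
data <- as.data.frame(data)[, setdiff(feature.names, drop_list), drop = FALSE]

corr_data <- cor(data, use = "pairwise.complete.obs")

# Remove the correlation of each variable with itself
diag(corr_data) <- -99

# Then locate the largest entry with which(), as Howard suggested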
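
One possible way to finish that last step, sketched on a small made-up matrix rather than the real data: mask the diagonal and the lower triangle so that each pair appears only once, then ask which() for the position of the maximum. The toy matrix `m` and its column names are assumptions for illustration only.

```r
# Toy data (hypothetical): x2 tracks x1 closely, x3 is a shuffled sequence
m <- cbind(x1 = 1:10,
           x2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 11),
           x3 = c(5, 3, 8, 1, 9, 2, 7, 4, 10, 6))

corr_m <- cor(m, use = "pairwise.complete.obs")

# Keep only the upper triangle so each pair of columns is counted once
corr_m[!upper.tri(corr_m)] <- NA

# Position of the largest remaining coefficient
best <- which(corr_m == max(corr_m, na.rm = TRUE), arr.ind = TRUE)
a <- rownames(corr_m)[best[1, "row"]]
b <- colnames(corr_m)[best[1, "col"]]
print(c(a, b))
```

Setting the unused cells to NA instead of a sentinel such as -99 avoids having to pick a value that could never be a real coefficient.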

Cheers

R: subscript out of bounds error
