Converting the R function caret::findCorrelation to Python 3 with pandas, using vectorization

Time: 2017-01-20 10:27:12

Tags: python r pandas r-caret

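The R function caret::findCorrelation searches a correlation matrix and returns a vector of integers corresponding to the variables that, if removed, would reduce the pairwise correlations among the remaining variables. Here is the R code for this function: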

function (x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 
100) 
 {
  if (names & is.null(colnames(x))) 
    stop("'x' must have column names when `names = TRUE`")
  out <- if (exact) 
    findCorrelation_exact(x = x, cutoff = cutoff, verbose = verbose)
  else findCorrelation_fast(x = x, cutoff = cutoff, verbose = verbose)
  out
  if (names) 
    out <- colnames(x)[out]
  out
}

The function findCorrelation_fast is the one I am interested in (optional arguments removed):

findCorrelation_fast <- function(x, cutoff = .90)
{
 if(any(!complete.cases(x)))
 stop("The correlation matrix has some missing values.")
 averageCorr <- colMeans(abs(x))
 averageCorr <- as.numeric(as.factor(averageCorr))
 x[lower.tri(x, diag = TRUE)] <- NA
 combsAboveCutoff <- which(abs(x) > cutoff)

 colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
 rowsToCheck <- combsAboveCutoff %% nrow(x)

 colsToDiscard <- averageCorr[colsToCheck] > averageCorr[rowsToCheck]
 rowsToDiscard <- !colsToDiscard

 deletecol <- c(colsToCheck[colsToDiscard], rowsToCheck[rowsToDiscard])
 deletecol <- unique(deletecol)
 deletecol
}

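I am writing a function that emulates the intent of this function in Python 3 with the help of pandas. My implementation contains a nested for loop, which I understand is far from the most efficient way to get the desired result. The original R function does the job without any loops.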

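My two questions are:

  1. Given my implementation below, is there a Pythonic way to replace the nested for loop with a vectorized implementation?
  2. Related to (1), the R function findCorrelation_fast uses the line averageCorr <- as.numeric(as.factor(averageCorr)). This construct looks very foreign to me, yet it seems critical to the success of the loop-free R implementation. Can anyone shed light on what this line does? My gut tells me it is quite clever and exploits behavior unique to R.

My Python implementation and an example of its usage: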

    import numpy as np
    import pandas as pd
    
    # calculate pair-wise correlations
    
    def findCorrelated(corrmat, cutoff=0.8):
        """Search a correlation matrix for variables whose removal would reduce pair-wise correlations.

        args:
            corrmat: a correlation matrix (square pandas DataFrame)
            cutoff: pairwise absolute correlation cutoff
        returns:
            positions of the variables to be removed
        """
        if len(corrmat) != len(corrmat.columns):
            raise ValueError('Correlation matrix is not square')
        averageCorr = corrmat.abs().mean(axis=1)

        # set lower triangle and diagonal of correlation matrix to NaN
        # (k=1 keeps only the strict upper triangle)
        corrmat = corrmat.where(np.triu(np.ones(corrmat.shape, dtype=bool), k=1))

        # where a pairwise absolute correlation is greater than the cutoff value,
        # check whether mean abs. corr of a or b is greater and cut it
        to_delete = list()
        for col in range(0, len(corrmat.columns)):
            for row in range(0, len(corrmat)):
                if abs(corrmat.iloc[row, col]) > cutoff:
                    if averageCorr.iloc[row] > averageCorr.iloc[col]:
                        to_delete.append(row)
                    else:
                        to_delete.append(col)

        to_delete = list(set(to_delete))

        return to_delete
    
    # generate some data
    df = pd.DataFrame(np.random.randn(50,25))
    
    # demonstrate usage of function    
    removeCols = findCorrelated(df.corr(), cutoff = 0.01) #set v.low cutoff as data is uncorrelated
    print('Columns to be removed:')
    print(removeCols)
    uncorrelated = df.drop(df.columns[removeCols], axis=1)  # positions refer to columns, not rows
    print('Uncorrelated variables:')
    print(uncorrelated)
    

1 Answer:

Answer 0 (score: 0)

This was asked 4 years ago, but I was looking for a Python implementation myself, and I come from R.

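For 2: as.numeric(as.factor(x)) gives the rank order of each value of x.

as.factor() factorizes the correlation values and assigns them levels in ascending numeric order, essentially turning the numbers into character labels while preserving their relative order.

as.numeric() then converts the ordered levels/characters back to numbers, so that the lowest value of x is now 1 and the highest is length(unique(x)). Ties receive the same integer value, because as.factor() assigns them to the same level/rank.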
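The ranking described above is what is usually called a dense rank. As a minimal sketch of a Python counterpart (dense_rank is a hypothetical helper name, and the input is assumed to be one-dimensional, numeric, and free of missing values), either NumPy or pandas can produce it:

```python
import numpy as np
import pandas as pd


def dense_rank(values):
    """Hypothetical helper mimicking as.numeric(as.factor(x)) for a 1-D numeric input."""
    # np.unique sorts the distinct values; return_inverse gives, for every
    # element, the 0-based position of its value within that sorted array.
    _, inverse = np.unique(np.asarray(values).ravel(), return_inverse=True)
    return inverse + 1  # shift to 1-based codes, like R factor levels


x = [0.30, 0.10, 0.30, 0.20]
ranks = dense_rank(x)  # array([3, 1, 3, 2])

# pandas offers the same ranking through Series.rank(method="dense")
pandas_ranks = pd.Series(x).rank(method="dense").astype(int).to_numpy()
```

Both results are 1-based, and the two tied 0.30 values share rank 3, matching the behavior described for ties.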

Note that this snippet does not work on matrices, only on vectors.
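For 1: the nested loop can probably be replaced with NumPy boolean masks. The following is only a sketch modelled on findCorrelation_fast, not a verified port: find_correlated_fast is a hypothetical name, it returns 0-based positions in sorted order (the R code returns 1-based positions in order of first appearance), and it compares the raw mean absolute correlations directly, on the assumption that the dense-rank step does not change which of two distinct means is larger.

```python
import numpy as np
import pandas as pd


def find_correlated_fast(corrmat, cutoff=0.9):
    """Loop-free sketch modelled on caret's findCorrelation_fast (hypothetical name)."""
    x = np.abs(np.asarray(corrmat, dtype=float))
    if x.ndim != 2 or x.shape[0] != x.shape[1]:
        raise ValueError("Correlation matrix is not square")
    if np.isnan(x).any():
        raise ValueError("The correlation matrix has some missing values.")

    avg_corr = x.mean(axis=0)  # counterpart of colMeans(abs(x))

    # Row/column positions of strict-upper-triangle entries above the cutoff.
    rows, cols = np.nonzero(np.triu(x > cutoff, k=1))

    # For each flagged pair, discard the member with the larger mean |r|;
    # on a tie the row member is discarded, as in the R code.
    drop_col = avg_corr[cols] > avg_corr[rows]
    return np.unique(np.concatenate([cols[drop_col], rows[~drop_col]]))


# Small hand-made correlation matrix: a and b are highly correlated.
corr = pd.DataFrame(
    [[1.00, 0.95, 0.10],
     [0.95, 1.00, 0.20],
     [0.10, 0.20, 1.00]],
    index=list("abc"), columns=list("abc"),
)
to_drop = find_correlated_fast(corr, cutoff=0.9)
dropped_names = list(corr.columns[to_drop])  # ['b']
```

Here b is dropped rather than a because its mean absolute correlation (about 0.72) is higher than that of a (about 0.68). The positions can be passed on as df.drop(columns=df.columns[to_drop]).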