Warning from findCorrelation in the R caret package, "Combination row and column is above the cut-off", instead of getting returned values

Time: 2015-08-21 18:12:18

Tags: r correlation pattern-recognition r-caret

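I am currently trying to filter the variables in my data by correlation, using the caret package in R (running in RStudio on my Mac).

So far I can compute and print the correlations of the data set. However, as soon as I apply the findCorrelation method, I do not get any data returned. I only receive the following warning:

"Combination row and column is above the cut-off, value = Flagging column"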

library(caret)
preProcessAttributeClass <- function(data.convert) {
  classe <- data.convert$classe
  data.convert <- as.data.frame(sapply(data.convert, as.numeric))
  data.convert$X.1 <- NULL
  data.convert$X <- NULL
  data.convert$user_name <- NULL
  data.convert$raw_timestamp_part_1 <- NULL
  data.convert$raw_timestamp_part_2 <- NULL
  data.convert$cvtd_timestamp <- NULL
  data.convert$new_window <- NULL
  data.convert$num_window <- NULL
  data.convert
}

data.train <- read.csv(file = "training.csv", na.strings = c("NA", ""))
data.train <- preProcessAttributeClass(data.train)
descrCor <- cor(na.omit(data.train), use = "complete.obs")
highlyCorDescr <- findCorrelation(na.omit(descrCor), cutoff = .9, verbose = TRUE, names = FALSE)

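Any ideas what might be causing my problem?

1 Answer:

Answer 0: (score: 2)

I think the problem is your correlation matrix. After na.omit() it has no rows left: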

> class(na.omit(descrCor))
[1] "matrix"
> dim(na.omit(descrCor))
[1]   0 153
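
One way a correlation matrix can end up with zero rows is sketched below. The toy data frame and its column names are made up for illustration and are not taken from the question's data set; whether this is exactly what happens with training.csv is an assumption. A single zero-variance column puts an NA into every row of the matrix, and na.omit() on a matrix drops every row that contains an NA.

```r
# Toy data (hypothetical): two random columns and one constant column.
set.seed(1)
toy <- data.frame(a = rnorm(20), b = rnorm(20), const = rep(1, 20))

# cor() warns that the standard deviation is zero and returns NA for
# every off-diagonal entry that involves the constant column.
toy_cor <- suppressWarnings(cor(toy))

# Each row now contains at least one NA, so na.omit() removes them all
# while the columns stay in place.
emptied <- na.omit(toy_cor)
dim(emptied)
```

Passing such an emptied matrix to a correlation filter leaves it nothing to compare, so it seems safer to remove the problematic columns from the data before calling cor() than to call na.omit() on the matrix afterwards.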

The data contain many columns that consist almost entirely of missing values:

> pct_na <- unlist(lapply(data.train, function(x) mean(is.na(x))))
> summary(pct_na)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.9793  0.6401  0.9793  0.9793 

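I would question whether columns that are more than 95% missing are useful at all, but in any case they prevent you from getting a usable correlation matrix. I suggest running the correlation filter on fewer columns: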

> sum(pct_na > .1)
[1] 100
> keepers <- data.train[,names(which(pct_na <= .1))]
> descrCor <- cor(keepers ,use="complete.obs")
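
The same column filter can be wrapped in a small helper. This is only a sketch: `drop_sparse_columns`, its `max_na` argument, and the toy data are hypothetical names introduced here, and the 10% threshold simply mirrors the one used above.

```r
# Hypothetical helper: keep only the columns whose share of NA values
# is at most `max_na`.
drop_sparse_columns <- function(df, max_na = 0.1) {
  pct_na <- colMeans(is.na(df))
  df[, pct_na <= max_na, drop = FALSE]
}

# Toy data (made up): `sparse` is 90% missing, the other two are complete.
set.seed(2)
toy <- data.frame(
  x = rnorm(50),
  y = rnorm(50),
  sparse = c(rnorm(5), rep(NA, 45))
)

dense <- drop_sparse_columns(toy)
dense_cor <- cor(dense, use = "complete.obs")
dense_cor
```

Using `drop = FALSE` keeps the result a data frame even if only one column survives the filter.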

Most of the remaining columns are either essentially uncorrelated or very highly correlated:

> summary(descrCor[upper.tri(descrCor)])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.992000 -0.108800  0.001911  0.001667  0.088680  0.980900 
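
To see which pairs sit in the tails of that distribution, the upper triangle can be searched directly. The sketch below uses made-up data (`v1`, `v2`, `v3` are hypothetical columns) and base R only; it merely lists the pairs above a cutoff and does not decide which column of each pair to drop.

```r
# Toy data (made up): v2 is nearly a copy of v1, v3 is independent noise.
set.seed(3)
v1 <- rnorm(100)
toy <- data.frame(v1 = v1, v2 = v1 + rnorm(100, sd = 0.05), v3 = rnorm(100))
toy_cor <- cor(toy)

cutoff <- 0.9
# Restrict to the upper triangle so each pair is reported only once.
high <- which(abs(toy_cor) > cutoff & upper.tri(toy_cor), arr.ind = TRUE)
high_pairs <- data.frame(
  row = rownames(toy_cor)[high[, "row"]],
  col = colnames(toy_cor)[high[, "col"]],
  corr = toy_cor[high]
)
high_pairs
```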

Now do the filtering:

> highlyCorDescr <- findCorrelation(descrCor, cutoff = .9, verbose = TRUE, names = FALSE)
Compare row 10  and column  1 with corr  0.992 
  Means:  0.266 vs 0.164 so flagging column 10 
Compare row 1  and column  9 with corr  0.925 
  Means:  0.247 vs 0.161 so flagging column 1 
Compare row 9  and column  4 with corr  0.928 
  Means:  0.229 vs 0.158 so flagging column 9 
Compare row 8  and column  2 with corr  0.966 
  Means:  0.24 vs 0.154 so flagging column 8 
Compare row 19  and column  18 with corr  0.918 
  Means:  0.089 vs 0.155 so flagging column 18 
Compare row 46  and column  31 with corr  0.914 
  Means:  0.099 vs 0.158 so flagging column 31 
Compare row 46  and column  33 with corr  0.933 
  Means:  0.081 vs 0.161 so flagging column 33 
All correlations <= 0.9 
> keep_these <- names(data.train)[!(names(data.train) %in% colnames(descrCor)[highlyCorDescr])]
> data.train.subset <- data.train[, keep_these]
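
A note on why the name-based exclusion above is a reasonable choice: it stays correct even when no column is flagged. The sketch below uses a made-up data frame and assumes that the filter returns an empty integer vector when nothing exceeds the cutoff; in that case, negative indexing would silently drop every column.

```r
# Toy data (hypothetical).
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Pretend the filter flagged nothing (assumption: an empty integer vector).
flagged <- integer(0)

# Negative indexing with an empty vector selects zero columns.
by_index <- toy[, -flagged, drop = FALSE]
ncol(by_index)

# Name-based exclusion keeps every column when nothing is flagged.
keep_these <- names(toy)[!(names(toy) %in% names(toy)[flagged])]
by_name <- toy[, keep_these, drop = FALSE]
ncol(by_name)
```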