Cramer's V with missing values gives different results

Time: 2015-03-20 19:24:42

Tags: r correlation

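My question concerns computing Cramer's V to detect association between categorical variables. My real data set has missing values, so I created a fake data set for illustration with two variables, a and b, in which the missing values are coded as empty strings.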

a <- factor(c("M","","F","F","","M","F","F"))
a2 <- factor(a, levels = c('M','F'), labels = c('Male','Female'))
b <- factor(c("y","y","","y","n","n","n","y"))
b2 <- factor(b, levels = c("y","n"), labels = c("yes","no"))
df <- cbind(a2,b2)

The assocstats function gives this result for Cramer's V:

require(vcd)
> tab <-table(a,b)
> assocstats(tab)
                    X^2 df P(> X^2)
Likelihood Ratio 1.7261  4  0.78597
Pearson          1.3333  4  0.85570

Phi-Coefficient   : 0.408 
Contingency Coeff.: 0.378 
Cramer's V        : 0.289

Now I want to drop the NAs from the levels:

   a[a==""]<-NA
    a3 <- droplevels(a)
    levels(a3)

    tab <-table(a,b)
    assocstats(tab)

But every time I remove the NAs, the result looks like this:

                    X^2 df P(> X^2)
Likelihood Ratio 0.13844  2  0.93312
Pearson              NaN  2      NaN

Phi-Coefficient   : NaN 
Contingency Coeff.: NaN 
Cramer's V        : NaN 

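Also, because I have a large data set, I would like to compute a matrix of Cramer's V values. I found this code on Stack Overflow, and it seems to work...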

get.V<-function(y){
  col.y<-ncol(y)
  V<-matrix(ncol=col.y,nrow=col.y)
  for(i in 1:col.y){
    for(j in 1:col.y){
      V[i,j]<-assocstats(table(y[,i],y[,j]))$cramer
    }
  }
  return(V)
}

   get.V(tab)

Only the results differ from those of the assocstats function:

   [,1] [,2] [,3]
[1,]  1.0  0.5    1
[2,]  0.5  1.0    1
[3,]  1.0  1.0    1

This cannot be right, because I get this result every time, even when I change the number of observations... What is wrong with this code?

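Conclusion: I don't know which result is right. I have a large data set with a lot of NAs. The first assocstats result and the get.V code give different values, although the difference is not large. The second assocstats call gives only NaN. I cannot find the error... Can someone help me?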

1 Answer:

Answer 0: (score: 0)

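If you are using factors, you don't have to replace "" with NA: any unique value that you don't define in levels will become NA when converting to a factor:
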
a <-  factor(c("M","","F","F","","M","F","F"))
a2 <- factor(a, levels = c('M','F'),labels =c('Male','Female'))

a
# [1] M   F F   M F F
# Levels:  F M
a2
# [1] Male   <NA>   Female Female <NA>   Male   Female Female
# Levels: Male Female

b <- factor(c("y","y","","y","n","n","n","y"))
b2 <- factor(b, levels=c("y","n"),labels=c("yes","no"))


(df <- cbind(a2,b2))

#      a2 b2
# [1,]  1  1
# [2,] NA  1
# [3,]  2 NA
# [4,]  2  1
# [5,] NA  2
# [6,]  1  2
# [7,]  2  2
# [8,]  2  1

Above, cbind creates a matrix, which loses all the labels you created with factor and keeps only the integer codes. I think you want a data frame:

(df <- data.frame(a2,b2))

#       a2   b2
# 1   Male  yes
# 2   <NA>  yes
# 3 Female <NA>
# 4 Female  yes
# 5   <NA>   no
# 6   Male   no
# 7 Female   no
# 8 Female  yes


require('vcd')
(tab <- table(a2,b2, useNA = 'ifany'))
#          b2
# a2       yes no <NA>
#   Male     1  1    0
#   Female   2  1    1
#   <NA>     1  1    0

(tab <- table(a2,b2))
#          b2
# a2       yes no
#   Male     1  1
#   Female   2  1

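You need to tell table explicitly (with useNA) if you want to see the NA values in the table. Otherwise it drops them by default, so you are already "excluding" them when you use assocstats:
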
assocstats(tab)

#                      X^2 df P(> X^2)
# Likelihood Ratio 0.13844  1  0.70983
# Pearson          0.13889  1  0.70939
# 
# Phi-Coefficient   : 0.167 
# Contingency Coeff.: 0.164 
# Cramer's V        : 0.167 
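
As a cross-check, Cramer's V can be computed by hand from the Pearson statistic as V = sqrt(X^2 / (n * min(r - 1, c - 1))). The sketch below uses base R only; `cramers_v` is a hypothetical helper name, not part of vcd. It also illustrates what most likely produced the NaN in the question: the table there was built from `a` rather than from the droplevels() result `a3`, so the empty level "" stayed in the table as an all-zero row, and zero expected counts give 0/0 in the Pearson statistic.

```r
# Hypothetical helper (not part of vcd): Cramer's V from a two-way table.
cramers_v <- function(tab) {
  # Drop all-zero rows/columns, e.g. those left behind by unused factor levels.
  tab <- tab[rowSums(tab) > 0, colSums(tab) > 0, drop = FALSE]
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n
  x2 <- sum((tab - expected)^2 / expected)  # Pearson X^2, no continuity correction
  sqrt(x2 / (n * min(nrow(tab) - 1, ncol(tab) - 1)))
}

a <- factor(c("M", "", "F", "F", "", "M", "F", "F"))
b <- factor(c("y", "y", "", "y", "n", "n", "n", "y"))
a[a == ""] <- NA
b[b == ""] <- NA

tab_unused <- table(a, b)                          # still carries the empty "" level
tab_clean  <- table(droplevels(a), droplevels(b))  # 2 x 2, unused level removed

rowSums(tab_unused)   # the first row (level "") sums to zero
cramers_v(tab_clean)  # about 0.167, in line with the assocstats output
```

Dropping the empty rows and columns inside the helper is a design choice made for this sketch; calling droplevels() on each factor before building the table achieves the same thing.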

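For get.V, pass only a data frame or a matrix, not a table. Given a table, the function treats each column of counts as if it were a variable: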

get.V <- function(y) {
  col.y <- ncol(y)
  V <- matrix(ncol=col.y,nrow=col.y)
  for(i in 1:col.y){
    for(j in 1:col.y){
      V[i,j] <- assocstats(table(y[,i],y[,j]))$cramer
    }
  }
  return(V)
}


get.V(df)

#           [,1]      [,2]
# [1,] 1.0000000 0.1666667
# [2,] 0.1666667 1.0000000
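
With many variables, an unlabeled matrix gets hard to read. The following is a minimal sketch of a variant that keeps the variable names as dimnames. It uses base R's chisq.test instead of assocstats, and `cramers_v` and `cramers_v_matrix` are hypothetical names introduced here for illustration. It assumes every column of the data frame is a factor.

```r
# Hypothetical helpers, base R only. Assumes x and y are factors.
cramers_v <- function(x, y) {
  # table() silently drops pairs where either value is NA.
  tab <- table(droplevels(x), droplevels(y))
  # Pearson X^2 without continuity correction; small-count warnings are muted.
  x2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(x2 / (sum(tab) * min(dim(tab) - 1))))
}

cramers_v_matrix <- function(df) {
  vars <- names(df)
  V <- matrix(NA_real_, length(vars), length(vars),
              dimnames = list(vars, vars))
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      V[i, j] <- cramers_v(df[[i]], df[[j]])
    }
  }
  V
}

a2 <- factor(c("M", "", "F", "F", "", "M", "F", "F"),
             levels = c("M", "F"), labels = c("Male", "Female"))
b2 <- factor(c("y", "y", "", "y", "n", "n", "n", "y"),
             levels = c("y", "n"), labels = c("yes", "no"))
V <- cramers_v_matrix(data.frame(a2, b2))
V
```

Note that each cell is computed from the rows that are complete for that particular pair of variables, so with many NAs the cells of the matrix can rest on different numbers of observations.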