计算多个变量的Kolmogorov Smirnov统计

时间:2019-04-06 20:43:43

标签: r

我是R编码领域的初学者。 我想为多个变量计算KS stat,从而变量在excel中。 我一直在尝试将计算从一个变量循环到另一个变量,并将KS结果存储在数据帧中。 我没有问题可以为单个变量计算KS统计信息FINALGRADE。 以下是我具有多个变量的数据tmp。

问题:如何获取KS r代码以从一个变量计算到另一个变量,并将KS结果存储在数据框中?

KS统计信息是默认客户和非默认客户得分之间的差异。

    ID  Default FINALGRADE  FINALSCORE  PREOVERRIDESCORE    SUBJECTIVESCORE FINANCIALSCORE
    10009011    0   8   67.65854557 67.65854557 68.36424313 60.2136826
    10020003    0   7   72.18560889 72.18560889 70.97483009 64.35831722
    10020003    0   6   77.23072833 77.23072833 69.87370952 71.53180821
    10021201    0   14  40.21338437 40.21338437 58.06865599 40.54564338
    10021201    0   8   68.79085151 68.79085151 72.59254723 58.91827403
    10022730    0   4   84.47284986 84.47284986 78.03588557 77.85944161
    10022731    0   5   78.28775535 78.28775535 82.07915713 64.45948626
    10025555    0   15  7.907947702 7.907947702 57.95049201 4.075100629
    10025555    0   13  1.75            47.15981982 72.56744037 39.16338519
    10025763    0   15  66.39063143 66.39063143 79.10054245 52.66288527
    10029315    1   14  40.36515221 40.36515221 57.9586825  40.78027744
    10030999    0   17  25.78498104 25.78498104 84.37428799 16.36896422
    10030999    0   13  47.90043592 47.90043592 78.97405559 36.28646008
    10033303    0   10  58.50724135 58.50724135 74.95635833 47.05689989
    10033938    0   15  32.79988473 37.79988473 45.90931406 43.84648718
    10039393    1   8   67.31395864 67.31395864 74.81030489 55.26979858
    10039780    0   9   64.94318991 69.94318991 69.44595762 62.06825469
    10040777    0   13  44.93908421 44.93908421 81.83346015 32.38398138
    10041213    0   15  33.05768436 33.05768436 73.75578861 27.6882957
    10041213    0   15  35.39463308 35.39463308 73.75578861 28.95912606
    10045566    1   8   70.60067856 70.60067856 70.87753432 61.88535995
    10045566    0   10  58.50956434 58.50956434 70.87753432 49.89960356
    10045692    0   12  50.52222802 50.52222802 50.91083454 52.10279587
    10045692    0   10  59.17371704 59.17371704 57.49697166 57.37504351
    10046390    1   10  60.47796914 60.47796914 67.94551866 52.29460738
    10047830    0   12  51.46066369 51.46066369 79.14482394 39.16019407
    10048824    0   13  50.86887099 50.86887099 65.6366083  46.18752406
    10048824    0   12  49.82958553 49.82958553 60.56566557 47.97788939
    10050504    0   8   67.47839481 67.47839481 72.53163793 58.4371572
    10050504    0   7   73.7608865  73.7608865  69.49809267 67.26984194
# calculate KS
> n_S <- length(tmp$FINALGRADE)

> d <- sum(tmp$Default)

> g <- sum(tmp$Default==0)

> x_S <- NULL

> y_S <- NULL

> z_S <- NULL

>defaultcnt_s <- 0

> goodcnt_s <- 0

> ordereddata <-tmp[order(tmp$FINALGRADE),]

> default <-  ifelse((ordereddata$Default == 0), 0, 1)

> good <- ifelse((ordereddata$Default == 0), 1, 0)

> for (i in 1:n_S)

> {x_S[i] = i/n_S

> defaultcnt_s <- defaultcnt_s + default[i]

> goodcnt_s <- goodcnt_s + good[i]

> y_S[i] <- defaultcnt_s/d

> z_S[i] <- goodcnt_s/g

> }

> K_S <- abs(y_S[which.max(abs(y_S-z_S))]-z_S[which.max(abs(y_S-z_S))])

> ks.test(y_S,z_S,alternative = c('two.sided','less','greater'))
    Variable    FINALGRADE  FINALSCORE  PREOVERRIDESCORE    SUBJECTIVESCORE FINANCIALSCORE  
    KS          …           …           …                   …           …

3 个答案:

答案 0 :(得分:0)

这是一种方法,在基数R中。(末尾的数据。)

您想进行成对比较/测试。函数expand.grid将为您提供2个或更多向量的每个成对组合。 (它可以与单个向量一起使用,但这没那么有趣。)

cn <- colnames(x)[-(1:2)] # don't need ID, Default
eg <- expand.grid(x=cn, y=cn, stringsAsFactors = FALSE)
nrow(eg)
# [1] 25
head(eg)
#                  x          y
# 1       FINALGRADE FINALGRADE
# 2       FINALSCORE FINALGRADE
# 3 PREOVERRIDESCORE FINALGRADE
# 4  SUBJECTIVESCORE FINALGRADE
# 5   FINANCIALSCORE FINALGRADE
# 6       FINALGRADE FINALSCORE

我们可以删除测试“第1列与第1列”。我们还可以减少测试“ 1对2” “ 2对1”(结果相同),因此我们也将删除这些重复项(通过逐行排序并删除重复项)。

eg <- eg[ eg$x != eg$y, ]
eg$x1 <- ifelse(eg$x < eg$y, eg$x, eg$y)
eg$y <- ifelse(eg$x < eg$y, eg$y, eg$x)
eg$x <- eg$x1
eg$x1 <- NULL # remove the added column
eg <- eg[ !duplicated(eg), ]
head(eg)
#            x                y
# 2 FINALGRADE       FINALSCORE
# 3 FINALGRADE PREOVERRIDESCORE
# 4 FINALGRADE  SUBJECTIVESCORE
# 5 FINALGRADE   FINANCIALSCORE
# 8 FINALSCORE PREOVERRIDESCORE
# 9 FINALSCORE  SUBJECTIVESCORE
nrow(eg)
# [1] 10

现在我们可以进行测试了。

results <- Map(function(i1, i2) ks.test(x[,i1], x[,i2]),
               eg$x, eg$y)
# Warning in ks.test(x[, i1], x[, i2]) :
#   cannot compute exact p-value with ties
### repeated total of eight times ... it's a problem with the data

results[[1]]
#   Two-sample Kolmogorov-Smirnov test
# data:  x[, i1] and x[, i2]
# D = 0.93333, p-value = 8.943e-12
# alternative hypothesis: two-sided

我们现在将所有10个测试都放在一个命名列表中。不幸的是,这些名称并不是十分有用,因为它们目前仅包含第一个变量,而不能同时包含两个变量。

names(results)
#  [1] "FINALGRADE"       "FINALGRADE"       "FINALGRADE"       "FINALGRADE"      
#  [5] "FINALSCORE"       "FINALSCORE"       "FINALSCORE"       "PREOVERRIDESCORE"
#  [9] "FINANCIALSCORE"   "FINANCIALSCORE"  

解决这个问题很容易。

names(results) <- paste(eg$x, eg$y, sep = "_")
str(results[1:2])
# List of 2
#  $ FINALGRADE_FINALSCORE      :List of 5
#   ..$ statistic  : Named num 0.933
#   .. ..- attr(*, "names")= chr "D"
#   ..$ p.value    : num 8.94e-12
#   ..$ alternative: chr "two-sided"
#   ..$ method     : chr "Two-sample Kolmogorov-Smirnov test"
#   ..$ data.name  : chr "x[, i1] and x[, i2]"
#   ..- attr(*, "class")= chr "htest"
#  $ FINALGRADE_PREOVERRIDESCORE:List of 5
#   ..$ statistic  : Named num 0.967
#   .. ..- attr(*, "names")= chr "D"
#   ..$ p.value    : num 1.34e-12
#   ..$ alternative: chr "two-sided"
#   ..$ method     : chr "Two-sample Kolmogorov-Smirnov test"
#   ..$ data.name  : chr "x[, i1] and x[, i2]"
#   ..- attr(*, "class")= chr "htest"

您可能希望正义每个测试中的p.value(或其他统计信息)。可以很容易地通过lapply(返回list)或sapply(通常返回向量)来实现:

sapply(results, `[[`, "p.value")
#            FINALGRADE_FINALSCORE      FINALGRADE_PREOVERRIDESCORE 
#                     8.942624e-12                     1.337597e-12 
#       FINALGRADE_SUBJECTIVESCORE        FINALGRADE_FINANCIALSCORE 
#                     1.871836e-13                     8.942624e-12 
#      FINALSCORE_PREOVERRIDESCORE       FINALSCORE_SUBJECTIVESCORE 
#                     9.999999e-01                     1.106169e-03 
#        FINALSCORE_FINANCIALSCORE PREOVERRIDESCORE_SUBJECTIVESCORE 
#                     1.350035e-01                     2.908301e-03 
#  FINANCIALSCORE_PREOVERRIDESCORE   FINANCIALSCORE_SUBJECTIVESCORE 
#                     1.350035e-01                     3.239194e-06 

或者,您可以将其添加到eg框架中,而不是处理单个矢量等:

head(eg)
#            x                y      p.value
# 2 FINALGRADE       FINALSCORE 8.942624e-12
# 3 FINALGRADE PREOVERRIDESCORE 1.337597e-12
# 4 FINALGRADE  SUBJECTIVESCORE 1.871836e-13
# 5 FINALGRADE   FINANCIALSCORE 8.942624e-12
# 8 FINALSCORE PREOVERRIDESCORE 9.999999e-01
# 9 FINALSCORE  SUBJECTIVESCORE 1.106169e-03

如果您想要所有这些,也可以轻松做到:

cbind(eg, sapply(c("statistic", "p.value", "alternative"),
                 function(nm) sapply(results, `[[`, nm),
                 simplify = FALSE),
      stringsAsFactors = FALSE)
#                   x                y      p.value  statistic      p.value alternative
# 2        FINALGRADE       FINALSCORE 8.942624e-12 0.93333333 8.942624e-12   two-sided
# 3        FINALGRADE PREOVERRIDESCORE 1.337597e-12 0.96666667 1.337597e-12   two-sided
# 4        FINALGRADE  SUBJECTIVESCORE 1.871836e-13 1.00000000 1.871836e-13   two-sided
# 5        FINALGRADE   FINANCIALSCORE 8.942624e-12 0.93333333 8.942624e-12   two-sided
# 8        FINALSCORE PREOVERRIDESCORE 9.999999e-01 0.06666667 9.999999e-01   two-sided
# 9        FINALSCORE  SUBJECTIVESCORE 1.106169e-03 0.50000000 1.106169e-03   two-sided
# 10       FINALSCORE   FINANCIALSCORE 1.350035e-01 0.30000000 1.350035e-01   two-sided
# 14 PREOVERRIDESCORE  SUBJECTIVESCORE 2.908301e-03 0.46666667 2.908301e-03   two-sided
# 15   FINANCIALSCORE PREOVERRIDESCORE 1.350035e-01 0.30000000 1.350035e-01   two-sided
# 20   FINANCIALSCORE  SUBJECTIVESCORE 3.239194e-06 0.66666667 3.239194e-06   two-sided

数据:

x <- read.table(header=TRUE, text='
ID  Default FINALGRADE  FINALSCORE  PREOVERRIDESCORE    SUBJECTIVESCORE FINANCIALSCORE
10009011    0   8   67.65854557 67.65854557 68.36424313 60.2136826
10020003    0   7   72.18560889 72.18560889 70.97483009 64.35831722
10020003    0   6   77.23072833 77.23072833 69.87370952 71.53180821
10021201    0   14  40.21338437 40.21338437 58.06865599 40.54564338
10021201    0   8   68.79085151 68.79085151 72.59254723 58.91827403
10022730    0   4   84.47284986 84.47284986 78.03588557 77.85944161
10022731    0   5   78.28775535 78.28775535 82.07915713 64.45948626
10025555    0   15  7.907947702 7.907947702 57.95049201 4.075100629
10025555    0   13  1.75            47.15981982 72.56744037 39.16338519
10025763    0   15  66.39063143 66.39063143 79.10054245 52.66288527
10029315    1   14  40.36515221 40.36515221 57.9586825  40.78027744
10030999    0   17  25.78498104 25.78498104 84.37428799 16.36896422
10030999    0   13  47.90043592 47.90043592 78.97405559 36.28646008
10033303    0   10  58.50724135 58.50724135 74.95635833 47.05689989
10033938    0   15  32.79988473 37.79988473 45.90931406 43.84648718
10039393    1   8   67.31395864 67.31395864 74.81030489 55.26979858
10039780    0   9   64.94318991 69.94318991 69.44595762 62.06825469
10040777    0   13  44.93908421 44.93908421 81.83346015 32.38398138
10041213    0   15  33.05768436 33.05768436 73.75578861 27.6882957
10041213    0   15  35.39463308 35.39463308 73.75578861 28.95912606
10045566    1   8   70.60067856 70.60067856 70.87753432 61.88535995
10045566    0   10  58.50956434 58.50956434 70.87753432 49.89960356
10045692    0   12  50.52222802 50.52222802 50.91083454 52.10279587
10045692    0   10  59.17371704 59.17371704 57.49697166 57.37504351
10046390    1   10  60.47796914 60.47796914 67.94551866 52.29460738
10047830    0   12  51.46066369 51.46066369 79.14482394 39.16019407
10048824    0   13  50.86887099 50.86887099 65.6366083  46.18752406
10048824    0   12  49.82958553 49.82958553 60.56566557 47.97788939
10050504    0   8   67.47839481 67.47839481 72.53163793 58.4371572
10050504    0   7   73.7608865  73.7608865  69.49809267 67.26984194')

答案 1 :(得分:0)

您也可以尝试:

results <- outer( 
  1:ncol(df), 1:ncol(df), 
  Vectorize(
    function (i,j) ks.test(df[,i], df[,j], alternative = c('two.sided'))$p.value
  ) 
)
#Because the output is matrix we can convert it to data.frame and set the names of the columns 
results <- as.data.frame(results)
names(results) <- names(df)

结果:

                          ID      Default   FINALGRADE
ID               1.000000e+00 1.871836e-13 1.871836e-13
Default          1.871836e-13 1.000000e+00 1.871836e-13
FINALGRADE       1.871836e-13 1.871836e-13 1.000000e+00
FINALSCORE       1.871836e-13 1.871836e-13 8.942624e-12
PREOVERRIDESCORE 1.871836e-13 1.871836e-13 1.337597e-12
SUBJECTIVESCORE  1.871836e-13 1.871836e-13 1.871836e-13
FINANCIALSCORE   1.871836e-13 1.871836e-13 8.942624e-12
                   FINALSCORE PREOVERRIDESCORE SUBJECTIVESCORE
ID               1.871836e-13     1.871836e-13    1.871836e-13
Default          1.871836e-13     1.871836e-13    1.871836e-13
FINALGRADE       8.942624e-12     1.337597e-12    1.871836e-13
FINALSCORE       1.000000e+00     9.999999e-01    1.106169e-03
PREOVERRIDESCORE 9.999999e-01     1.000000e+00    2.908301e-03
SUBJECTIVESCORE  1.106169e-03     2.908301e-03    1.000000e+00
FINANCIALSCORE   1.350035e-01     1.350035e-01    3.239194e-06
                 FINANCIALSCORE
ID                 1.871836e-13
Default            1.871836e-13
FINALGRADE         8.942624e-12
FINALSCORE         1.350035e-01
PREOVERRIDESCORE   1.350035e-01
SUBJECTIVESCORE    3.239194e-06

答案 2 :(得分:0)

非常感谢@ r2evans和@DJV的回复。

感谢您能进一步解决我的问题。

也许我没有清楚地解释我的KS计算。实际上,我需要计算最大得分图的KS /默认(默认= 1)和良好(默认= 0)之间的得分差。

这就是为什么我有这部分代码来查看良好的发行版和默认发行版的原因。因此,基本上,我计算KS以查看FINALGRADE(默认= 1)与FINALGRADE(默认= 0)之间的差异。

  

y_S [i] <-defaultcnt_s / d

     

z_S [i] <-goodcnt_s / g

     

K_S <-abs(y_S [哪个.max(abs(y_S-z_S))]-z_S [哪个最大(abs(y_S-z_S))])

附上FINALGRADE KS的理论和简单说明,即好与坏的最大区别:

FINALGRADE KS which is the max difference of good vs. bad