我是R编码领域的初学者。
我想为多个变量计算KS stat,从而变量在excel中。
我一直在尝试将计算从一个变量循环到另一个变量,并将KS结果存储在数据帧中。
我没有问题可以为单个变量计算KS统计信息FINALGRADE
。
以下是我具有多个变量的数据tmp。
问题:如何获取KS r代码以从一个变量计算到另一个变量,并将KS结果存储在数据框中?
KS统计信息是默认客户和非默认客户得分之间的差异。
ID Default FINALGRADE FINALSCORE PREOVERRIDESCORE SUBJECTIVESCORE FINANCIALSCORE
10009011 0 8 67.65854557 67.65854557 68.36424313 60.2136826
10020003 0 7 72.18560889 72.18560889 70.97483009 64.35831722
10020003 0 6 77.23072833 77.23072833 69.87370952 71.53180821
10021201 0 14 40.21338437 40.21338437 58.06865599 40.54564338
10021201 0 8 68.79085151 68.79085151 72.59254723 58.91827403
10022730 0 4 84.47284986 84.47284986 78.03588557 77.85944161
10022731 0 5 78.28775535 78.28775535 82.07915713 64.45948626
10025555 0 15 7.907947702 7.907947702 57.95049201 4.075100629
10025555 0 13 1.75 47.15981982 72.56744037 39.16338519
10025763 0 15 66.39063143 66.39063143 79.10054245 52.66288527
10029315 1 14 40.36515221 40.36515221 57.9586825 40.78027744
10030999 0 17 25.78498104 25.78498104 84.37428799 16.36896422
10030999 0 13 47.90043592 47.90043592 78.97405559 36.28646008
10033303 0 10 58.50724135 58.50724135 74.95635833 47.05689989
10033938 0 15 32.79988473 37.79988473 45.90931406 43.84648718
10039393 1 8 67.31395864 67.31395864 74.81030489 55.26979858
10039780 0 9 64.94318991 69.94318991 69.44595762 62.06825469
10040777 0 13 44.93908421 44.93908421 81.83346015 32.38398138
10041213 0 15 33.05768436 33.05768436 73.75578861 27.6882957
10041213 0 15 35.39463308 35.39463308 73.75578861 28.95912606
10045566 1 8 70.60067856 70.60067856 70.87753432 61.88535995
10045566 0 10 58.50956434 58.50956434 70.87753432 49.89960356
10045692 0 12 50.52222802 50.52222802 50.91083454 52.10279587
10045692 0 10 59.17371704 59.17371704 57.49697166 57.37504351
10046390 1 10 60.47796914 60.47796914 67.94551866 52.29460738
10047830 0 12 51.46066369 51.46066369 79.14482394 39.16019407
10048824 0 13 50.86887099 50.86887099 65.6366083 46.18752406
10048824 0 12 49.82958553 49.82958553 60.56566557 47.97788939
10050504 0 8 67.47839481 67.47839481 72.53163793 58.4371572
10050504 0 7 73.7608865 73.7608865 69.49809267 67.26984194
# calculate KS
> n_S <- length(tmp$FINALGRADE)
> d <- sum(tmp$Default)
> g <- sum(tmp$Default==0)
> x_S <- NULL
> y_S <- NULL
> z_S <- NULL
>defaultcnt_s <- 0
> goodcnt_s <- 0
> ordereddata <-tmp[order(tmp$FINALGRADE),]
> default <- ifelse((ordereddata$Default == 0), 0, 1)
> good <- ifelse((ordereddata$Default == 0), 1, 0)
> for (i in 1:n_S)
> {x_S[i] = i/n_S
> defaultcnt_s <- defaultcnt_s + default[i]
> goodcnt_s <- goodcnt_s + good[i]
> y_S[i] <- defaultcnt_s/d
> z_S[i] <- goodcnt_s/g
> }
> K_S <- abs(y_S[which.max(abs(y_S-z_S))]-z_S[which.max(abs(y_S-z_S))])
> ks.test(y_S,z_S,alternative = c('two.sided','less','greater'))
Variable FINALGRADE FINALSCORE PREOVERRIDESCORE SUBJECTIVESCORE FINANCIALSCORE
KS … … … … …
答案 0 :(得分:0)
这是一种方法,在基数R中。(末尾的数据。)
您想进行成对比较/测试。函数expand.grid
将为您提供2个或更多向量的每个成对组合。 (它可以与单个向量一起使用,但这没那么有趣。)
cn <- colnames(x)[-(1:2)] # don't need ID, Default
eg <- expand.grid(x=cn, y=cn, stringsAsFactors = FALSE)
nrow(eg)
# [1] 25
head(eg)
# x y
# 1 FINALGRADE FINALGRADE
# 2 FINALSCORE FINALGRADE
# 3 PREOVERRIDESCORE FINALGRADE
# 4 SUBJECTIVESCORE FINALGRADE
# 5 FINANCIALSCORE FINALGRADE
# 6 FINALGRADE FINALSCORE
我们可以删除测试“第1列与第1列”。我们还可以减少测试“ 1对2” 和“ 2对1”(结果相同),因此我们也将删除这些重复项(通过逐行排序并删除重复项)。>
eg <- eg[ eg$x != eg$y, ]
eg$x1 <- ifelse(eg$x < eg$y, eg$x, eg$y)
eg$y <- ifelse(eg$x < eg$y, eg$y, eg$x)
eg$x <- eg$x1
eg$x1 <- NULL # remove the added column
eg <- eg[ !duplicated(eg), ]
head(eg)
# x y
# 2 FINALGRADE FINALSCORE
# 3 FINALGRADE PREOVERRIDESCORE
# 4 FINALGRADE SUBJECTIVESCORE
# 5 FINALGRADE FINANCIALSCORE
# 8 FINALSCORE PREOVERRIDESCORE
# 9 FINALSCORE SUBJECTIVESCORE
nrow(eg)
# [1] 10
现在我们可以进行测试了。
results <- Map(function(i1, i2) ks.test(x[,i1], x[,i2]),
eg$x, eg$y)
# Warning in ks.test(x[, i1], x[, i2]) :
# cannot compute exact p-value with ties
### repeated total of eight times ... it's a problem with the data
results[[1]]
# Two-sample Kolmogorov-Smirnov test
# data: x[, i1] and x[, i2]
# D = 0.93333, p-value = 8.943e-12
# alternative hypothesis: two-sided
我们现在将所有10个测试都放在一个命名列表中。不幸的是,这些名称并不是十分有用,因为它们目前仅包含第一个变量,而不能同时包含两个变量。
names(results)
# [1] "FINALGRADE" "FINALGRADE" "FINALGRADE" "FINALGRADE"
# [5] "FINALSCORE" "FINALSCORE" "FINALSCORE" "PREOVERRIDESCORE"
# [9] "FINANCIALSCORE" "FINANCIALSCORE"
解决这个问题很容易。
names(results) <- paste(eg$x, eg$y, sep = "_")
str(results[1:2])
# List of 2
# $ FINALGRADE_FINALSCORE :List of 5
# ..$ statistic : Named num 0.933
# .. ..- attr(*, "names")= chr "D"
# ..$ p.value : num 8.94e-12
# ..$ alternative: chr "two-sided"
# ..$ method : chr "Two-sample Kolmogorov-Smirnov test"
# ..$ data.name : chr "x[, i1] and x[, i2]"
# ..- attr(*, "class")= chr "htest"
# $ FINALGRADE_PREOVERRIDESCORE:List of 5
# ..$ statistic : Named num 0.967
# .. ..- attr(*, "names")= chr "D"
# ..$ p.value : num 1.34e-12
# ..$ alternative: chr "two-sided"
# ..$ method : chr "Two-sample Kolmogorov-Smirnov test"
# ..$ data.name : chr "x[, i1] and x[, i2]"
# ..- attr(*, "class")= chr "htest"
您可能希望正义每个测试中的p.value
(或其他统计信息)。可以很容易地通过lapply
(返回list
)或sapply
(通常返回向量)来实现:
sapply(results, `[[`, "p.value")
# FINALGRADE_FINALSCORE FINALGRADE_PREOVERRIDESCORE
# 8.942624e-12 1.337597e-12
# FINALGRADE_SUBJECTIVESCORE FINALGRADE_FINANCIALSCORE
# 1.871836e-13 8.942624e-12
# FINALSCORE_PREOVERRIDESCORE FINALSCORE_SUBJECTIVESCORE
# 9.999999e-01 1.106169e-03
# FINALSCORE_FINANCIALSCORE PREOVERRIDESCORE_SUBJECTIVESCORE
# 1.350035e-01 2.908301e-03
# FINANCIALSCORE_PREOVERRIDESCORE FINANCIALSCORE_SUBJECTIVESCORE
# 1.350035e-01 3.239194e-06
或者,您可以将其添加到eg
框架中,而不是处理单个矢量等:
head(eg)
# x y p.value
# 2 FINALGRADE FINALSCORE 8.942624e-12
# 3 FINALGRADE PREOVERRIDESCORE 1.337597e-12
# 4 FINALGRADE SUBJECTIVESCORE 1.871836e-13
# 5 FINALGRADE FINANCIALSCORE 8.942624e-12
# 8 FINALSCORE PREOVERRIDESCORE 9.999999e-01
# 9 FINALSCORE SUBJECTIVESCORE 1.106169e-03
如果您想要所有这些,也可以轻松做到:
cbind(eg, sapply(c("statistic", "p.value", "alternative"),
function(nm) sapply(results, `[[`, nm),
simplify = FALSE),
stringsAsFactors = FALSE)
# x y p.value statistic p.value alternative
# 2 FINALGRADE FINALSCORE 8.942624e-12 0.93333333 8.942624e-12 two-sided
# 3 FINALGRADE PREOVERRIDESCORE 1.337597e-12 0.96666667 1.337597e-12 two-sided
# 4 FINALGRADE SUBJECTIVESCORE 1.871836e-13 1.00000000 1.871836e-13 two-sided
# 5 FINALGRADE FINANCIALSCORE 8.942624e-12 0.93333333 8.942624e-12 two-sided
# 8 FINALSCORE PREOVERRIDESCORE 9.999999e-01 0.06666667 9.999999e-01 two-sided
# 9 FINALSCORE SUBJECTIVESCORE 1.106169e-03 0.50000000 1.106169e-03 two-sided
# 10 FINALSCORE FINANCIALSCORE 1.350035e-01 0.30000000 1.350035e-01 two-sided
# 14 PREOVERRIDESCORE SUBJECTIVESCORE 2.908301e-03 0.46666667 2.908301e-03 two-sided
# 15 FINANCIALSCORE PREOVERRIDESCORE 1.350035e-01 0.30000000 1.350035e-01 two-sided
# 20 FINANCIALSCORE SUBJECTIVESCORE 3.239194e-06 0.66666667 3.239194e-06 two-sided
数据:
x <- read.table(header=TRUE, text='
ID Default FINALGRADE FINALSCORE PREOVERRIDESCORE SUBJECTIVESCORE FINANCIALSCORE
10009011 0 8 67.65854557 67.65854557 68.36424313 60.2136826
10020003 0 7 72.18560889 72.18560889 70.97483009 64.35831722
10020003 0 6 77.23072833 77.23072833 69.87370952 71.53180821
10021201 0 14 40.21338437 40.21338437 58.06865599 40.54564338
10021201 0 8 68.79085151 68.79085151 72.59254723 58.91827403
10022730 0 4 84.47284986 84.47284986 78.03588557 77.85944161
10022731 0 5 78.28775535 78.28775535 82.07915713 64.45948626
10025555 0 15 7.907947702 7.907947702 57.95049201 4.075100629
10025555 0 13 1.75 47.15981982 72.56744037 39.16338519
10025763 0 15 66.39063143 66.39063143 79.10054245 52.66288527
10029315 1 14 40.36515221 40.36515221 57.9586825 40.78027744
10030999 0 17 25.78498104 25.78498104 84.37428799 16.36896422
10030999 0 13 47.90043592 47.90043592 78.97405559 36.28646008
10033303 0 10 58.50724135 58.50724135 74.95635833 47.05689989
10033938 0 15 32.79988473 37.79988473 45.90931406 43.84648718
10039393 1 8 67.31395864 67.31395864 74.81030489 55.26979858
10039780 0 9 64.94318991 69.94318991 69.44595762 62.06825469
10040777 0 13 44.93908421 44.93908421 81.83346015 32.38398138
10041213 0 15 33.05768436 33.05768436 73.75578861 27.6882957
10041213 0 15 35.39463308 35.39463308 73.75578861 28.95912606
10045566 1 8 70.60067856 70.60067856 70.87753432 61.88535995
10045566 0 10 58.50956434 58.50956434 70.87753432 49.89960356
10045692 0 12 50.52222802 50.52222802 50.91083454 52.10279587
10045692 0 10 59.17371704 59.17371704 57.49697166 57.37504351
10046390 1 10 60.47796914 60.47796914 67.94551866 52.29460738
10047830 0 12 51.46066369 51.46066369 79.14482394 39.16019407
10048824 0 13 50.86887099 50.86887099 65.6366083 46.18752406
10048824 0 12 49.82958553 49.82958553 60.56566557 47.97788939
10050504 0 8 67.47839481 67.47839481 72.53163793 58.4371572
10050504 0 7 73.7608865 73.7608865 69.49809267 67.26984194')
答案 1 :(得分:0)
您也可以尝试:
results <- outer(
1:ncol(df), 1:ncol(df),
Vectorize(
function (i,j) ks.test(df[,i], df[,j], alternative = c('two.sided'))$p.value
)
)
#Because the output is matrix we can convert it to data.frame and set the names of the columns
results <- as.data.frame(results)
names(results) <- names(df)
结果:
ID Default FINALGRADE
ID 1.000000e+00 1.871836e-13 1.871836e-13
Default 1.871836e-13 1.000000e+00 1.871836e-13
FINALGRADE 1.871836e-13 1.871836e-13 1.000000e+00
FINALSCORE 1.871836e-13 1.871836e-13 8.942624e-12
PREOVERRIDESCORE 1.871836e-13 1.871836e-13 1.337597e-12
SUBJECTIVESCORE 1.871836e-13 1.871836e-13 1.871836e-13
FINANCIALSCORE 1.871836e-13 1.871836e-13 8.942624e-12
FINALSCORE PREOVERRIDESCORE SUBJECTIVESCORE
ID 1.871836e-13 1.871836e-13 1.871836e-13
Default 1.871836e-13 1.871836e-13 1.871836e-13
FINALGRADE 8.942624e-12 1.337597e-12 1.871836e-13
FINALSCORE 1.000000e+00 9.999999e-01 1.106169e-03
PREOVERRIDESCORE 9.999999e-01 1.000000e+00 2.908301e-03
SUBJECTIVESCORE 1.106169e-03 2.908301e-03 1.000000e+00
FINANCIALSCORE 1.350035e-01 1.350035e-01 3.239194e-06
FINANCIALSCORE
ID 1.871836e-13
Default 1.871836e-13
FINALGRADE 8.942624e-12
FINALSCORE 1.350035e-01
PREOVERRIDESCORE 1.350035e-01
SUBJECTIVESCORE 3.239194e-06
答案 2 :(得分:0)
非常感谢@ r2evans和@DJV的回复。
感谢您能进一步解决我的问题。
也许我没有清楚地解释我的KS计算。实际上,我需要计算最大得分图的KS /默认(默认= 1)和良好(默认= 0)之间的得分差。
这就是为什么我有这部分代码来查看良好的发行版和默认发行版的原因。因此,基本上,我计算KS以查看FINALGRADE(默认= 1)与FINALGRADE(默认= 0)之间的差异。
y_S [i] <-defaultcnt_s / d
z_S [i] <-goodcnt_s / g
K_S <-abs(y_S [哪个.max(abs(y_S-z_S))]-z_S [哪个最大(abs(y_S-z_S))])
附上FINALGRADE KS的理论和简单说明,即好与坏的最大区别: