Colleagues,
I want to estimate the likelihood that a randomly selected item from one group will score higher than a randomly selected item from another group. This is the probability of superiority, sometimes called the common-language effect size. See, for example: https://rpsychologist.com/d3/cohend/. It can be estimated parametrically if we accept that the data are normally distributed (McGraw and Wong, 1992, Psychological Bulletin, 111), but I know my data are not normally distributed, so such an estimate would be unreliable.
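For reference, a minimal sketch of that parametric estimate under the normality assumption (the helper name cl_normal is mine; the formula is the standard result for the difference of two independent normals):
# Parametric probability of superiority, assuming both groups are normal:
# if A ~ N(m1, s1^2) and B ~ N(m2, s2^2), then
# Pr(A > B) = pnorm((m1 - m2) / sqrt(s1^2 + s2^2))
cl_normal <- function(a, b) {
  pnorm((mean(a) - mean(b)) / sqrt(var(a) + var(b)))
}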
My workaround is simple brute force. Using the sample data, we compare each observation in one group with each observation in the other group, count how often a > b, and split any ties in half.
My first attempt was a For-If-Else loop, like this:
# Generate sample data
alpha <- as.integer(runif(2000, min = 0, max = 100))
beta  <- as.integer(runif(2000, min = 0, max = 100))
nPairs <- length(alpha) * length(beta)

# Initialise counters
bigger  <- 0
equal   <- 0
smaller <- 0

# Compare every pair of observations
for (i in alpha) {
  for (j in beta) {
    if (i > j) {
      bigger <- bigger + 1
    } else if (i < j) {
      smaller <- smaller + 1
    } else {
      equal <- equal + 0.5  # each tie counts as half
    }
  }
}

# Record Probability a > b
PaGTb <- (bigger + equal) / nPairs
This does the job, but it is painfully slow, especially in a Shiny app.
A colleague reminded me that R is vector-based and suggested the expand.grid() function. So my second attempt looked like this:
# Generate sample data
alpha <- as.integer(runif(2000, min = 0, max = 100))
beta  <- as.integer(runif(2000, min = 0, max = 100))
nPairs <- length(alpha) * length(beta)

# Pair each observation in alpha with each observation in beta
# (named ab rather than c, to avoid masking base::c)
ab <- expand.grid(a = alpha, b = beta)

# Count frequency of a > b, splitting ties in half
aGTb <- length(which(ab$a > ab$b))
aEQb <- length(which(ab$a == ab$b))
aGTb <- aGTb + (0.5 * aEQb)

# Record Probability a > b
PaGTb <- aGTb / nPairs
Noticeably faster!
But not faster in every case. Through simulation I found that for a large number of pairwise comparisons (millions) the vector-based approach is faster, but for relatively few comparisons (thousands) the For-If-Else approach wins. The table below summarises the results of 1000 replications of 3334 * 3000 = 10,002,000 pairwise comparisons on an iMac.
time_ForIfElse time_Vector Ratio_Vector/ForIfElse
Min. :0.3514 Min. :0.2835 Min. :0.2713
1st Qu.:0.3648 1st Qu.:0.2959 1st Qu.:0.7818
Median :0.3785 Median :0.3102 Median :0.8115
Mean :0.4037 Mean :0.3322 Mean :0.8309
3rd Qu.:0.4419 3rd Qu.:0.3678 3rd Qu.:0.8668
Max. :1.4124 Max. :0.5896 Max. :1.4726
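To reproduce a comparison like the one above, a minimal sketch with the microbenchmark package might look like this (the wrapper names p_loop and p_vector are mine):
library(microbenchmark)

# Wrap each approach so the two can be timed side by side
p_loop <- function(a, b) {
  bigger <- 0
  for (i in a) for (j in b) {
    if (i > j) bigger <- bigger + 1
    else if (i == j) bigger <- bigger + 0.5
  }
  bigger / (length(a) * length(b))
}

p_vector <- function(a, b) {
  g <- expand.grid(a = a, b = b)
  (sum(g$a > g$b) + 0.5 * sum(g$a == g$b)) / nrow(g)
}

microbenchmark(p_loop(alpha, beta), p_vector(alpha, beta), times = 10)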
So for the kind of data I am working with, the vector-based approach is about 20% faster than my original method. But I still suspect there may be a faster way to solve this problem.
Does the community have any ideas?
Answer 0 (score: 5)
Here is a complete base-R solution based on table() and inspired by @chinsoon12.
f_int_AUC <- function(a, b) {
  # Collapse each sample into its unique values and their frequencies
  A <- table(a)
  A_Freq  <- as.vector(A)
  A_alpha <- as.integer(names(A))
  B <- table(b)
  B_Freq <- as.vector(B)
  B_beta <- as.integer(names(B))
  bigger <- 0
  equal  <- 0
  # Compare frequency-weighted unique values instead of raw pairs
  for (i in seq_along(A_alpha)) {
    bigger <- bigger + sum(A_Freq[i] * B_Freq[A_alpha[i] > B_beta])
    equal  <- equal + 0.5 * sum(A_Freq[i] * B_Freq[A_alpha[i] == B_beta])
  }
  (bigger + equal) / (length(a) * length(b))
}
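As a quick sanity check, the function should reproduce the brute-force value on the question's sample data:
f_int_AUC(alpha, beta)  # should match PaGTb from the expand.grid version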
Using this as a function is important: on 4,000 rows it runs roughly 8 times faster than the same code as an uncompiled loop.
Unit: milliseconds
expr min lq mean median uq max neval
base_int_AUC_fx 1.0187 1.03865 1.146774 1.10930 1.13230 5.3215 100
base_int_AUC 8.3071 8.43380 9.309865 8.53855 8.88005 40.3327 100
Profiling the function showed that the table() calls were what made this solution slow. To address that, tabulate() is now used instead for a significant gain in speed. The drawback of tabulate() is that it does not name its bins, so the bins need to be normalised to positive integers (h/t @chinsoon12 for an extra 20% improvement):
f_int_AUC2 <- function(a, b) {
  # tabulate() does not count values <= 0, therefore we need to
  # normalise both samples to positive integers first.
  m <- min(c(a, b))
  A_Freq <- tabulate(a - m + 1)
  B_Freq <- tabulate(b - m + 1)
  # Outer product of the frequency vectors: cell [i, j] counts the
  # pairs whose a-value falls in bin i and b-value in bin j.
  out_matrix <- outer(A_Freq, B_Freq)
  bigger <- sum(out_matrix[lower.tri(out_matrix)])  # bins where a > b
  equal  <- 0.5 * sum(diag(out_matrix))             # ties count as half
  (bigger + equal) / length(a) / length(b)
}
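As a sketch of a correctness check, the two base functions should agree on integer data:
set.seed(42)
a <- as.integer(runif(4000, 0, 100))
b <- as.integer(runif(4000, 0, 100))
all.equal(f_int_AUC(a, b), f_int_AUC2(a, b))  # expect TRUE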
One thing to note is that approaches using table() or tabulate() make assumptions about the data and will not work well on floating-point numbers. Following @Aaron's suggestion, I looked at the wilcox.test code and simplified it for this problem. Note that I extracted the relevant code from the function: wilcox.test() is 437 lines of code, whereas this is only 4 lines, which gives a significant speed-up.
f_wilcox_AUC <- function(a, b) {
  # r <- data.table::frank(c(a, b))  # much faster drop-in replacement
  r <- rank(c(a, b))
  n.x <- as.integer(length(a))
  n.y <- as.integer(length(b))
  # From the STATISTIC calculation in wilcox.test; I am not a statistician --
  # see the reference at the end or Google "r wilcox.test code"
  (sum(r[seq_along(a)]) - n.x * (n.x + 1) / 2) / n.x / n.y
}
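As a sketch of a sanity check: wilcox.test() computes the same statistic before division, so the two should agree on the question's sample data.
W <- unname(suppressWarnings(wilcox.test(alpha, beta))$statistic)
all.equal(f_wilcox_AUC(alpha, beta), W / (length(alpha) * length(beta)))  # expect TRUE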
Performance:
4,000 x 4,000
Unit: microseconds
expr min lq mean median uq max neval
bigstats_prive 719 748 792 797 817 884 10
chinsoon_DT 5910 6050 6280 6210 6350 7180 10
base_int_AUC 1060 1070 2660 1130 1290 16300 10
base_int_AUC2 237 250 1430 256 266 12000 10
wilcox_base 1050 1050 1830 1060 1070 8790 10
wilcox_dt 500 524 2940 530 603 16800 10
wilcox_test 11300 11400 11900 11500 11700 13400 10
40,000 x 40,000
Unit: milliseconds
expr min lq mean median uq max neval
bigstats_prive 4.03 4.07 5.13 4.11 4.27 66.40 100
chinsoon_DT 6.62 7.01 7.88 7.44 7.75 19.20 100
base_int_AUC 4.53 4.60 5.96 4.68 4.81 93.40 100
base_int_AUC2 1.03 1.07 1.14 1.08 1.12 3.70 100
wilcox_base 13.70 13.80 14.10 13.80 14.00 26.50 100
wilcox_dt 1.87 2.00 2.38 2.11 2.23 6.25 100
wilcox_test 115.00 118.00 121.00 121.00 124.00 135.00 100
400,000 x 400,000
Unit: milliseconds
expr min lq mean median uq max neval
bigstats_prive 50.3 54.0 55.0 54.7 56.8 58.6 10
chinsoon_DT 19.8 20.9 22.6 22.7 23.7 27.3 10
base_int_AUC 43.6 45.3 53.3 47.8 49.6 108.0 10
base_int_AUC2 10.0 10.4 11.8 10.7 13.8 14.8 10
wilcox_base 252.0 254.0 271.0 258.0 293.0 316.0 10
wilcox_dt 19.4 20.8 22.1 22.6 23.6 24.2 10
wilcox_test 1440.0 1460.0 1480.0 1480.0 1500.0 1520.0 10
In total, 1,000 replications of a 4,000 x 4,000 comparison take about 210 ms with the base function:
# A tibble: 7 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 bigstats_prive 680.9us 692.8us 1425. 410.76KB 8.60 994 6 697.45ms <dbl [1]> <df[,3] [12 x 3]> <bch:tm> <tibble [1,000 x 3]>
2 chinsoon_DT 5.55ms 6.02ms 166. 539.92KB 4.78 972 28 5.85s <dbl [1]> <df[,3] [82 x 3]> <bch:tm> <tibble [1,000 x 3]>
3 base_int_AUC 981.8us 1.02ms 958. 750.44KB 10.7 989 11 1.03s <dbl [1]> <df[,3] [606 x 3]> <bch:tm> <tibble [1,000 x 3]>
4 base_int_AUC2 203.7us 209.3us 4743. 454.36KB 33.4 993 7 209.38ms <dbl [1]> <df[,3] [22 x 3]> <bch:tm> <tibble [1,000 x 3]>
5 wilcox_base 1.03ms 1.04ms 959. 359.91KB 4.82 995 5 1.04s <dbl [1]> <df[,3] [11 x 3]> <bch:tm> <tibble [1,000 x 3]>
6 wilcox_dt 410.4us 444.5us 2216. 189.09KB 6.67 997 3 450.01ms <dbl [1]> <df[,3] [9 x 3]> <bch:tm> <tibble [1,000 x 3]>
7 wilcox_test 11.35ms 11.44ms 85.6 1.16MB 1.66 981 19 11.45s <dbl [1]> <df[,3] [58 x 3]> <bch:tm> <tibble [1,000 x 3]>
Edit: for earlier, less efficient approaches using outer() and data.table::CJ(), see the previous revisions of this answer.
Reference: https://github.com/SurajGupta/r-source/blob/master/src/library/stats/R/wilcox.test.R
Answer 1 (score: 3)
It seems that what you want to compute is something like the Area Under the ROC Curve (AUC).
A fast way to get it is:
bigstatsr::AUC(c(alpha, beta), rep(1:0, c(length(alpha), length(beta))))
(Disclaimer: I am the author of {bigstatsr}.)
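Assuming {bigstatsr} is installed, a minimal sketch of how this maps onto the brute-force definition: the scores act as predictions and group membership as the 0/1 target.
x <- c(alpha, beta)
y <- rep(1:0, c(length(alpha), length(beta)))
bigstatsr::AUC(x, y)  # should match the brute-force PaGTb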
Answer 2 (score: 3)
This is exactly what the Wilcoxon test tests; the count of pairs in which one value is larger is precisely what appears in the test statistic, so to get the proportion you simply divide by the number of pairs:
wilcox.test(alpha, beta)$statistic / (length(alpha)*length(beta))
This may need adjusting for ties; I'd have to check the details.
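As a hedged sketch of that check: on integer data with heavy ties, the midrank-based statistic appears to already split ties in half, matching the brute-force definition.
a <- as.integer(runif(200, 0, 10))  # data with heavy ties
b <- as.integer(runif(200, 0, 10))
brute <- (sum(outer(a, b, ">")) + 0.5 * sum(outer(a, b, "=="))) / (length(a) * length(b))
suppressWarnings(wilcox.test(a, b))$statistic / (length(a) * length(b))  # equals brute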
Answer 3 (score: 2)
Here is another option using data.table, aggregating the values to counts before joining:
# Collapse each sample to unique values with their counts
DTa <- data.table(a = alpha)[, .N, keyby = a]
DTb <- data.table(b = beta)[, .N, keyby = b]
# The non-equi join counts pairs with b < a; the equi join counts ties (halved)
(DTb[DTa, on = .(b < a), allow.cartesian = TRUE, nomatch = 0L, sum(N * i.N)] +
  0.5 * DTb[DTa, on = .(b = a), allow.cartesian = TRUE, nomatch = 0L, sum(N * i.N)]) / nPairs
Timing code:
mtd0 <- function() {
  ab <- expand.grid(a = alpha, b = beta)
  aGTb <- length(which(ab$a > ab$b))
  aEQb <- length(which(ab$a == ab$b))
  aGTb <- aGTb + (0.5 * aEQb)
  aGTb / nPairs
}

mtd1 <- function() {
  CJ(a = alpha, b = beta, sorted = FALSE)[, {
    aGTb <- sum(a > b)
    aEQb <- sum(a == b)
    aGTb <- aGTb + 0.5 * aEQb
    PaGTb <- aGTb / nPairs
  }]
}

mtd2 <- function() {
  DTa <- data.table(a = alpha)[, .N, keyby = a]
  DTb <- data.table(b = beta)[, .N, keyby = b]
  (DTb[DTa, on = .(b < a), allow.cartesian = TRUE, nomatch = 0L, sum(N * i.N)] +
    0.5 * DTb[DTa, on = .(b = a), allow.cartesian = TRUE, nomatch = 0L, sum(N * i.N)]) / nPairs
}

bench::mark(mtd0(), mtd1(), mtd2())
Timings:
# A tibble: 3 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr total_time result memory time gc
<chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <bch:tm> <list> <list> <list> <list>
1 mtd0() 589.55ms 589.55ms 589.55ms 590ms 1.70 580MB 3 1 590ms <dbl [1]> <Rprofmem [20 x 3]> <bch:tm> <tibble [1 x 3]>
2 mtd1() 188.01ms 228.04ms 192.55ms 304ms 4.39 244MB 5 3 684ms <dbl [1]> <Rprofmem [15 x 3]> <bch:tm> <tibble [3 x 3]>
3 mtd2() 4.14ms 4.82ms 4.52ms 23ms 208. 541KB 1 104 501ms <dbl [1]> <Rprofmem [84 x 3]> <bch:tm> <tibble [104 x 3]>
Data:
library(data.table)
set.seed(0L)
nr <- 4e3
alpha <- as.integer(runif(nr, min = 0, max = 100))
beta <- as.integer(runif(nr, min = 0, max = 100))
nPairs <- length(alpha) * length(beta)