我的数据如下:
df1 <- read.table(text = "A1 A2 A3 A4 B1 B2 B3 B4
1 2 4 12 33 17 77 69
34 20 59 21 90 20 43 44
11 16 23 24 19 12 55 98
29 111 335 34 61 88 110 320
51 58 45 39 55 87 55 89", stringsAsFactors = FALSE, header = TRUE, row.names=c("N1","N2","N3","N4","N5"))
我想逐行比较A
和B
之间的值。首先,我要测试A
和B
的分布是否为正态分布。其次,我将测试ks.test
和A
之间的差异是否不同。对于非正态分布结果(p ks.test <0.05),我将通过B
进行wilcox测试。对于正态分布结果,我将通过用var.test
将它们分为相等和不相等的方差ttest来进行ttest。最后,我将所有结果结合起来。
我要做的是,首先,我建立了wilcox.test
,t.test
,ks.test
和五个var.test
两个功能:
wilcox.test
然后,我计算了t.test
和kstest<-function(df, grp1, grp2) {
x = df[grp1]
y = df[grp2]
x = as.numeric(x)
y = as.numeric(y)
results = ks.test(x,y,alternative = c("two.sided"))
results$p.value
}
vartest<-function(df, grp1, grp2) {
x = df[grp1]
y = df[grp2]
x = as.numeric(x)
y = as.numeric(y)
results = var.test(x,y,alternative = c("two.sided"))
results$p.value
}
wilcox<-function(df, grp1, grp2) {
x = df[grp1]
y = df[grp2]
x = as.numeric(x)
y = as.numeric(y)
results = wilcox.test(x,y,alternative = c("two.sided"))
results$p.value
}
ttest_equal<-function(df, grp1, grp2) {
x = df[grp1]
y = df[grp2]
x = as.numeric(x)
y = as.numeric(y)
results = t.test(x,y,alternative = c("two.sided"),var.equal = TRUE)
results$p.value
}
ttest_unequal<-function(df, grp1, grp2) {
x = df[grp1]
y = df[grp2]
x = as.numeric(x)
y = as.numeric(y)
results = t.test(x,y,alternative = c("two.sided"),var.equal = FALSE)
results$p.value
}
的p值以对数据进行子集设置:
ks.test
然后我按照上面的描述对数据进行子集处理:
var.test
最后,我计算与新数据框对应的测试,然后合并结果
ks_AB<-apply(df1,1,kstest,grp1=grepl("^A",colnames(df1)),grp2=grepl("^B",colnames(df1)))
ks_AB
[1] 0.02857143 0.69937420 0.77142857 0.77142857 0.21055163
var_AB<-apply(df1,1,vartest,grp1=grepl("^A",colnames(df1)),grp2=grepl("^B",colnames(df1)))
var_AB
[1] 0.01700168 0.45132827 0.01224175 0.76109048 0.19561742
df1$ks_AB<-ks_AB
df1$var_AB<-var_AB
我知道我的代码很繁琐又愚蠢,但是它对我的数据非常有效。我现在想知道我是否将上面的代码组合到if else函数的新的类似于决策树的函数中,该函数将是:
答案 0 :(得分:1)
这应该可以解决问题:
FOO <- function(df, grp1, grp2){
# perform rowwise kolmogorov smirnov test
ks_AB <- apply(df, 1, kstest, grp1 = grp1, grp2 = grp2)
# subset data by significance of ks test
sub1 <- df[ks_AB < .05, ]
sub2 <- df[ks_AB >= .05, ]
if(nrow(sub1) > 0){
# perform wilcoxon rank sum test on non-normally distributed data
wilc_AB <- apply(sub1, 1, wilcox, grp1 = grp1, grp2 = grp2)
}
if(nrow(sub2) > 0){
# perform f test on normally distributed data
var_AB <- apply(sub2, 1, vartest, grp1 = grp1, grp2 = grp2)
# subset data by significance of f test
varsub1 <- sub2[var_AB < .05, ]
varsub2 <- sub2[var_AB >= .05, ]
if(nrow(varsub1) > 0){
# perform t test with unequal variance on subset with unequal variance
t_uneq_AB <- apply(varsub1, 1, ttest_unequal, grp1 = grp1, grp2 = grp2)
}
if(nrow(varsub2) > 0){
# perform t test with equal variance on subset with equal variance
t_eq_AB <- apply(varsub2, 1, ttest_equal, grp1 = grp1, grp2 = grp2)
}
}
# put together output dataframe
df$ks_AB <- ks_AB
if(exists("var_AB")){
df$var_AB <- NA
df$var_AB[row.names(df) %in% names(var_AB)] <- var_AB
}
df$pvalue <- NA
if(exists("wilc_AB")){
df$pvalue[row.names(df) %in% names(wilc_AB)] <- wilc_AB
}
if(exists("t_uneq_AB")){
df$pvalue[row.names(df) %in% names(t_uneq_AB)] <- t_uneq_AB
}
if(exists("t_eq_AB")){
df$pvalue[row.names(df) %in% names(t_eq_AB)] <- t_eq_AB
}
# return output
return(df)
}
应用于您的数据示例:
> FOO(df1, grepl("^A",colnames(df1)), grp2=grepl("^B",colnames(df1)))
A1 A2 A3 A4 B1 B2 B3 B4 ks_AB var_AB pvalue
N1 1 2 4 12 33 17 77 69 0.02857143 NA 0.02857143
N2 34 20 59 21 90 20 43 44 0.69937420 0.45132827 0.39648631
N3 11 16 23 24 19 12 55 98 0.77142857 0.01224175 0.25822839
N4 29 111 335 34 61 88 110 320 0.77142857 0.76109048 0.85703939
N5 51 58 45 39 55 87 55 89 0.21055163 0.19561742 0.06610608
Warning messages:
1: In ks.test(x, y, alternative = c("two.sided")) :
cannot compute exact p-value with ties
2: In ks.test(x, y, alternative = c("two.sided")) :
cannot compute exact p-value with ties
您可以在函数中使用suppressWarnings()
禁止显示这些警告,但我希望将它们显示出来,以便您知道何时测试可能不准确。