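I want to use the R package plyr to run paired t-tests on a very large data frame, but I'm not sure how to go about it. I recently learned how to compute correlations with plyr, and I really like how you specify the groups you want to compare and plyr splits the data for you. For example, you can have plyr compute the correlation between sepal length and sepal width for each species in the iris dataset like this: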
Correlations <- ddply(iris, "Species", function(x) cor(x$Sepal.Length, x$Sepal.Width))
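I could split the data frame myself by specifying that the data for the setosa species are in rows 1:50 and so on, but plyr is less likely than I am to slip up by, for example, accidentally specifying rows 1:51.
So how do I do something similar with a paired t-test? How do I specify which observations form a pair? Here is some sample data similar to what I'm working with. I want the pairs to be defined by Subject, and I want to split the data by Pesticide: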
Exposure <- data.frame("Subject" = rep(1:4, 6),
"Season" = rep(c(rep("summer", 4), rep("winter", 4)),3),
"Pesticide" = rep(c("atrazine", "metolachlor", "chlorpyrifos"), each=8),
"Exposure" = sample(1:100, size=24))
Exposure$Subject <- as.factor(Exposure$Subject)
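In other words, the question I want to answer is whether each subject's pesticide exposure differs between winter and summer, and I want to answer that question separately for each of the three pesticides.
Thanks very much in advance!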
Edit: To clarify, this is how to run an unpaired t-test with plyr:
TTests <- dlply(Exposure, "Pesticide", function(x) t.test(x$Exposure ~ x$Season))
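If I add "paired = T" there, plyr will run a paired t-test, but it assumes that the pairs always appear in the same order. They do all appear in the same order in the sample data frame above, but they won't in my real data because I sometimes have missing observations.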
Answer 0 (score: 2)
Is this what you want?
library(data.table)
# convert to data.table in place
setDT(Exposure)
# make sure data is sorted correctly
setkey(Exposure, Pesticide, Season, Subject)
# inside j, Exposure and Season refer to the columns of the current Pesticide group
Exposure[, list(res = list(t.test(Exposure[Season == "summer"],
                                  Exposure[Season == "winter"],
                                  paired = TRUE))),
         by = Pesticide]$res
#[[1]]
#
# Paired t-test
#
#data: Exposure[Season == "summer"] and Exposure[Season == "winter"]
#t = -4.1295, df = 3, p-value = 0.02576
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -31.871962 -4.128038
#sample estimates:
#mean of the differences
# -18
#
#
#[[2]]
#
# Paired t-test
#
#data: Exposure[Season == "summer"] and Exposure[Season == "winter"]
#t = -6.458, df = 3, p-value = 0.007532
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -73.89299 -25.10701
#sample estimates:
#mean of the differences
# -49.5
#
#
#[[3]]
#
# Paired t-test
#
#data: Exposure[Season == "summer"] and Exposure[Season == "winter"]
#t = -2.5162, df = 3, p-value = 0.08646
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -30.008282 3.508282
#sample estimates:
#mean of the differences
# -13.25
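The approach above pairs observations by position after sorting, so it relies on every subject having both a summer and a winter value. When some observations are missing, one possible variation is to reshape each pesticide's data so that every subject has a summer column and a winter column, and then test those columns. The following is only a sketch under assumptions: the data set is a hypothetical copy of the question's example with one row removed to mimic a missing observation, and the names `wide` and `res` are illustrative.

```r
library(data.table)

# Hypothetical example data with the same layout as the question's data frame
set.seed(1)
Exposure <- data.table(
  Subject   = factor(rep(1:4, 6)),
  Season    = rep(rep(c("summer", "winter"), each = 4), 3),
  Pesticide = rep(c("atrazine", "metolachlor", "chlorpyrifos"), each = 8),
  Exposure  = sample(1:100, size = 24)
)

# Drop one row to mimic missing data (subject 3, summer, atrazine)
Exposure <- Exposure[-3]

# One row per Pesticide/Subject and one column per season;
# a season with no measurement becomes NA
wide <- dcast(Exposure, Pesticide + Subject ~ Season, value.var = "Exposure")

# With paired = TRUE, t.test() keeps only the complete pairs,
# so subjects missing a season are left out of that pesticide's test
res <- wide[, list(res = list(t.test(summer, winter, paired = TRUE))),
            by = Pesticide]
res$res
```

Because the pairing is done by the Subject column rather than by row position, the row order of the input no longer matters.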
Answer 1 (score: 0)
I don't know ddply, but here is how I would do it using base functions.
by(data = Exposure, INDICES = Exposure$Pesticide, FUN = function(x) {
t.test(Exposure ~ Season, data = x)
})
Exposure$Pesticide: atrazine
Welch Two Sample t-test
data: Exposure by Season
t = -0.1468, df = 5.494, p-value = 0.8885
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-49.63477 44.13477
sample estimates:
mean in group summer mean in group winter
60.50 63.25
----------------------------------------------------------------------------------------------
Exposure$Pesticide: chlorpyrifos
Welch Two Sample t-test
data: Exposure by Season
t = -0.8932, df = 4.704, p-value = 0.4151
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-83.58274 41.08274
sample estimates:
mean in group summer mean in group winter
52.25 73.50
----------------------------------------------------------------------------------------------
Exposure$Pesticide: metolachlor
Welch Two Sample t-test
data: Exposure by Season
t = 0.8602, df = 5.561, p-value = 0.4252
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-39.8993 81.8993
sample estimates:
mean in group summer mean in group winter
62.5 41.5
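The output above comes from Welch two-sample tests, which treat the summer and winter values as independent groups. For the paired comparison the question asks about, a base-R sketch could match the two seasons by Subject with merge() before calling t.test(). The helper name `paired_by_subject` and the example data (the question's layout with one row removed to mimic a missing observation) are assumptions made for illustration only.

```r
# Hypothetical example data with the same layout as the question's data frame
set.seed(1)
Exposure <- data.frame(
  Subject   = factor(rep(1:4, 6)),
  Season    = rep(rep(c("summer", "winter"), each = 4), 3),
  Pesticide = rep(c("atrazine", "metolachlor", "chlorpyrifos"), each = 8),
  Exposure  = sample(1:100, size = 24)
)

# Drop one row to mimic missing data (subject 3, summer, atrazine)
Exposure <- Exposure[-3, ]

# Hypothetical helper: pair summer and winter values by Subject, then test
paired_by_subject <- function(x) {
  summer <- x[x$Season == "summer", c("Subject", "Exposure")]
  winter <- x[x$Season == "winter", c("Subject", "Exposure")]
  # merge() performs an inner join, so only subjects measured
  # in both seasons are kept, regardless of row order
  pairs <- merge(summer, winter, by = "Subject",
                 suffixes = c(".summer", ".winter"))
  t.test(pairs$Exposure.summer, pairs$Exposure.winter, paired = TRUE)
}

paired_tests <- by(data = Exposure, INDICES = Exposure$Pesticide,
                   FUN = paired_by_subject)
paired_tests
```

Since the helper takes a single data frame, it should also be usable with plyr in the style of the question, for example dlply(Exposure, "Pesticide", paired_by_subject), though that call is not run here.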