How to iterate over groups and factor combinations to test for differences in means?

Asked: 2016-08-16 14:49:11

Tags: r

I have the following data structure:

date <- as.Date(as.character( c("2015-02-13",
                            "2015-02-13",
                            "2015-02-13",
                            "2015-02-13",
                            "2015-02-13",
                            "2015-02-13",
                            "2015-02-13",
                            "2015-02-13",
                            "2015-02-13",
                            "2015-02-14",
                            "2015-02-14",
                            "2015-02-14",
                            "2015-02-14",
                            "2015-02-14",
                            "2015-02-14",
                            "2015-02-14",
                            "2015-02-14",
                            "2015-02-14",
                            "2015-02-15",
                            "2015-02-15",
                            "2015-02-15",
                            "2015-02-15",
                            "2015-02-15",
                            "2015-02-15",
                            "2015-02-15",
                            "2015-02-15",
                            "2015-02-15")))

name <- c("John","Michael","Thomas",
      "John","Michael","Thomas",
      "John","Michael","Thomas",
      "John","Michael","Thomas",
      "John","Michael","Thomas",
      "John","Michael","Thomas",
      "John","Michael","Thomas",
      "John","Michael","Thomas",
      "John","Michael","Thomas")

drinks <-c("Beer","Coffee","Tee", 
      "Tee","Beer", "Coffee",
      "Coffee","Tee","Beer",
      "Beer","Coffee","Tee", 
      "Tee","Beer", "Coffee",
      "Coffee","Tee","Beer",
      "Beer","Coffee","Tee", 
      "Tee","Beer", "Coffee",
      "Coffee","Tee","Beer")



consumed <- c(3,2,5,3,6,2,9,4,5,
          1,3,5,8,0,1,2,3,5,
          1,24,4,5,7,9,9,1,2)

version_1 <- data.frame(date,name,drinks,consumed)

My second data frame is almost identical, except for the consumed values:

consumed <- c(10,9,1,20,30,1,50,40,20,
          10,2,10,2,1,1,2,3,5,
          20,24,1,40,2,8,4,0,0)

version_2 <- data.frame(date,name,drinks,consumed)


version_1$version <- rep("one", nrow(version_1)) 
version_2$version <- rep("two", nrow(version_2)) 
all <- rbind(version_1, version_2)

all$version <- as.factor(all$version)

         date    name drinks consumed version
1  2015-02-13    John   Beer        3     one
2  2015-02-13 Michael Coffee        2     one
3  2015-02-13  Thomas    Tee        5     one
4  2015-02-13    John    Tee        3     one
5  2015-02-13 Michael   Beer        6     one
6  2015-02-13  Thomas Coffee        2     one
7  2015-02-13    John Coffee        9     one
8  2015-02-13 Michael    Tee        4     one
9  2015-02-13  Thomas   Beer        5     one
10 2015-02-14    John   Beer        1     one
11 2015-02-14 Michael Coffee        3     one
12 2015-02-14  Thomas    Tee        5     one
13 2015-02-14    John    Tee        8     one
14 2015-02-14 Michael   Beer        0     one
15 2015-02-14  Thomas Coffee        1     one
16 2015-02-14    John Coffee        2     one
17 2015-02-14 Michael    Tee        3     one
18 2015-02-14  Thomas   Beer        5     one
19 2015-02-15    John   Beer        1     one
20 2015-02-15 Michael Coffee       24     one
21 2015-02-15  Thomas    Tee        4     one
22 2015-02-15    John    Tee        5     one
23 2015-02-15 Michael   Beer        7     one
24 2015-02-15  Thomas Coffee        9     one
25 2015-02-15    John Coffee        9     one
26 2015-02-15 Michael    Tee        1     one
27 2015-02-15  Thomas   Beer        2     one
28 2015-02-13    John   Beer       10     two
29 2015-02-13 Michael Coffee        9     two
30 2015-02-13  Thomas    Tee        1     two
31 2015-02-13    John    Tee       20     two
32 2015-02-13 Michael   Beer       30     two
33 2015-02-13  Thomas Coffee        1     two
34 2015-02-13    John Coffee       50     two
35 2015-02-13 Michael    Tee       40     two
36 2015-02-13  Thomas   Beer       20     two
37 2015-02-14    John   Beer       10     two
38 2015-02-14 Michael Coffee        2     two
39 2015-02-14  Thomas    Tee       10     two
40 2015-02-14    John    Tee        2     two
41 2015-02-14 Michael   Beer        1     two
42 2015-02-14  Thomas Coffee        1     two
43 2015-02-14    John Coffee        2     two
44 2015-02-14 Michael    Tee        3     two
45 2015-02-14  Thomas   Beer        5     two
46 2015-02-15    John   Beer       20     two
47 2015-02-15 Michael Coffee       24     two
48 2015-02-15  Thomas    Tee        1     two
49 2015-02-15    John    Tee       40     two
50 2015-02-15 Michael   Beer        2     two
51 2015-02-15  Thomas Coffee        8     two
52 2015-02-15    John Coffee        4     two
53 2015-02-15 Michael    Tee        0     two
54 2015-02-15  Thomas   Beer        0     two

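I want to loop over the data frame and test for group differences (version one vs. version two). Each day has a unique combination of name and drink. So, for example, I want to test: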

2015-02-13 John Beer 3 one
2015-02-14 John Beer 1 one
2015-02-15 John Beer 1 one

against

2015-02-13 John Beer 10 two
2015-02-14 John Beer 10 two
2015-02-15 John Beer 20 two

and likewise for every date, name, and drink group pair.

I can't figure out how to achieve this:

for (i in 1:length(date)){ 
temp <- all[all$date==date[i],]

}

1 answer:

Answer 0 (score: 2)

Using data.table:

library(data.table)
setDT(all)

all[, t.test(consumed[version == "one"], consumed[version == "two"]), by = .(name,drinks)]
      name drinks  statistic parameter    p.value   conf.int  estimate null.value alternative                  method                                                 data.name
 1:    John   Beer -3.4320324  2.159744 0.06761534 -25.303554  1.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 2:    John   Beer -3.4320324  2.159744 0.06761534   1.970221 13.333333          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 3: Michael Coffee -0.2067737  3.960582 0.84638132 -28.960658  9.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 4: Michael Coffee -0.2067737  3.960582 0.84638132  24.960658 11.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 5:  Thomas    Tee  0.2208631  2.049375 0.84525800 -12.025434  4.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 6:  Thomas    Tee  0.2208631  2.049375 0.84525800  13.358768  4.000000          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 7:    John    Tee -1.3850647  2.070089 0.29640280 -61.453187  5.333333          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 8:    John    Tee -1.3850647  2.070089 0.29640280  30.786521 20.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
 9: Michael   Beer -0.6835859  2.210972 0.55885626 -45.015433  4.333333          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
10: Michael   Beer -0.6835859  2.210972 0.55885626  31.682100 11.000000          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
11:  Thomas Coffee  0.1942572  3.977345 0.85549254  -8.883193  4.000000          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
12:  Thomas Coffee  0.1942572  3.977345 0.85549254  10.216527  3.333333          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
13:    John Coffee -0.7570982  2.088564 0.52510317 -77.499374  6.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
14:    John Coffee -0.7570982  2.088564 0.52510317  53.499374 18.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
15: Michael    Tee -0.9049035  2.018804 0.46026242 -66.647341  2.666667          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
16: Michael    Tee -0.9049035  2.018804 0.46026242  43.314008 14.333333          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
17:  Thomas   Beer -0.7113284  2.110684 0.54726281 -29.270500  4.000000          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]
18:  Thomas   Beer -0.7113284  2.110684 0.54726281  20.603833  8.333333          0   two.sided Welch Two Sample t-test consumed[version == "one"] and consumed[version == "two"]

This runs t.test on the two samples (consumed[version == "one"], consumed[version == "two"]), separately for each group (by = .(name, drinks)).

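The result has two rows per group because the confidence interval (conf.int) and the estimate each contain two values. All other columns are recycled to match.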
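The recycling can be seen by checking the length of each component that t.test returns. The following is only a minimal sketch; it feeds in the John/Beer values from the data above by hand rather than going through the data.table:

```r
# John/Beer consumption for version "one" and version "two",
# copied by hand from the data shown in the question
tt <- t.test(c(3, 1, 1), c(10, 10, 20))

# An htest object is a list; conf.int and estimate have length 2,
# the remaining components have length 1
lengths(tt)
```

When data.table turns such a list into columns, the length-1 components are repeated to fill the two rows.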

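We can avoid the duplicated rows by wrapping the call in list(...), which stores each test result in the data.table as an element of a list column: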

result <- all[, .(ttest = list(t.test(consumed[version == "one"], consumed[version == "two"]))), by = .(name,drinks)]
result
      name drinks   ttest
1:    John   Beer <htest>
2: Michael Coffee <htest>
3:  Thomas    Tee <htest>
4:    John    Tee <htest>
5: Michael   Beer <htest>
6:  Thomas Coffee <htest>
7:    John Coffee <htest>
8: Michael    Tee <htest>
9:  Thomas   Beer <htest>
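If only a few numbers are needed from each test, they can be pulled out of the list column with sapply. The sketch below rebuilds the question's data in compact form so that it runs on its own; the names dt, one and two are illustrative and not part of the original code:

```r
library(data.table)

# Rebuild the question's data compactly (same values, same row order)
one <- c(3,2,5,3,6,2,9,4,5, 1,3,5,8,0,1,2,3,5, 1,24,4,5,7,9,9,1,2)
two <- c(10,9,1,20,30,1,50,40,20, 10,2,10,2,1,1,2,3,5, 20,24,1,40,2,8,4,0,0)
base <- data.table(
  date   = as.Date(rep(c("2015-02-13", "2015-02-14", "2015-02-15"), each = 9)),
  name   = rep(c("John", "Michael", "Thomas"), times = 9),
  drinks = rep(c("Beer", "Coffee", "Tee", "Tee", "Beer", "Coffee",
                 "Coffee", "Tee", "Beer"), times = 3)
)
dt <- rbind(cbind(base, consumed = one, version = "one"),
            cbind(base, consumed = two, version = "two"))

# One htest object per name/drink group, stored in a list column
result <- dt[, .(ttest = list(t.test(consumed[version == "one"],
                                     consumed[version == "two"]))),
             by = .(name, drinks)]

# Extract scalar pieces of each htest into ordinary columns
result[, statistic := sapply(ttest, function(tt) unname(tt$statistic))]
result[, p.value   := sapply(ttest, function(tt) tt$p.value)]

print(result[, .(name, drinks, statistic, p.value)])
```

This keeps the full test objects available while still giving one sortable, filterable row per group.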

We can then retrieve an individual result with:

result[name == "John" & drinks == "Beer", ttest]
[[1]]

    Welch Two Sample t-test

data:  consumed[version == "one"] and consumed[version == "two"]
t = -3.432, df = 2.1597, p-value = 0.06762
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -25.303554   1.970221
sample estimates:
mean of x mean of y 
 1.666667 13.333333
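Another option, if a flat table is preferred over a list column, is to name the scalar pieces explicitly inside j so that each group yields exactly one row. This is a sketch under the same assumptions as the answer's approach (Welch two-sample test per name/drink group); the data is rebuilt compactly so the block runs on its own, and the names dt, flat, mean_one, mean_two, conf_low and conf_high are illustrative only:

```r
library(data.table)

# Rebuild the question's data compactly (same values, same row order)
one <- c(3,2,5,3,6,2,9,4,5, 1,3,5,8,0,1,2,3,5, 1,24,4,5,7,9,9,1,2)
two <- c(10,9,1,20,30,1,50,40,20, 10,2,10,2,1,1,2,3,5, 20,24,1,40,2,8,4,0,0)
base <- data.table(
  date   = as.Date(rep(c("2015-02-13", "2015-02-14", "2015-02-15"), each = 9)),
  name   = rep(c("John", "Michael", "Thomas"), times = 9),
  drinks = rep(c("Beer", "Coffee", "Tee", "Tee", "Beer", "Coffee",
                 "Coffee", "Tee", "Beer"), times = 3)
)
dt <- rbind(cbind(base, consumed = one, version = "one"),
            cbind(base, consumed = two, version = "two"))

# Run the test once per group and return only length-1 values,
# so nothing gets recycled into a second row
flat <- dt[, {
  tt <- t.test(consumed[version == "one"], consumed[version == "two"])
  .(mean_one  = unname(tt$estimate[1]),
    mean_two  = unname(tt$estimate[2]),
    conf_low  = tt$conf.int[1],
    conf_high = tt$conf.int[2],
    statistic = unname(tt$statistic),
    df        = unname(tt$parameter),
    p.value   = tt$p.value)
}, by = .(name, drinks)]

print(flat)
```

The trade-off is that the full htest object is no longer kept, so anything not listed in j would have to be added as another column.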