I want to use a simple t-test to generate a set of p-values, one per gene (one row per gene). My data frame looks like this:
   SampleID   Gene        val     Type
1     13366 GENE_A 3.15031629   Normal
2     13366 GENE_B 3.75717258   Normal
3     13366 GENE_C 3.57842994   Normal
4     13368 GENE_A 0.68801833 Affected
5     13368 GENE_B 2.78232529 Affected
6     13368 GENE_C 4.99150585 Affected
7     13370 GENE_A 3.22589363   Normal
8     13370 GENE_B 3.51548931   Normal
9     13370 GENE_C 3.93326487   Normal
10    34398 GENE_A 0.41194238 Affected
11    34398 GENE_B 3.23511072 Affected
12    34398 GENE_C 3.06637922 Affected
13    34400 GENE_A 3.26666659   Normal
14    34400 GENE_B 3.98581901   Normal
15    34400 GENE_C 3.94751765   Normal
16    34413 GENE_A 2.02822848 Affected
17    34413 GENE_B 2.97689035 Affected
18    34413 GENE_C 4.26453415 Affected
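For this data, the resulting data frame would have only 3 rows (one per gene), plus a column of p-values comparing the Normal and Affected values for each gene. Ideally I would like to do this with plyr. Any ideas or suggestions?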
Answer 0 (score: 1)
I think this is what you are looking for:
df <- read.csv(textConnection("SampleID,Gene,val,Type
1,13366,GENE_A,3.15031629,Normal
2,13366,GENE_B,3.75717258,Normal
3,13366,GENE_C,3.57842994,Normal
4,13368,GENE_A,0.68801833,Affected
5,13368,GENE_B,2.78232529,Affected
6,13368,GENE_C,4.99150585,Affected
7,13370,GENE_A,3.22589363,Normal
8,13370,GENE_B,3.51548931,Normal
9,13370,GENE_C,3.93326487,Normal
10,34398,GENE_A,0.41194238,Affected
11,34398,GENE_B,3.23511072,Affected
12,34398,GENE_C,3.06637922,Affected
13,34400,GENE_A,3.26666659,Normal
14,34400,GENE_B,3.98581901,Normal
15,34400,GENE_C,3.94751765,Normal
16,34413,GENE_A,2.02822848,Affected
17,34413,GENE_B,2.97689035,Affected
18,34413,GENE_C,4.26453415,Affected"))
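# ddply(), .() and summarize come from plyr, so load it first
library(plyr)

# Split df by Gene, then run a t-test (Normal vs. Affected) within each gene
# and keep only its p-value; the result has one row per gene
ddply(df,
      .(Gene),
      summarize,
      pval = t.test(val[Type == 'Normal'], val[Type == 'Affected'])$p.value)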
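
If plyr is not available, the same split-and-test idea can be sketched in base R. This is only an illustration, not part of the answer above: the names `pvals` and `result` are made up for the example, and the data frame is rebuilt by hand from the values in the question so the snippet runs on its own. Like the ddply call, it uses the default `t.test`, which is Welch's unequal-variance test.

```r
# Rebuild the question's data in long format (values copied from the question).
df <- data.frame(
  Gene = rep(c("GENE_A", "GENE_B", "GENE_C"), times = 6),
  val  = c(3.15031629, 3.75717258, 3.57842994,
           0.68801833, 2.78232529, 4.99150585,
           3.22589363, 3.51548931, 3.93326487,
           0.41194238, 3.23511072, 3.06637922,
           3.26666659, 3.98581901, 3.94751765,
           2.02822848, 2.97689035, 4.26453415),
  # Samples alternate Normal / Affected, three gene rows per sample.
  Type = rep(rep(c("Normal", "Affected"), each = 3), times = 3)
)

# Split the rows by gene and run one t-test per piece, keeping the p-value.
pvals <- sapply(split(df, df$Gene), function(d) {
  t.test(d$val[d$Type == "Normal"], d$val[d$Type == "Affected"])$p.value
})

# Collect into one row per gene (hypothetical output name).
result <- data.frame(Gene = names(pvals), pval = unname(pvals))
print(result)
```

With only three samples per group each test has very little power, so the p-values are best treated as a rough guide.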