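I currently have a data frame describing which gene clusters occur in which genomes. It is stored as a well-formed tab-delimited file and looks essentially like this (example):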
Gene Cluster    Genome
------------------------------------------
GCF3372         Streptomyces_hygroscopicus
GCF3450         Streptomyces_sp_Hm1069
GCF3371         Streptomyces_sp_MBT13
GCF3371         Streptomyces_xiamenensis
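Based on this, I would like to build a presence/absence table (a contingency table) from the data frame, with values of 1 and 0 depending on whether a given gene cluster is present in a given genome. The goal is to measure how often particular gene clusters occur across genomes, so I need a presence/absence matrix on which I can run statistical analyses.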
x <- data.frame(gc = c('GCF3372','GCF3450','GCF3371','GCF3371','GCF3371'),
strain = c('Streptomyces_hygroscopicus', 'Streptomyces_sp_Hm1069',
'Streptomyces_sp_MBT13', 'Streptomyces_xiamenensis','Streptomyces_hygroscopicus'))
dput(head(x[, c(1,2)]))
Answer 0 (score: 0)
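Here is one way to compute a contingency table from two categorical variables. For illustration I will use Sex and Height (structurally, they seem similar to the two variables in your data frame x):
Data: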
set.seed(300)
df <- data.frame(
  Height = sample(c("tall", "very tall", "small", "very small"), 20, replace = TRUE),
  Sex = sample(c("m", "f"), 20, replace = TRUE)
)
df
Height Sex
1 very tall f
2 very tall m
3 very tall m
4 tall f
5 very small m
6 tall f
7 tall m
8 very small f
9 small f
10 tall m
11 very small f
12 tall m
13 very small m
14 small f
15 very small m
16 small m
17 very small m
18 very small m
19 tall f
20 tall m
First, as already mentioned in the comments, tabulate the data with table:
tbl <- table(df$Sex, df$Height); tbl
small tall very small very tall
f 2 3 2 1
m 1 4 5 2
You can then store the first row of tbl in a new vector, female, and the second row in male:
female <- tbl[1,]
male <- tbl[2,]
Finally, row-bind the two vectors into the matrix counts, which is your contingency table:
counts <- rbind(female, male)
counts
small tall very small very tall
female 2 3 2 1
male 1 4 5 2
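Extracting rows one at a time gets tedious once a variable has many levels. As a rough sketch of a one-step alternative, the table can be stripped of its class and its rows relabelled. The small data frame below is made up for the example and is not the sampled data above, and the name counts_manual is introduced here only for comparison:

```r
# Small, hand-written illustration data (hypothetical, not the sampled df above)
df <- data.frame(
  Height = c("tall", "small", "tall", "very tall", "small", "tall"),
  Sex    = c("f", "m", "m", "f", "f", "m")
)
tbl <- table(df$Sex, df$Height)

# Row-by-row construction, following the steps described above
counts_manual <- rbind(female = tbl[1, ], male = tbl[2, ])

# One-step alternative: drop the "table" class and relabel the rows.
# Rows of a table follow the sorted levels, so "f" comes before "m".
counts <- unclass(tbl)
rownames(counts) <- c("female", "male")

counts
```

Both objects hold the same counts; the second form simply avoids naming each row by hand.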
Based on the contingency table you can then run a test, for example a chi-squared test:
test <- chisq.test(counts); test
Pearson's Chi-squared test
data: counts
X-squared = 1.3492, df = 3, p-value = 0.7175
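Applied to the data frame x from the question, the same table() approach might look like the sketch below. It assumes that "present" means "occurs at least once" and that genomes should form the rows. The names counts_x and presence are illustrative only:

```r
# Data frame from the question
x <- data.frame(
  gc = c("GCF3372", "GCF3450", "GCF3371", "GCF3371", "GCF3371"),
  strain = c("Streptomyces_hygroscopicus", "Streptomyces_sp_Hm1069",
             "Streptomyces_sp_MBT13", "Streptomyces_xiamenensis",
             "Streptomyces_hygroscopicus")
)

# Count how often each gene cluster occurs in each genome
counts_x <- table(x$strain, x$gc)

# Collapse the counts to presence (1) / absence (0),
# keeping the genome and gene-cluster labels
presence <- matrix(as.integer(counts_x > 0),
                   nrow = nrow(counts_x),
                   dimnames = dimnames(counts_x))

presence
```

The result is an ordinary integer matrix with one row per genome and one column per gene cluster, so it can be passed on to whatever statistical routine is needed.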