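I have two groups of patients measured on the same genes, but the groups have different dimensions: the number of patients (samples) differs. The samples within each group are biological replicates.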
sample1 <- structure(c(3.990406045, 4.041745321, 4.002401404, 4.031463584,
4.046886189, 4.00582865, 3.985265177, 3.9869788, 3.995546913,
4.00582865, 11.75549075, 11.81394311, 11.81826206, 11.76013913,
11.8408451, 11.83619671, 11.72858876, 11.73755609, 11.78239274,
11.83619671, 8.647734791, 8.606480387, 8.64648886, 8.607548328,
8.605946416, 8.646132879, 8.648268762, 8.648090771, 8.647200821,
8.646132879, 5.359884744, 5.371302287, 5.37638989, 5.357155019,
5.378375921, 5.381105646, 5.35281111, 5.355168988, 5.366958378,
5.381105646, 8.805045323, 8.684889613, 8.794736874, 8.693725426,
8.680471706, 8.791791603, 8.80946323, 8.807990594, 8.800627416,
8.791791603, 10.87587031, 10.85539252, 10.87095037, 10.85960961,
10.85328398, 10.86954467, 10.87797885, 10.877276, 10.87376176,
10.86954467, 5.505422817, 5.530799682, 5.631682175, 5.422577376,
5.584910836, 5.667756277, 5.451311664, 5.469348715, 5.559533971,
5.667756277), .Dim = c(10L, 7L), .Dimnames = list(c("patient1",
"patient2", "patient3", "patient4",
"patient5", "patient6", "patient7",
"patient8", "patient9", "patient10"
), c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene7"
)))
and
sample2 <- structure(c(3.990406045, 4.041745321, 4.002401404, 4.031463584,
4.046886189, 4.00582865, 3.985265177, 3.9869788, 11.75549075,
11.81394311, 11.81826206, 11.76013913, 11.8408451, 11.83619671,
11.72858876, 11.73755609, 8.647734791, 8.606480387, 8.64648886,
8.607548328, 8.605946416, 8.646132879, 8.648268762, 8.648090771,
5.359884744, 5.371302287, 5.37638989, 5.357155019, 5.378375921,
5.381105646, 5.35281111, 5.355168988, 8.805045323, 8.684889613,
8.794736874, 8.693725426, 8.680471706, 8.791791603, 8.80946323,
8.807990594, 10.87587031, 10.85539252, 10.87095037, 10.85960961,
10.85328398, 10.86954467, 10.87797885, 10.877276, 5.505422817,
5.530799682, 5.631682175, 5.422577376, 5.584910836, 5.667756277,
5.451311664, 5.469348715), .Dim = c(8L, 7L), .Dimnames = list(
c("patient1",
"patient2", "patient3", "patient4",
"patient5", "patient6", "patient7",
"patient8"), c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene7")))
Now I want to check the correlation between pairs of genes across the two groups:
library(Hmisc)  # provides rcorr
rcorr(sample1, sample2, type="s")  # Spearman
I get:
Error in cbind(x, y) : number of rows of matrices must match (see arg 2)
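Transposing the matrices with t(sample) makes the call run, but it gives the correlation between pairs of patients. I need the correlation between pairs of genes (as in the output below). Is something wrong? Are there statistical points I should consider?
When the number of patients is equal, I get this, with duplicated blocks: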
> rcorr(sample1[1:8, ], sample2[1:8,], type="s")
gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene1 gene2 gene3 gene4 gene5 gene6 gene7
gene1 1.00 0.81 -1.00 0.67 -1.00 -1.00 0.36 1.00 0.81 -1.00 0.67 -1.00 -1.00 0.36
gene2 0.81 1.00 -0.81 0.95 -0.81 -0.81 0.79 0.81 1.00 -0.81 0.95 -0.81 -0.81 0.79
gene3 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36
gene4 0.67 0.95 -0.67 1.00 -0.67 -0.67 0.90 0.67 0.95 -0.67 1.00 -0.67 -0.67 0.90
gene5 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36
gene6 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36
gene7 0.36 0.79 -0.36 0.90 -0.36 -0.36 1.00 0.36 0.79 -0.36 0.90 -0.36 -0.36 1.00
gene1 1.00 0.81 -1.00 0.67 -1.00 -1.00 0.36 1.00 0.81 -1.00 0.67 -1.00 -1.00 0.36
gene2 0.81 1.00 -0.81 0.95 -0.81 -0.81 0.79 0.81 1.00 -0.81 0.95 -0.81 -0.81 0.79
gene3 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36
gene4 0.67 0.95 -0.67 1.00 -0.67 -0.67 0.90 0.67 0.95 -0.67 1.00 -0.67 -0.67 0.90
gene5 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36
gene6 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36
gene7 0.36 0.79 -0.36 0.90 -0.36 -0.36 1.00 0.36 0.79 -0.36 0.90 -0.36 -0.36 1.00
As you can see, the matrix contains duplicated blocks. Why?
Answer 0 (score: 2)
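Regarding the last part of the question: rcorr binds the matrices sample1 and sample2 column-wise and computes the rank correlation coefficients on the combined matrix. If you give the genes in sample1 and sample2 different names, for example: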
colnames(sample1) <- sprintf('sample1.%s',colnames(sample1))
colnames(sample2) <- sprintf('sample2.%s',colnames(sample2))
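you will see that you get a block matrix: the diagonal blocks hold the coefficients within each sample (sample1-sample1 and sample2-sample2), and the off-diagonal blocks hold the coefficients between sample1 and sample2.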
rcorr(sample1[1:8,],sample2[1:8,],type='s')
sample1.gene1 sample1.gene2 sample1.gene3 sample1.gene4 sample1.gene5 sample1.gene6 sample1.gene7 sample2.gene1 sample2.gene2 sample2.gene3 sample2.gene4
sample1.gene1 1.00 0.81 -1.00 0.67 -1.00 -1.00 0.36 1.00 0.81 -1.00 0.67
sample1.gene2 0.81 1.00 -0.81 0.95 -0.81 -0.81 0.79 0.81 1.00 -0.81 0.95
sample1.gene3 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67
sample1.gene4 0.67 0.95 -0.67 1.00 -0.67 -0.67 0.90 0.67 0.95 -0.67 1.00
sample1.gene5 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67
sample1.gene6 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67
sample1.gene7 0.36 0.79 -0.36 0.90 -0.36 -0.36 1.00 0.36 0.79 -0.36 0.90
sample2.gene1 1.00 0.81 -1.00 0.67 -1.00 -1.00 0.36 1.00 0.81 -1.00 0.67
sample2.gene2 0.81 1.00 -0.81 0.95 -0.81 -0.81 0.79 0.81 1.00 -0.81 0.95
sample2.gene3 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67
sample2.gene4 0.67 0.95 -0.67 1.00 -0.67 -0.67 0.90 0.67 0.95 -0.67 1.00
sample2.gene5 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67
sample2.gene6 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67
sample2.gene7 0.36 0.79 -0.36 0.90 -0.36 -0.36 1.00 0.36 0.79 -0.36 0.90
sample2.gene5 sample2.gene6 sample2.gene7
sample1.gene1 -1.00 -1.00 0.36
sample1.gene2 -0.81 -0.81 0.79
sample1.gene3 1.00 1.00 -0.36
sample1.gene4 -0.67 -0.67 0.90
sample1.gene5 1.00 1.00 -0.36
sample1.gene6 1.00 1.00 -0.36
sample1.gene7 -0.36 -0.36 1.00
sample2.gene1 -1.00 -1.00 0.36
sample2.gene2 -0.81 -0.81 0.79
sample2.gene3 1.00 1.00 -0.36
sample2.gene4 -0.67 -0.67 0.90
sample2.gene5 1.00 1.00 -0.36
sample2.gene6 1.00 1.00 -0.36
sample2.gene7 -0.36 -0.36 1.00
In your example, the first 8 rows of sample1 and sample2 are identical, which is why all the blocks are equal.
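The block structure can be reproduced with base R alone. The following is a minimal sketch on hypothetical toy data (the names s1, s2 and the 3-gene layout are my own, not from the question); it assumes no missing values and uses cor on the column-bound matrix rather than rcorr itself.

```r
# Hypothetical toy data: 8 patients x 3 genes (not the question's data).
set.seed(1)
s1 <- matrix(rnorm(24), nrow = 8,
             dimnames = list(NULL, paste0("sample1.gene", 1:3)))
s2 <- s1  # identical values, mimicking the identical rows in the question
colnames(s2) <- paste0("sample2.gene", 1:3)

# Spearman correlations of the column-bound matrix: a 6 x 6 block matrix.
full <- cor(cbind(s1, s2), method = "spearman")

within1 <- unname(full[1:3, 1:3])  # sample1-sample1 block
within2 <- unname(full[4:6, 4:6])  # sample2-sample2 block
between <- unname(full[1:3, 4:6])  # sample1-sample2 block

# With identical inputs, every block holds the same coefficients.
all.equal(within1, between)
all.equal(within1, within2)
```

If s2 held different measurements, the three blocks would generally differ, and only the off-diagonal block would answer the between-group question.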
Update:
The sample1-sample2 correlations can be computed with the cor function:
library(reshape2)
# produce all combinations of column indices for sample1 and sample2
z <- expand.grid(s1=1:7,s2=1:7)
# the two samples are identical here, so the cross-correlation matrix is symmetric
# and one triangle is enough; for samples that differ, skip this filter and keep all pairs
z <- z[z$s2<z$s1,]
# calculate the Spearman correlation for each remaining gene pair
z$corr <- mapply(function(i,j) cor(sample1[1:8,i],sample2[1:8,j],method='spearman'),z$s1,z$s2)
# reshape the result into an upper-right triangular matrix
corr.coefs <- dcast(z,s2~s1,value.var='corr')
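As a shorter alternative, cor also accepts two matrices and returns the full cross-correlation matrix in one call. A minimal sketch on hypothetical toy data follows (g1 and g2 are illustrative names, not from the question); it shows that each entry matches the pairwise call used above.

```r
# Hypothetical toy data: two groups of 8 patients measured on 4 genes.
set.seed(42)
g1 <- matrix(rnorm(8 * 4), nrow = 8,
             dimnames = list(paste0("patient", 1:8), paste0("gene", 1:4)))
g2 <- matrix(rnorm(8 * 4), nrow = 8, dimnames = dimnames(g1))

# Rows index genes of g1, columns index genes of g2.
cross <- cor(g1, g2, method = "spearman")

# Entry [1, 2] is the correlation of gene1 in g1 with gene2 in g2.
pair <- cor(g1[, 1], g2[, 2], method = "spearman")
all.equal(cross[1, 2], pair)

# The matrix is not symmetric in general, so both triangles carry information.
isSymmetric(unname(cross))
```

This keeps the diagonal too, that is, the correlation of each gene with itself across the two groups, which the triangular filter drops.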
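Answer 1 (score: 2)
You cannot compute the correlation between two vectors of different lengths. Correlation is computed on paired data, and this holds for every correlation method.
Missing-value imputation and time-series models (e.g. GARCH) are not recommended here: you are working with biological data, patterns may differ between patients, and these methods cannot account for all the factors that may influence the phenomenon.
In my opinion, the best solution is to drop the extra data so that both samples have the same number of patients. R has a built-in function, cor, whose "use" argument lets you ignore NA values.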
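Both suggestions can be sketched as follows, on hypothetical toy data (a, b and b_pad are illustrative names). This assumes that rows with the same patient name can be meaningfully paired across the two groups, which the question does not confirm.

```r
# Hypothetical toy data: 10 patients in one group, 8 in the other, 3 genes each.
set.seed(7)
a <- matrix(rnorm(10 * 3), nrow = 10,
            dimnames = list(paste0("patient", 1:10), paste0("gene", 1:3)))
b <- matrix(rnorm(8 * 3), nrow = 8,
            dimnames = list(paste0("patient", 1:8), paste0("gene", 1:3)))

# Option 1: keep only the patients present in both groups.
common <- intersect(rownames(a), rownames(b))
r_trim <- cor(a[common, ], b[common, ], method = "spearman")

# Option 2: pad the smaller group with NA rows and let cor drop incomplete pairs.
b_pad <- matrix(NA_real_, nrow = nrow(a), ncol = ncol(b), dimnames = dimnames(a))
b_pad[rownames(b), ] <- b
r_na <- cor(a, b_pad, method = "spearman", use = "pairwise.complete.obs")

# Both routes use the same 8 paired patients.
all.equal(r_trim, r_na)
```

Trimming is the more transparent of the two, since it makes explicit which patients enter the calculation.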