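Correlation between gene pairs across two groups with different sample sizes

Asked: 2015-01-05 05:51:54

Tags: r correlation

I have two groups of patients with the same genes but different dimensions: the number of patients (samples) differs between the groups. The samples within each group are biological replicates.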

sample1 <- structure(c(3.990406045, 4.041745321, 4.002401404, 4.031463584, 
4.046886189, 4.00582865, 3.985265177, 3.9869788, 3.995546913, 
4.00582865, 11.75549075, 11.81394311, 11.81826206, 11.76013913, 
11.8408451, 11.83619671, 11.72858876, 11.73755609, 11.78239274, 
11.83619671, 8.647734791, 8.606480387, 8.64648886, 8.607548328, 
8.605946416, 8.646132879, 8.648268762, 8.648090771, 8.647200821, 
8.646132879, 5.359884744, 5.371302287, 5.37638989, 5.357155019, 
5.378375921, 5.381105646, 5.35281111, 5.355168988, 5.366958378, 
5.381105646, 8.805045323, 8.684889613, 8.794736874, 8.693725426, 
8.680471706, 8.791791603, 8.80946323, 8.807990594, 8.800627416, 
8.791791603, 10.87587031, 10.85539252, 10.87095037, 10.85960961, 
10.85328398, 10.86954467, 10.87797885, 10.877276, 10.87376176, 
10.86954467, 5.505422817, 5.530799682, 5.631682175, 5.422577376, 
5.584910836, 5.667756277, 5.451311664, 5.469348715, 5.559533971, 
5.667756277), .Dim = c(10L, 7L), .Dimnames = list(c("patient1", 
"patient2", "patient3", "patient4", 
"patient5", "patient6", "patient7", 
"patient8", "patient9", "patient10"
), c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene7"
)))

sample2 <- structure(c(3.990406045, 4.041745321, 4.002401404, 4.031463584, 
4.046886189, 4.00582865, 3.985265177, 3.9869788, 11.75549075, 
11.81394311, 11.81826206, 11.76013913, 11.8408451, 11.83619671, 
11.72858876, 11.73755609, 8.647734791, 8.606480387, 8.64648886, 
8.607548328, 8.605946416, 8.646132879, 8.648268762, 8.648090771, 
5.359884744, 5.371302287, 5.37638989, 5.357155019, 5.378375921, 
5.381105646, 5.35281111, 5.355168988, 8.805045323, 8.684889613, 
8.794736874, 8.693725426, 8.680471706, 8.791791603, 8.80946323, 
8.807990594, 10.87587031, 10.85539252, 10.87095037, 10.85960961, 
10.85328398, 10.86954467, 10.87797885, 10.877276, 5.505422817, 
5.530799682, 5.631682175, 5.422577376, 5.584910836, 5.667756277, 
5.451311664, 5.469348715), .Dim = c(8L, 7L), .Dimnames = list(
c("patient1", 
"patient2", "patient3", "patient4", 
"patient5", "patient6", "patient7", 
"patient8"), c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene7")))

Now I want to check the correlation between pairs of genes across the two groups:

library(Hmisc)
rcorr(sample1, sample2, type="s")  # Spearman

I get:

Error in cbind(x, y) : number of rows of matrices must match (see arg 2)

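Correlating the transposed matrices, t(sample), does run, but it gives correlations between pairs of patients. I need correlations between pairs of genes (as shown below). Is something wrong with my approach? Are there statistical points I should take into account?

When the number of patients is the same in both groups, I get this: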

> rcorr(sample1[1:8, ], sample2[1:8,], type="s")
      gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene1 gene2 gene3 gene4 gene5 gene6 gene7
gene1  1.00  0.81 -1.00  0.67 -1.00 -1.00  0.36  1.00  0.81 -1.00  0.67 -1.00 -1.00  0.36
gene2  0.81  1.00 -0.81  0.95 -0.81 -0.81  0.79  0.81  1.00 -0.81  0.95 -0.81 -0.81  0.79
gene3 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36
gene4  0.67  0.95 -0.67  1.00 -0.67 -0.67  0.90  0.67  0.95 -0.67  1.00 -0.67 -0.67  0.90
gene5 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36
gene6 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36
gene7  0.36  0.79 -0.36  0.90 -0.36 -0.36  1.00  0.36  0.79 -0.36  0.90 -0.36 -0.36  1.00
gene1  1.00  0.81 -1.00  0.67 -1.00 -1.00  0.36  1.00  0.81 -1.00  0.67 -1.00 -1.00  0.36
gene2  0.81  1.00 -0.81  0.95 -0.81 -0.81  0.79  0.81  1.00 -0.81  0.95 -0.81 -0.81  0.79
gene3 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36
gene4  0.67  0.95 -0.67  1.00 -0.67 -0.67  0.90  0.67  0.95 -0.67  1.00 -0.67 -0.67  0.90
gene5 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36
gene6 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36 -1.00 -0.81  1.00 -0.67  1.00  1.00 -0.36
gene7  0.36  0.79 -0.36  0.90 -0.36 -0.36  1.00  0.36  0.79 -0.36  0.90 -0.36 -0.36  1.00

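As you can see, the matrix also contains repeated blocks. Why?

2 answers:

Answer 0 (score: 2)

Regarding the last part of the question: rcorr binds the matrices sample1 and sample2 by column and computes the rank correlation coefficients on the combined matrix. If you give the genes in sample1 and sample2 different names, for example: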

colnames(sample1) <- sprintf('sample1.%s',colnames(sample1))
colnames(sample2) <- sprintf('sample2.%s',colnames(sample2)) 

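you will see that the result is a block matrix: the diagonal blocks hold the coefficients within each sample (sample1-sample1 and sample2-sample2), and the off-diagonal blocks hold the coefficients between sample1 and sample2.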

rcorr(sample1[1:8,],sample2[1:8,],type='s')

              sample1.gene1 sample1.gene2 sample1.gene3 sample1.gene4 sample1.gene5 sample1.gene6 sample1.gene7 sample2.gene1 sample2.gene2 sample2.gene3 sample2.gene4
sample1.gene1          1.00          0.81         -1.00          0.67         -1.00         -1.00          0.36          1.00          0.81         -1.00          0.67
sample1.gene2          0.81          1.00         -0.81          0.95         -0.81         -0.81          0.79          0.81          1.00         -0.81          0.95
sample1.gene3         -1.00         -0.81          1.00         -0.67          1.00          1.00         -0.36         -1.00         -0.81          1.00         -0.67
sample1.gene4          0.67          0.95         -0.67          1.00         -0.67         -0.67          0.90          0.67          0.95         -0.67          1.00
sample1.gene5         -1.00         -0.81          1.00         -0.67          1.00          1.00         -0.36         -1.00         -0.81          1.00         -0.67
sample1.gene6         -1.00         -0.81          1.00         -0.67          1.00          1.00         -0.36         -1.00         -0.81          1.00         -0.67
sample1.gene7          0.36          0.79         -0.36          0.90         -0.36         -0.36          1.00          0.36          0.79         -0.36          0.90
sample2.gene1          1.00          0.81         -1.00          0.67         -1.00         -1.00          0.36          1.00          0.81         -1.00          0.67
sample2.gene2          0.81          1.00         -0.81          0.95         -0.81         -0.81          0.79          0.81          1.00         -0.81          0.95
sample2.gene3         -1.00         -0.81          1.00         -0.67          1.00          1.00         -0.36         -1.00         -0.81          1.00         -0.67
sample2.gene4          0.67          0.95         -0.67          1.00         -0.67         -0.67          0.90          0.67          0.95         -0.67          1.00
sample2.gene5         -1.00         -0.81          1.00         -0.67          1.00          1.00         -0.36         -1.00         -0.81          1.00         -0.67
sample2.gene6         -1.00         -0.81          1.00         -0.67          1.00          1.00         -0.36         -1.00         -0.81          1.00         -0.67
sample2.gene7          0.36          0.79         -0.36          0.90         -0.36         -0.36          1.00          0.36          0.79         -0.36          0.90
              sample2.gene5 sample2.gene6 sample2.gene7
sample1.gene1         -1.00         -1.00          0.36
sample1.gene2         -0.81         -0.81          0.79
sample1.gene3          1.00          1.00         -0.36
sample1.gene4         -0.67         -0.67          0.90
sample1.gene5          1.00          1.00         -0.36
sample1.gene6          1.00          1.00         -0.36
sample1.gene7         -0.36         -0.36          1.00
sample2.gene1         -1.00         -1.00          0.36
sample2.gene2         -0.81         -0.81          0.79
sample2.gene3          1.00          1.00         -0.36
sample2.gene4         -0.67         -0.67          0.90
sample2.gene5          1.00          1.00         -0.36
sample2.gene6          1.00          1.00         -0.36
sample2.gene7         -0.36         -0.36          1.00

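In your example, sample2 is identical to the first eight rows of sample1, which is why all four blocks are equal.

Update: the sample1-sample2 correlations can be computed with the cor function: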
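The block structure can also be reproduced with base R alone. This is only a rough sketch: g1 and g2 are made-up random matrices standing in for the two groups (they are not the data from the question), and cor on the column-bound matrix is used in place of rcorr.

```r
# Hypothetical stand-ins for the two groups: 8 patients x 3 genes each
set.seed(1)
g1 <- matrix(rnorm(24), nrow = 8,
             dimnames = list(NULL, paste0("sample1.gene", 1:3)))
g2 <- matrix(rnorm(24), nrow = 8,
             dimnames = list(NULL, paste0("sample2.gene", 1:3)))

# Correlating the column-bound matrix gives a 6 x 6 block matrix
full <- cor(cbind(g1, g2), method = "spearman")

# Off-diagonal block: sample1 genes (rows) against sample2 genes (columns)
cross <- full[colnames(g1), colnames(g2)]

# The same block comes straight out of cor() when both matrices are passed
same_block <- isTRUE(all.equal(cross, cor(g1, g2, method = "spearman")))
```

If only the between-group coefficients are of interest, extracting the off-diagonal block this way avoids reading the duplicated within-group blocks.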

library(reshape2)

# all combinations of column indices for sample1 (s1) and sample2 (s2)
z <- expand.grid(s1 = 1:7, s2 = 1:7)

# keep only the pairs with s2 < s1 (strict upper triangle); this is enough here
# because sample2 equals sample1[1:8, ], so the cross-correlation block is symmetric.
# For two genuinely different groups, skip this filter.
z <- z[z$s2 < z$s1, ]

# Spearman correlation for each remaining pair of genes
z$corr <- mapply(function(i, j) cor(sample1[1:8, i], sample2[1:8, j], method = "spearman"), z$s1, z$s2)

# reshape the result into a triangular matrix (rows: s2, columns: s1)
corr.coefs <- dcast(z, s2 ~ s1, value.var = "corr")
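When the full gene-by-gene table between the two groups is wanted, cor also accepts two matrices and returns the whole cross-correlation block in one call. A minimal sketch, assuming two made-up matrices a and b with the same patients in the same row order (the names and the random data are illustrative only):

```r
# Hypothetical data: 8 patients x 4 genes in each group
set.seed(42)
genes    <- paste0("gene", 1:4)
patients <- paste0("patient", 1:8)
a <- matrix(rnorm(32), nrow = 8, dimnames = list(patients, genes))
b <- matrix(rnorm(32), nrow = 8, dimnames = list(patients, genes))

# Rows index the genes of a, columns index the genes of b
cross <- cor(a, b, method = "spearman")

# Spot-check one entry against the element-wise vector form
one_pair <- cor(a[, "gene1"], b[, "gene3"], method = "spearman")
entry_ok <- isTRUE(all.equal(cross["gene1", "gene3"], one_pair))

# For two different groups the block need not be symmetric,
# so both triangles carry information
sym <- isSymmetric(unname(cross))
```

The entry in row gene1, column gene3 is the correlation of gene1 in the first group with gene3 in the second group, which in general differs from the entry in row gene3, column gene1.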

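Answer 1 (score: 2)

You cannot compute a correlation between two vectors of different lengths. Correlation is computed on paired observations, and this holds for every correlation method.

Missing-value imputation and time-series models (e.g. GARCH) are not recommended here: you are working with biological data, the patterns may differ from patient to patient, and these methods cannot account for all the factors that may alter the phenomenon.

In my opinion, the best solution is to drop the extra data so that both samples contain the same number of patients. R has the built-in function cor, and its "use" argument helps you ignore NA values.

Here is a link: http://www.statmethods.net/stats/correlations.html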
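A quick way to see this in R is to call cor on two vectors of unequal length with each method and catch the error. The vectors below are arbitrary illustrative numbers:

```r
# Two made-up vectors of different lengths (5 vs 3 observations)
x <- c(2.1, 3.4, 1.9, 5.0, 4.2)
y <- c(1.0, 2.5, 0.7)

# Try each method supported by cor() and record whether it raised an error
methods <- c("pearson", "kendall", "spearman")
failed <- vapply(methods, function(m) {
  res <- tryCatch(cor(x, y, method = m), error = function(e) e)
  inherits(res, "error")
}, logical(1))
```

Here failed is a named logical vector with one entry per method.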
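As a sketch of that suggestion, the two matrices can be cut down to the patients they share before calling cor. This assumes that rows with the same name really are paired observations, which has to be justified by the study design. The matrices x, y and y_na below are made-up random data, not the data from the question.

```r
# Hypothetical groups: 10 patients vs 8 patients, 3 genes each
set.seed(7)
genes <- paste0("gene", 1:3)
x <- matrix(rnorm(30), nrow = 10, dimnames = list(paste0("patient", 1:10), genes))
y <- matrix(rnorm(24), nrow = 8,  dimnames = list(paste0("patient", 1:8),  genes))

# Keep only the patients present in both groups, in the same order
common <- intersect(rownames(x), rownames(y))
r <- cor(x[common, ], y[common, ], method = "spearman")

# With missing values, `use` decides which observations enter each coefficient
y_na <- y
y_na["patient2", "gene1"] <- NA
r_na <- cor(x[common, ], y_na[common, ], method = "spearman",
            use = "pairwise.complete.obs")
```

With use = "pairwise.complete.obs", only the coefficients involving gene1 of the second group are computed on seven patients; the others still use all eight.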