Question

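I have two data frames, one for SNPs and one for genes.

For each gene, if a SNP's position falls within the window size, I want to return that row to a data frame. I then want to find the correlation between that SNP and the gene (if it is in the window). I am currently using R.

Gene data frame: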

Chr Start   End sample1 sample2 sample3
10  100015109   100015443   2   1   1
10  100365832   100368960   1   0   2
10  100486970   100487277   2   1   0

SNP data frame:

SNP CHROM   POSITION    sample1 sample2 sample3
rs3766180   1   1478153 1   1   2
rs7540231   1   1506035 2   2   0
rs2272908   1   1721479 1   1   2
rs10907187  1   1759054 0   1   2

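So far I have this code, but I'm not sure I'm iterating correctly. I want to iterate over the genes, check which SNPs fall within the window, and find the r squared between each such SNP and the gene. For example, if snp1's position lies within the gene's start and end range, select that row and then find the r squared between those two rows. I think my loops are wrong, and there is probably a simpler way. Please help.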

snps <- as.matrix(read.table("snps.txt", header=T, sep="\t"))
genes <- as.matrix(read.table("genes.txt", header=T, sep="\t"))

#Set upper and lower bounds
size = 1000000
window_left = genes$cnvStart - size
window_right = genes$cnvEnd + size
snp_pos <- snps$POS
snp <- snps$ID


for (s in 1:nrow(snps)){
  for(g in 1:nrow(genes)){
    if (snp_pos >window_left & snp_pos < window_right){
         corr.matrix2 <- (cor(t(s),t(g),use="pairwise.complete.obs", method="pearson"))
      new_snps <- cbind(snp, snps[,-3])
    }
  }
}

My desired output is a table of r squared values for each selected SNP-gene comparison. Any ideas would be greatly appreciated.

Thanks, Nevada

Answer 1

I've copied your code and commented on it:

snps <- as.matrix(read.table("snps.txt", header=T, sep="\t"))
genes <- as.matrix(read.table("genes.txt", header=T, sep="\t"))

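This isn't wrong, but it is better to make the file type explicit in the name: if the files are tab-separated, they are TSV files (tab-separated files). That way you can easily open them in other programs (Microsoft Excel or similar).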

#Set upper and lower bounds
size = 1000000
window_left = genes$cnvStart - size
window_right = genes$cnvEnd + size
snp_pos <- snps$POS
snp <- snps$ID

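Here you set the variables, but what you get are vectors, so snp and snp_pos are vectors. If you want to use them later, you need to know what kind of data you want.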
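One caveat about the lines above, shown as a minimal sketch on made-up data rather than your actual files: as.matrix() on a data frame that contains a text column (such as the SNP IDs) yields a character matrix, and $ does not work on matrices at all, so these assignments would fail before the loop even starts. Keeping the result of read.table() as a data frame avoids that. The toy values and the names snps_matrix and dollar_result are only for illustration.

```r
# Toy SNP table with the column layout from the question (values are illustrative)
snps <- data.frame(
  SNP      = c("rs3766180", "rs7540231"),
  CHROM    = c(1, 1),
  POSITION = c(1478153, 1506035),
  sample1  = c(1, 2),
  stringsAsFactors = FALSE
)

snps_matrix <- as.matrix(snps)   # every cell is coerced to character
is.character(snps_matrix)        # the positions are now strings

# `$` is not defined for matrices, so this raises an error that we catch here
dollar_result <- tryCatch(snps_matrix$POSITION, error = function(e) "error")

# On the data frame, `$` returns a numeric vector with one entry per SNP
snp_pos <- snps$POSITION
```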

for (s in 1:nrow(snps)){
  for(g in 1:nrow(genes)){

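After extracting the information you need from the data frames, you iterate over the number of SNP rows and the number of gene rows. Why not use the snp_pos and snp variables?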

    if (snp_pos >window_left & snp_pos < window_right){

Here you compare everything at once (the whole vectors), so you don't need the two enclosing for loops.

         corr.matrix2 <- (cor(t(s),t(g),use="pairwise.complete.obs", method="pearson"))

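You don't use the selected variables to compute the pairwise correlation (s and g are only the loop indices). You should use your variables. I'd also suggest plotting the correlations for a visual comparison. (You will probably need the plots anyway.)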

      new_snps <- cbind(snp, snps[,-3])
    }
  }
}

This doesn't create a table; it binds a vector onto a data frame, which is not a table.

I haven't tested it, but I would do something like this:

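snps <- read.table("snps.txt", header=TRUE, sep="\t")   # keep these as data frames so that $ works
genes <- read.table("genes.txt", header=TRUE, sep="\t")

#Set upper and lower bounds
size = 1000000
window_left = genes$cnvStart - size
window_right = genes$cnvEnd + size

# Row subsetting needs the trailing comma.
# window_left and window_right should hold the bounds of a single gene for this comparison.
in_window = snps[snps$POS > window_left & snps$POS < window_right, ]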
corr.matrix2 <- (cor(in_window$, in_window$ ,use="pairwise.complete.obs", method="pearson"))

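I don't really know which kind of correlation you want to compute, so you should fill in the first two arguments of the cor function (the incomplete in_window$). I guess you want to compare which samples have which SNP. But that's another question ;)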
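To make the idea concrete, here is a minimal sketch for a single gene. It assumes the column names from the example tables in the question (Chr, Start, End, CHROM, POSITION, sample1 to sample3) rather than cnvStart or POS, and the SNP rows (snpA to snpD) are made up so that some of them fall inside the window. Matching on chromosome is my own assumption, since a position only means something on its own chromosome. r squared is taken as the square of the Pearson correlation.

```r
# Hypothetical toy data; only the column layout follows the question
genes <- data.frame(
  Chr = c(10, 10), Start = c(100015109, 100365832), End = c(100015443, 100368960),
  sample1 = c(2, 1), sample2 = c(1, 0), sample3 = c(1, 2)
)
snps <- data.frame(
  SNP = c("snpA", "snpB", "snpC", "snpD"),
  CHROM = c(10, 10, 10, 1),
  POSITION = c(100015200, 100500000, 105000000, 100015200),
  sample1 = c(1, 2, 1, 0), sample2 = c(1, 2, 1, 1), sample3 = c(2, 0, 2, 2),
  stringsAsFactors = FALSE
)

size <- 1000000
sample_cols <- c("sample1", "sample2", "sample3")

gene <- genes[1, ]   # one gene at a time, so the window bounds are scalars

# Keep SNPs on the same chromosome whose position lies inside the window
in_window <- snps[snps$CHROM == gene$Chr &
                  snps$POSITION > gene$Start - size &
                  snps$POSITION < gene$End + size, ]

# Correlate each selected SNP row with the gene row across the samples
gene_values <- unlist(gene[sample_cols], use.names = FALSE)
r <- apply(in_window[, sample_cols], 1, function(x) cor(x, gene_values))

result <- data.frame(SNP = in_window$SNP, r = unname(r), r2 = unname(r)^2,
                     stringsAsFactors = FALSE)
```

Wrapping the part after sample_cols in a loop over the rows of genes, and collecting each result, would give one table for all genes.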

Answer 2

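OK, there are still a few things that are unclear to me.

First: none of the positions in the SNP data frame fall within the Start and End range of the Genes data frame, so I made up an example.

Second: do you want to correlate that row with the other row, using the values under sample1, 2 and 3?

I.e., do you want these rows?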

Chr Start   End sample1 sample2 sample3
10  100015109   100015443   2   1   1 <----  THIS ROW?

SNP CHROM   POSITION    sample1 sample2 sample3
rs3766180   1   1478153 1   1   2    <---- AND THIS ROW?

My understanding is that you want to correlate   2   1   1  with  1   1   2

I now have a working example:

Genes<-data.frame(Chr=c(10,10,10),Start=c(100015109,100365832,100486970),End=c(100015443,100368960,100487277),sample1=c(2,1,2),sample2=c(1,0,1),sample3=c(1,2,0))
SNP <- data.frame(SNP= c("rs3766180","rs7540231","rs2272908"),CHROM=c(1,1,1),POSITION=c(100015200,100365831,100486971),sample1=c(1,2,1),sample2=c(1,2,1),sample3=c(2,0,2))

> Genes
  Chr     Start       End sample1 sample2 sample3
1  10 100015109 100015443       2       1       1
2  10 100365832 100368960       1       0       2
3  10 100486970 100487277       2       1       0
> SNP
        SNP CHROM  POSITION sample1 sample2 sample3
1 rs3766180     1 100015200       1       1       2
2 rs7540231     1 100365831       2       2       0
3 rs2272908     1 100486971       1       1       2

library(gtools) # provides smartbind()

CorTestMatrix <- data.frame()

for (igene in 1:nrow(Genes)) { # for every gene
        curGeneRow <- Genes[igene ,] # take that row
        for (isnp in 1:nrow(SNP)) { # for every SNP
                cursnp <- SNP[isnp ,] # take that row of SNP
                if (cursnp$POSITION > curGeneRow$Start && curGeneRow$End > cursnp$POSITION) { # is the SNP in the gene window?
                        x<-as.numeric(as.vector(cursnp[,4:ncol(cursnp)])) # SNP values: the sample columns after POSITION
                        y<-as.numeric(as.vector(curGeneRow[,4:ncol(curGeneRow)])) # gene values: the sample columns after End
                        corTest <- cor.test(x,y) # correlate those two
                        CurTestMatrix <- data.frame(GeneChr=curGeneRow$Chr,SNP=as.character(cursnp$SNP),test=corTest$p.value)
                        # saves some info from both data frames and the p-value from the cor.test.
                        # you need to edit this to get additional data.
                        CorTestMatrix <- smartbind(CorTestMatrix,CurTestMatrix)

                }

        }
}

> CorTestMatrix
    GeneChr       SNP      test
1:1      10 rs3766180 0.6666667
1:2      10 rs3766180 0.6666667
2        10 rs2272908 0.3333333

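There may be a faster way to do this, but for loops are easy to edit and work with. I set the example up so that the first and third SNP rows fall within the Start and End range of gene rows 1 and 3 respectively, which gives 2 correlation tests.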
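For what it's worth, here is one possible way to find the matching pairs without the nested loops, sketched with base R only (outer() and which(), no gtools). It reuses the example data from above and also reports r squared, since that is what the question asked for. The names hits, matches, results and sample_cols are just illustrative.

```r
Genes <- data.frame(Chr = c(10, 10, 10),
                    Start = c(100015109, 100365832, 100486970),
                    End = c(100015443, 100368960, 100487277),
                    sample1 = c(2, 1, 2), sample2 = c(1, 0, 1), sample3 = c(1, 2, 0))
SNP <- data.frame(SNP = c("rs3766180", "rs7540231", "rs2272908"),
                  CHROM = c(1, 1, 1),
                  POSITION = c(100015200, 100365831, 100486971),
                  sample1 = c(1, 2, 1), sample2 = c(1, 2, 1), sample3 = c(2, 0, 2),
                  stringsAsFactors = FALSE)
sample_cols <- c("sample1", "sample2", "sample3")

# hits[i, j] is TRUE when SNP j lies strictly between Start and End of gene i
hits <- outer(Genes$Start, SNP$POSITION, "<") & outer(Genes$End, SNP$POSITION, ">")

# One row per matching (gene, SNP) pair; the columns are named "row" and "col"
matches <- which(hits, arr.ind = TRUE)

# Run one correlation test per matching pair and stack the rows into a table
results <- do.call(rbind, lapply(seq_len(nrow(matches)), function(k) {
  g <- matches[k, "row"]
  s <- matches[k, "col"]
  x <- unlist(SNP[s, sample_cols], use.names = FALSE)
  y <- unlist(Genes[g, sample_cols], use.names = FALSE)
  ct <- cor.test(x, y)
  data.frame(GeneRow = g, SNP = SNP$SNP[s],
             r2 = unname(ct$estimate)^2, p.value = ct$p.value,
             stringsAsFactors = FALSE)
}))
```

Only the pairs that actually match go through cor.test(), so the work scales with the number of hits rather than with every gene-SNP combination, although the two outer() matrices still have one cell per combination.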

If needed, I suggest accounting for samples that are not normally distributed:

sp_Cov1 <- shapiro.test(x); sp_Cov2 <- shapiro.test(y) # test both samples for normality
if (sp_Cov1$p.value < 0.05 || sp_Cov2$p.value < 0.05) {
  correlationToUse = 'kendall' # at least one sample looks non-normal, so use a rank-based method
} else {
  correlationToUse = 'pearson'
}

corTest <- cor.test(x,y,method=correlationToUse)

This avoids a biased estimate of the p-value.
