I get wrong results when I try to select data from a large data set (d1) based on a small data set (d2). My script and the problem are below.
d1 <- read.table("MSv25.txt",header=T)
d2 <- read.table("Flairall.Gene.txt",header=T)
d2$low <- d2$start-10000 ; d2$high <- d2$end+10000
d1$matched <- apply(d1,1,function(p) which(p['POS'] >=d2[,'low'] & p['POS'] <= d2[,'high'] & p['CHR']==d2[,"chromosome"]))
d3 <- cbind(d1[which(d1$matched >0),], d2[unlist(d1$matched[which(d1$matched>0)]),])
write.table(d3,file="Flairall.GOBSgene.txt",quote=FALSE,sep="\t",row.names=FALSE,col.names=TRUE)
d1 looks like this:
SNP CHR POS A1 A2 OR P
rs1000007 2 237416793 C T 0.9785 0.4868
rs1000003 3 99825597 G A 0.9091 0.009774
rs1000002 3 185118462 C T 1.0111 0.6765
rs10000012 4 1347325 G C 1.0045 0.9087
rs10000042 4 5288053 T C 1.0622 0.3921
rs10000062 4 5305645 G C 1.0116 0.779
rs10000132 4 7450570 T C 0.9734 0.4748
rs10000081 4 16957461 C T 1.0288 0.3585
rs10000100 4 19119591 A G 1.0839 0.1417
rs10000010 4 21227772 C T 0.971 0.2693
rs10000092 4 21504615 C T 1.0342 0.27
rs10000068 4 36600682 T C 1.055 0.103
rs10000013 4 36901464 C A 1.0198 0.5379
rs10000037 4 38600725 A G 1.0249 0.4217
rs10000017 4 84997149 T C 0.9576 0.1912
rs10000109 4 91586292 A T 0.9956 0.9349
rs10000023 4 95952929 T G 0.9998 0.9951
rs10000030 4 103593179 A G 1.0839 0.04208
rs10000111 4 107137517 A G 1.0812 0.3128
rs10000124 4 109325900 A C 1.0642 0.1906
rs10000064 4 128029071 C T 1.0388 0.1578
rs10000029 4 138905074 C T 0.7832 0.14
rs10000036 4 139438712 T C 0.9848 0.5683
rs10000033 4 139819348 C T 0.9918 0.7542
rs10000121 4 157793485 A G 1.0008 0.9769
rs10000041 4 165841405 G T 1.0042 0.9146
rs10000082 4 167529642 T C 0.9733 0.6612
d2 looks like this:
gene start end chromosome
WFDC9 237416000 237418000 2
SRGAP3 19119590 21504615 4
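In general, I want to select the SNPs that fall within each gene, using a window that extends 10 kb beyond the start and end positions.
This is the result of my script: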
SNP CHR POS A1 A2 OR P matched gene start end chromosome low high
1 rs1000007 2 237416793 C T 0.9785 0.4868 1 WFDC9 237416000 237418000 2 237406000 237428000
This is incorrect, because one gene is missing. The correct result should be:
gene start end chromosome SNP CHR POS A1 A2 OR P
WFDC9 237416000 237418000 2 rs1000007 2 237416793 C T 0.9785 0.4868
SRGAP3 19119590 21504615 4 rs10000100 4 19119591 A G 1.0839 0.1417
SRGAP3 19119590 21504615 4 rs10000010 4 21227772 C T 0.971 0.2693
SRGAP3 19119590 21504615 4 rs10000092 4 21504615 C T 1.0342 0.27
Can anyone help me point out what is wrong? Thanks very much.
Answer 0 (score: 2)
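Your code appears to be completely backwards relative to what you are trying to achieve:
"For each gene (in d2), which SNPs (from d1) are within 10kb of that gene?"
First of all, the code for d1$matched is backwards. All of your p and d2 should be the other way round (at the moment it doesn't make much sense?), which gives you a list of the SNPs in cis to each gene (+/- 10kb).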
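A further pitfall in the apply() call is worth illustrating. The sketch below uses three rows copied from the question's d1 (only three columns are kept; the variable name pos_seen is made up for this example). Because apply() converts the data frame to a matrix first, and d1 contains text columns, each row p arrives as a character vector, so p['POS'] is a string rather than a number. R's comparison operators coerce the numeric side to character in that situation, so the window test is likely being done as a string comparison.

```r
# Toy rows copied from the question's d1 (only three columns kept).
d1 <- data.frame(
  SNP = c("rs1000007", "rs10000012", "rs10000100"),
  CHR = c(2L, 4L, 4L),
  POS = c(237416793L, 1347325L, 19119591L),
  stringsAsFactors = FALSE
)

# apply() runs as.matrix() on the data frame; since SNP is text, every
# cell becomes a string and numbers are padded to a common width.
pos_seen <- apply(d1, 1, function(p) p[["POS"]])
print(class(pos_seen))   # class of POS as seen inside apply()
print(nchar(pos_seen))   # string widths after padding

# Indexing the data frame directly keeps POS numeric.
print(class(d1$POS))
```

Indexing d1 and d2 by column, as in the loop that follows, avoids this conversion.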
I would approach it the way I phrased the question:
cisWindow <- 10000 # size of your +/- window, in this case 10kb.
d3 <- data.frame()
# For each gene, locate the cis-SNPs
for (i in seq_len(nrow(d2))) {
  # Broken down into steps for readability.
  inCis <- d1[which(d1[,"CHR"] == d2[i, "chromosome"]),]
  inCis <- inCis[which(inCis[,"POS"] >= (d2[i, "start"] - cisWindow)),]
  inCis <- inCis[which(inCis[,"POS"] <= (d2[i, "end"] + cisWindow)),]
  # Now we have the cis-SNPs, so let's build the data.frame for this gene,
  # and grow our data.frame d3:
  if (nrow(inCis) > 0) {
    d3 <- rbind(d3, cbind(d2[i,], inCis))
  }
}
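I tried to find a solution that doesn't involve growing d3 in a loop, but because you are attaching each row of d2 to 0 or more rows of d1, I wasn't able to come up with one that isn't horribly inefficient.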
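One possible way around growing d3 row by row is to collect the per-gene pieces in a list and bind them once at the end. This is only a sketch: cis_snps is a hypothetical helper name, the toy data are a subset of the rows shown in the question, and it has not been benchmarked against the loop on a large data set.

```r
# Hypothetical helper: for each gene, find SNPs on the same chromosome
# within +/- window of the gene, then bind all pieces in one rbind call.
cis_snps <- function(snps, genes, window = 10000) {
  pieces <- lapply(seq_len(nrow(genes)), function(i) {
    hit <- which(snps$CHR == genes$chromosome[i] &
                 snps$POS >= genes$start[i] - window &
                 snps$POS <= genes$end[i] + window)
    if (length(hit) == 0) return(NULL)
    # Repeat the gene row once per matching SNP and attach the SNP columns.
    cbind(genes[rep(i, length(hit)), , drop = FALSE],
          snps[hit, , drop = FALSE])
  })
  pieces <- Filter(Negate(is.null), pieces)  # drop genes with no SNPs
  if (length(pieces) == 0) return(NULL)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Toy data: a subset of the rows from the question.
d1 <- data.frame(
  SNP = c("rs1000007", "rs1000003", "rs10000081", "rs10000100",
          "rs10000010", "rs10000092", "rs10000068"),
  CHR = c(2, 3, 4, 4, 4, 4, 4),
  POS = c(237416793, 99825597, 16957461, 19119591,
          21227772, 21504615, 36600682),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  gene = c("WFDC9", "SRGAP3"),
  start = c(237416000, 19119590),
  end = c(237418000, 21504615),
  chromosome = c(2, 4),
  stringsAsFactors = FALSE
)

d3 <- cis_snps(d1, d2)
print(d3)
```

On this toy input the result has one WFDC9 row and three SRGAP3 rows, matching the expected output in the question. The list still holds one piece per gene in memory, so this only removes the repeated copying caused by calling rbind inside the loop.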