R code challenge: retrieve the values in matching columns and sum them across matching rows

Time: 2015-07-02 07:33:24

Tags: r

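I am working on this problem in R. I have a data frame called testa (dput included). I need to match each letter in the ALT column to the column names (A, C, G, T, N) and take the corresponding value from that column, do the same for the REF letter, and produce the result ad.new (my code already does this).

However, I need to extend this code to handle the row whose TYPE ends in flat. For a flat row, I need to match its start ID (chr10:102053031) against the other IDs in the start column. Where they match, I need to sum the A, C, G, T, N values selected by the ALT column of those rows and put that sum, together with the REF value, into the flat row's ad.new entry.

If you run the dput and my code you will see what I mean. Basically, I want to match the letters in the REF and ALT columns, take the corresponding values from the columns (A, C, G, T, N), and join them with a comma as REF,ALT. But (in this example) for the flat row I want to take the value in column A of one row whose start ID matches the flat row's start ID (6 in this case) and the value from the other matching row (7, in column G) and add them to get 13. So for the flat row my result should be 0,13.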
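To make the flat-row arithmetic concrete, here is a minimal sketch using only the three rows that share the start ID chr10:102053031. The vector names (start, alt_count, is_flat) are made up for illustration; the counts 6, 0 and 7 are the ALT values those rows have in the example data, and the leading 0 simply follows the expected output.

```r
# Rows 2, 6 and 7 of testa all share this start ID; row 6 is the flat one.
start     <- rep("chr10:102053031", 3)
alt_count <- c(6, 0, 7)                  # ALT lookups: A = 6, G = 0, G = 7
is_flat   <- c(FALSE, TRUE, FALSE)

# Sum the ALT counts of the other rows that share the flat row's start ID.
flat_id    <- start[is_flat]
flat_sum   <- sum(alt_count[start == flat_id & !is_flat])
flat_entry <- paste0("0,", flat_sum)
print(flat_entry)
```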

The expected result is shown below.

My incomplete code:

testa[is.na(testa)]<-0
# value in the column named by each row's REF letter
ref.counts<-testa[,testa[,"REF"]]
ref.counts<-as.matrix(ref.counts)
ref.counts[is.na(ref.counts)]<-0
ref.counts<-diag(ref.counts)

# value in the column named by each row's ALT letter
alt.counts<-testa[,testa[,"ALT"]]
alt.counts<-as.matrix(alt.counts)
alt.counts[is.na(alt.counts)]<-0
alt.counts<-diag(alt.counts)

#############
##need to extend this code here
#############
ad.new<-paste(ref.counts,alt.counts,sep=",")
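For readers unfamiliar with the indexing used above: testa[, testa[, "REF"]] builds a square matrix whose j-th column is the column named by row j's letter, so the diagonal holds each row's own lookup. A tiny sketch with a made-up two-row matrix (m and pick are hypothetical names, not part of the question's data):

```r
# Toy character matrix: column A holds "1","2"; column C holds "3","4".
m <- matrix(c("1", "2", "3", "4"), nrow = 2,
            dimnames = list(NULL, c("A", "C")))

pick <- c("C", "A")   # row 1 wants column C, row 2 wants column A

wide   <- m[, pick]   # 2 x 2 matrix with columns C, A
picked <- diag(wide)  # entry (1, C) and entry (2, A)
print(picked)
```

Note that this approach builds an n x n intermediate matrix, which is fine for a few rows but grows quadratically with the number of rows.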

Input for testa (dput):

structure(c("chr10:101544447", "chr10:102053031", "chr10:102778767", 
"chr10:102789831", "chr10:102989480", "chr10:102053031", "chr10:102053031", 
"0", "6", "0", "0", "0", "0", "0", "0", "34", "24", "0", "0", 
"34", "34", "0", "0", "0", "0", "0", "0", "7", "53", "0", "0", 
"30", "12", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", 
"chr10", "chr10", "chr10", "chr10", "chr10", "chr10", "chr10", 
"101544447", "102053031", "102778767", "102789831", "102989480", 
"102053031", "102053031", "A", "C", "C", "C", "C", "C", "C", 
"T", "A", "T", "T", "T", "G", "G", "snp", "snp", "snp", "snp", 
"snp", "snp:102053031:flat", "snp", "nonsynonymous SNV", 
"intronic", "nonsynonymous SNV", "nonsynonymous SNV", "ncRNA_exonic", 
"intronic", "intronic", "ABCC2:NM_000392:exon2:c.A116T:p.Y39F,", 
"PKD2L1", "PDZD7:NM_024895:exon8:c.G1136A:p.R379Q,PDZD7:NM_001195263:exon8:c.G1136A:p.R379Q,", 
"PDZD7:NM_024895:exon2:c.G146A:p.R49Q,PDZD7:NM_001195263:exon2:c.G146A:p.R49Q,", 
"LBX1-AS1", "PKD2L1", "PKD2L1"), .Dim = c(7L, 15L), .Dimnames = list(
    c("1", "2", "3", "4", "5", "6", "7"), c("start", "A", "C", 
    "G", "T", "N", "=", "-", "chr", "end", "REF", "ALT", "TYPE", 
    "refGene::location", "refGene::type")))

Expected result

 ad.new
"0,53"
"34,6"
"24,0"
"0,30"
"0,12"
"0,13" 
"34,7"

1 answer:

Answer 0 (score: 2)

Something like this should work:

# apply the "normal" rule (ignoring the flat exceptions)
alts <- as.numeric(diag(testa[,testa[,"ALT"]]))
refs <- as.numeric(diag(testa[,testa[,"REF"]]))
res <- paste(refs,alts,sep=",")

# replace lines having TYPE ending with "flat"
flats <- grep('.*flat$',testa[,"TYPE"])
res[flats] <- 
unlist(lapply(flats,function(x){
                startId <- testa[x,"start"]
                # other rows sharing this start ID (excluding the flat row x itself)
                selection <- setdiff(which(testa[,"start"] == startId),x)
                paste0("0,",sum(alts[selection]))
             }))

ad.new <- as.matrix(res)
> ad.new
     [,1]  
[1,] "0,53"
[2,] "34,6"
[3,] "24,0"
[4,] "0,30"
[5,] "0,12"
[6,] "0,13"
[7,] "34,7"
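As a possible variation (a sketch, not the answerer's code), the per-row lookup can also be done with a two-column (row, column) index matrix instead of diag, which avoids the n x n intermediate. The block below rebuilds a reduced copy of testa containing only the columns the computation needs; the helper name lookup is made up for illustration, and the hard-coded "0," for flat rows follows the answer above.

```r
# Reduced copy of testa: only the columns used by the computation.
testa <- cbind(
  start = c("chr10:101544447", "chr10:102053031", "chr10:102778767",
            "chr10:102789831", "chr10:102989480", "chr10:102053031",
            "chr10:102053031"),
  A = c("0", "6", "0", "0", "0", "0", "0"),
  C = c("0", "34", "24", "0", "0", "34", "34"),
  G = c("0", "0", "0", "0", "0", "0", "7"),
  T = c("53", "0", "0", "30", "12", "0", "0"),
  N = rep("0", 7),
  REF  = c("A", "C", "C", "C", "C", "C", "C"),
  ALT  = c("T", "A", "T", "T", "T", "G", "G"),
  TYPE = c("snp", "snp", "snp", "snp", "snp", "snp:102053031:flat", "snp")
)

# For each row i, read the cell in the column named by testa[i, letter_col].
lookup <- function(letter_col) {
  idx <- cbind(seq_len(nrow(testa)),
               match(testa[, letter_col], colnames(testa)))
  as.numeric(testa[idx])
}

refs <- lookup("REF")
alts <- lookup("ALT")
res  <- paste(refs, alts, sep = ",")

# Flat rows: sum the ALT counts of the other rows with the same start ID.
flats <- grep("flat$", testa[, "TYPE"])
for (i in flats) {
  others <- setdiff(which(testa[, "start"] == testa[i, "start"]), i)
  res[i] <- paste0("0,", sum(alts[others]))
}
print(res)
```

On the example data this should reproduce the seven ad.new values listed in the question.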