Concatenating strings using a common identifier

Date: 2014-03-21 19:43:04

Tags: r

s1 <- "A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC"
s2 <- "A*01 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC"
s3 <- "A*01 TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"

How can I concatenate these strings using the identifier "A*01", so that it appears only once at the start of the result?

Expected output:

sT <- "A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"

3 answers:

Answer 0 (score: 1)

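In this case,

gsub(" A\\*01 ", " ", paste(s1, s2, s3, sep=" ", collapse=""))

will do what you want: it pastes the three strings together and then removes every "A*01" that is preceded by a space, which leaves only the one at the very start. I suspect you may need a more general solution in the long run, though.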
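One way this could be generalized to any number of strings is sketched below. It assumes every string starts with the identifier followed by a single space. The name join_by_id is hypothetical, and the inputs are shortened versions of s1, s2 and s3.

```r
# Hypothetical helper: join strings that all start with the same identifier,
# keeping the identifier only once at the front of the result.
join_by_id <- function(strings, id) {
  prefix <- paste0(id, " ")
  # Assumption: every input starts with "<id> "
  stopifnot(all(startsWith(strings, prefix)))
  # Drop the leading "<id> " from each string
  bodies <- substring(strings, nchar(prefix) + 1)
  paste(id, paste(bodies, collapse = " "))
}

# Shortened stand-ins for the strings in the question
s1 <- "A*01 ATG GCC"
s2 <- "A*01 TCC CAC"
s3 <- "A*01 TAC GTG"

joined <- join_by_id(c(s1, s2, s3), "A*01")
print(joined)
```

Because the prefix is removed by position rather than by a regular expression, the asterisk in the identifier needs no escaping here.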

Answer 1 (score: 1)

Try this:

> concat <- paste(s1, sub("A[*]01 ", "", s2), sub("A[*]01 ", "", s3))
> identical(sT, concat)
[1] TRUE

concat looks like this:

> concat
[1] "A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
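The pattern above is written "A[*]01 " because a bare asterisk is a regex quantifier. As a small illustration on a shortened string (not the original data), the following compares the unescaped pattern with three ways of matching the asterisk literally:

```r
s <- "A*01 TCC CAC"

# Unescaped, "A*" means "zero or more A", so only "01 " is matched and removed
wrong <- sub("A*01 ", "", s)

# Three equivalent ways to match the literal asterisk
by_class  <- sub("A[*]01 ", "", s)              # character class
by_escape <- sub("A\\*01 ", "", s)              # backslash escape
by_fixed  <- sub("A*01 ", "", s, fixed = TRUE)  # no regex interpretation at all

print(c(wrong, by_class, by_escape, by_fixed))
```

The first call leaves "A*TCC CAC", while the other three all return "TCC CAC".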

Answer 2 (score: 1)

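For a more general solution, I'm assuming you have a file with a bunch of lines that look like the ones in the question. If so, the following should give you what you need.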

library(stringr)
library(plyr)

dat <- readLines(textConnection("A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC
A*01 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC
A*01 TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG
A*02 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC
A*02 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC
A*02 TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG
A*03 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC
A*04 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC
A*04 TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"))


# Split each line into its identifier (e.g. "A*01") and the sequence after the first space
dat.df <- data.frame(prefix=str_match(dat, "(^A\\*[0-9]+) ")[,2],
                     sequence=str_match(dat, " (.*)$")[,2], stringsAsFactors=FALSE)

# For each identifier, join all of its sequences and put the identifier in front once
res <- daply(dat.df, .(prefix), .fun=function(x) {
  paste(x$prefix[1], paste(x$sequence, collapse=" "))
})

names(res) <- NULL

print(res)

## [1] "A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
## [2] "A*02 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
## [3] "A*03 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC"              
## [4] "A*04 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
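If loading stringr and plyr is not an option, the same grouping can be sketched in base R. This is only an illustration on shortened input, and it assumes every line has the form identifier, one space, then the sequence.

```r
# Shortened example lines; real input could come from readLines()
dat <- c("A*01 ATG GCC", "A*01 TCC CAC", "A*02 ATG GCC", "A*03 TTC")

# Split each line at its first space into identifier and sequence
prefix   <- sub(" .*$", "", dat)
sequence <- sub("^[^ ]+ ", "", dat)

# Group the sequences by identifier, keeping identifiers in order of first appearance
groups <- split(sequence, factor(prefix, levels = unique(prefix)))

# Rebuild one string per identifier: "<id> <seq1> <seq2> ..."
res <- vapply(names(groups),
              function(id) paste(id, paste(groups[[id]], collapse = " ")),
              character(1), USE.NAMES = FALSE)

print(res)
```

Matching on "everything before the first space" instead of "A\\*[0-9]+" means the sketch does not depend on the exact identifier format, at the cost of not validating it.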