合并由多行分隔的字符串

时间:2014-09-27 20:05:03

标签: r

我有一组字符串,其ID以>开头。我想在一行上得到每个ID后面的字符串,而不是像现在这样在多行上分开。该字符串有时可以在1,2或3行分开。

fileName="hairpin"
conn=file(fileName,open="r")
linn=readLines(conn)
for (i in 1:length(linn)){
 print(linn[i])
}
close(conn)
head(linn)

[1] ">cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop" 
[2] "UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAAC"
[3] "UAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA"                     
[4] ">cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop" 
[5] "AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCU"
[6] "GGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU

输出

[1] ">cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop"  "UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA"                     
[4] ">cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop"  "AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCUGGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU"

我在anothet网站上找到了解决方案:

 awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa

1 个答案:

答案 0 :(得分:1)

试试这个:

g <- cumsum(grepl("^>", Lines)) # equals 1 for first group, 2 for second, etc.
unname(unlist(tapply(Lines, g, function(x) c(x[1], paste(x[-1], collapse = "")))))

,并提供:

[1] ">cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop"                                        
[2] "UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA"
[3] ">cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop"                                        
[4] "AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCUGGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU"     

注意输入Lines为:

Lines <- c(">cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop",
"UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAAC",
"UAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA",
">cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop",
"AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCU",
"GGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU")