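I have a set of sequences whose IDs start with >. I want the sequence that follows each ID on a single line, rather than split across several lines as it is now. A sequence may be split over 1, 2, or 3 lines.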
fileName <- "hairpin"
conn <- file(fileName, open = "r")
linn <- readLines(conn)
# Print each line; seq_along() is also safe when the file is empty
for (i in seq_along(linn)) {
  print(linn[i])
}
close(conn)
head(linn)
[1] ">cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop"
[2] "UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAAC"
[3] "UAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA"
[4] ">cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop"
[5] "AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCU"
[6] "GGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU"
Desired output:
[1] ">cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop" "UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA"
[4] ">cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop" "AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCUGGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU"
I found a solution on another website:
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa
Answer 0 (score: 1)
Try this:
g <- cumsum(grepl("^>", Lines)) # equals 1 for first group, 2 for second, etc.
unname(unlist(tapply(Lines, g, function(x) c(x[1], paste(x[-1], collapse = "")))))
which gives:
[1] ">cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop"
[2] "UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA"
[3] ">cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop"
[4] "AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCUGGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU"
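To see why this works, it may help to look at the grouping vector on its own. The following is only a sketch on a shortened, made-up input (the `toy` vector is hypothetical and not taken from the question); it uses split() and lapply() in place of tapply(), but the idea is the same.

```r
# Hypothetical toy input: two records, sequences split over 2 and 1 lines.
toy <- c(">id1", "AAA", "CCC", ">id2", "GGG")

# grepl() flags header lines; cumsum() turns the flags into a running
# record number, so every line carries the index of the header above it.
is_header <- grepl("^>", toy)
g <- cumsum(is_header)
print(g)

# split() groups the lines by record: each group is a header followed
# by its sequence chunks.
groups <- split(toy, g)

# Keep the header as-is and glue the remaining lines into one string.
joined <- lapply(groups, function(x) c(x[1], paste(x[-1], collapse = "")))
result <- unlist(joined, use.names = FALSE)
print(result)
```

Because the grouping depends only on where the `>` lines fall, it does not matter whether a sequence is wrapped over 1, 2, or 3 lines.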
Note that the input Lines is:
Lines <- c(">cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop",
"UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAAC",
"UAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA",
">cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop",
"AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCU",
"GGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU")
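If the data is in a file, as in the question, the same approach could be wrapped in a small function and applied to the result of readLines(). This is a sketch under assumptions: the function name `unwrap_fasta`, the temporary files, and the short example records are all made up for illustration, and lines appearing before the first `>` header are simply dropped.

```r
# Hypothetical helper: join wrapped sequence lines under each ">" header.
unwrap_fasta <- function(lines) {
  g <- cumsum(grepl("^>", lines))
  # Lines before the first header (g == 0) belong to no ID; drop them.
  keep <- g > 0
  out <- tapply(lines[keep], g[keep],
                function(x) c(x[1], paste(x[-1], collapse = "")))
  unname(unlist(out))
}

# Demonstrate on a temporary file so the sketch is self-contained.
infile <- tempfile(fileext = ".fa")
writeLines(c(">a desc", "ACGU", "UU", ">b desc", "GG"), infile)

flat <- unwrap_fasta(readLines(infile))
print(flat)

# Write the result with one sequence line per record.
outfile <- tempfile(fileext = ".fa")
writeLines(flat, outfile)
```

For the question's file, `unwrap_fasta(readLines("hairpin"))` would be the corresponding call.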