Split strings on the pipe character and put the fields into columns

Asked: 2021-02-11 16:47:03

Tags: r

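I have this vector of strings. I want to split each one on | and extract fields 2, 3, 4 and 10 into four different columns. I can split the first string (test[1]) with unlist(strsplit(test,split='|',fixed=TRUE))[c(2:4,10)], but I'm not sure how to handle all the strings in the vector. Any help would be appreciated.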

test <- c("PR;ANN=T|splice_region_variant&intron_variant|LOW|PER3|ENSG00000049246|transcript|ENST00000377532|protein_coding|13/20|c.1658+7G>T||||||,T|splice_region_variant&intron_variant|LOW|PER3|ENSG00000049246|transcript|ENST00000613533|protein_coding|14/21|c.1658+7G>T||||||,T|splice_region_variant&intron_variant|LOW|PER3|ENSG00000049246|transcript|ENST00000614998|protein_coding|14/22|c.1658+7G>T||||||,T|splice_region_variant&intron_variant|LOW|PER3|ENSG00000049246|transcript|ENST00000361923|protein_coding|13/20|c.1634+7G>T||||||,T|intron_variant|MODIFIER|RP3-467L1.4|ENSG00000236266|transcript|ENST00000451646|antisense|1/2|n.239+7677C>A||||||;AC=64;AC_AFR=1;AC_AMR=0;AC_Adj=64;AC_EAS=0;AC_FIN=0;AC_Het=64;AC_Hom=0;AC_NFE=63;AC_OTH=0;AC_SAS=0;AF=5.271e-04;AN=121410;AN_AFR=10404;AN_AMR=11578;AN_Adj=121084;AN_EAS=8652;AN_FIN=6614;AN_NFE=66616;AN_OTH=906;AN_SAS=16314;CSQ=A|ENSG00000236266|ENST00000451646|Transcript|intron_variant&non_coding_transcript_variant||||||rs200733001|2||-1|RP3-467L1.4|Clo... <truncated>",
"PR;ANN=G|intron_variant|MODIFIER|PIGK|ENSG00000142892|transcript|ENST00000370812|protein_coding|10/10|c.1072-59T>C||||||,G|intron_variant|MODIFIER|PIGK|ENSG00000142892|transcript|ENST00000445065|protein_coding|7/7|c.790-59T>C||||||,G|intron_variant|MODIFIER|PIGK|ENSG00000142892|transcript|ENST00000487906|nonsense_mediated_decay|6/6|n.*561-59T>C||||||"
)

2 answers:

Answer 0 (score: 2)

If there are multiple elements, loop over the list created by strsplit, extract ([) the elements you need from each one, and rbind the results into a matrix.
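The code that accompanied this answer did not survive on the page. The following is only a sketch of what the description suggests, not the answerer's exact code. It uses a short made-up vector x (hypothetical, standing in for test) so that it runs on its own:

```r
# Hypothetical stand-in for `test`: pipe-delimited strings with at least 10 fields
x <- c("a1|b1|c1|d1|e1|f1|g1|h1|i1|j1|k1",
       "a2|b2|c2|d2|e2|f2|g2|h2|i2|j2")

# strsplit() returns a list with one character vector per input string
parts <- strsplit(x, split = "|", fixed = TRUE)

# `[` pulls fields 2, 3, 4 and 10 out of each vector;
# do.call(rbind, ...) stacks those vectors as the rows of a matrix
res <- do.call(rbind, lapply(parts, `[`, c(2:4, 10)))
res
```

Each row of res corresponds to one input string and each column to one of the requested fields, which is the layout the question asks for.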

Answer 1 (score: 1)

Try this lapply expression:

lapply(strsplit(test, '\\|'), function(x) x[c(2:4,10)])
[[1]]
[1] "splice_region_variant&intron_variant" "LOW"                                 
[3] "PER3"                                 "c.1658+7G>T"                         

[[2]]
[1] "intron_variant" "MODIFIER"       "PIGK"           "c.1072-59T>C"
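Note that the pipe is escaped here ('\\|'), whereas the question passes fixed=TRUE. An unescaped | would be read as the regular-expression alternation operator rather than as a literal character. As a small check, assuming a short made-up string x in place of an element of test, the two spellings should give the same split:

```r
# Hypothetical short string standing in for one element of `test`
x <- "a|b|c"

# Regex form: the pipe is escaped so that it is matched literally
by_regex <- strsplit(x, "\\|")

# Fixed form: the pattern is taken as a literal string, so no escaping is needed
by_fixed <- strsplit(x, "|", fixed = TRUE)

# TRUE when both calls produce the same list of fields
identical(by_regex, by_fixed)
```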

Alternatively, use sapply, which returns a matrix with one column per string:

sapply(strsplit(test, '\\|'), function(x) x[c(2:4,10)])
     [,1]                                   [,2]            
[1,] "splice_region_variant&intron_variant" "intron_variant"
[2,] "LOW"                                  "MODIFIER"      
[3,] "PER3"                                 "PIGK"          
[4,] "c.1658+7G>T"                          "c.1072-59T>C"
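Since the question asks for four columns, the sapply result may need to be transposed so that each string becomes a row. A minimal sketch, again with a made-up vector x in place of test; the column names field2 ... field10 are placeholders and do not come from the question:

```r
# Hypothetical stand-in for `test`
x <- c("a1|b1|c1|d1|e1|f1|g1|h1|i1|j1|k1",
       "a2|b2|c2|d2|e2|f2|g2|h2|i2|j2")

# One column per string, one row per extracted field (4 x 2 here)
m <- sapply(strsplit(x, "|", fixed = TRUE), function(f) f[c(2:4, 10)])

# Transpose so each string is a row, then convert to a data frame
df <- as.data.frame(t(m), stringsAsFactors = FALSE)

# Placeholder column names, chosen only for illustration
names(df) <- c("field2", "field3", "field4", "field10")
df
```

If a string has fewer than 10 fields, indexing past the end gives NA for that field, so the result still has four columns.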

Data:

test <- c("PR;ANN=T|splice_region_variant&intron_variant|LOW|PER3|ENSG00000049246|transcript|ENST00000377532|protein_coding|13/20|c.1658+7G>T||||||,T|splice_region_variant&intron_variant|LOW|PER3|ENSG00000049246|transcript|ENST00000613533|protein_coding|14/21|c.1658+7G>T||||||,T|splice_region_variant&intron_variant|LOW|PER3|ENSG00000049246|transcript|ENST00000614998|protein_coding|14/22|c.1658+7G>T||||||,T|splice_region_variant&intron_variant|LOW|PER3|ENSG00000049246|transcript|ENST00000361923|protein_coding|13/20|c.1634+7G>T||||||,T|intron_variant|MODIFIER|RP3-467L1.4|ENSG00000236266|transcript|ENST00000451646|antisense|1/2|n.239+7677C>A||||||;AC=64;AC_AFR=1;AC_AMR=0;AC_Adj=64;AC_EAS=0;AC_FIN=0;AC_Het=64;AC_Hom=0;AC_NFE=63;AC_OTH=0;AC_SAS=0;AF=5.271e-04;AN=121410;AN_AFR=10404;AN_AMR=11578;AN_Adj=121084;AN_EAS=8652;AN_FIN=6614;AN_NFE=66616;AN_OTH=906;AN_SAS=16314;CSQ=A|ENSG00000236266|ENST00000451646|Transcript|intron_variant&non_coding_transcript_variant||||||rs200733001|2||-1|RP3-467L1.4|Clo... <truncated>",
"PR;ANN=G|intron_variant|MODIFIER|PIGK|ENSG00000142892|transcript|ENST00000370812|protein_coding|10/10|c.1072-59T>C||||||,G|intron_variant|MODIFIER|PIGK|ENSG00000142892|transcript|ENST00000445065|protein_coding|7/7|c.790-59T>C||||||,G|intron_variant|MODIFIER|PIGK|ENSG00000142892|transcript|ENST00000487906|nonsense_mediated_decay|6/6|n.*561-59T>C||||||")