I want to extract specific information from a column of a data frame and add it to a new column in the same data frame. The complication is that some rows do not contain the information I want to extract (the 6 characters after "UniProt:") at all, while other rows contain several occurrences. I want each row to show this accordingly, because this column holds the identifiers in my data frame.
Here is an example; I copied a few rows of Fasta.headers from my data frame:
Row 1:
H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1
Row 2:
C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1
Row 3:
ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1
Row 4:
T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1
I would like the output to be:
H2L0A8;Q9TXU2
O44447
O61907;Q76NP4;P34454
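Answer 0 (score: 6)
strapplyc from the gsubfn package extracts the desired strings from x,
and sapply collapses multiple strings per element into a single string separated by semicolons: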
library(gsubfn)
sapply(strapplyc(x, "UniProt:([^;]*)"), paste, collapse = ";")
which gives:
[1] "H2L0A8;Q9TXU2" "O44447" ""
[4] "O61907;Q76NP4;P34454"
where x is:
x <- c("H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1",
"C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1",
"ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1",
"T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1")
ADDED some explanation.
Answer 1 (score: 3)
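Using a less commonly used alternative, the replacement function regmatches<-:
# Blank out everything that is NOT "UniProt:" plus the next 7 characters
# (the 6-character ID and its trailing ";"). Note that this overwrites x.
regmatches(x,gregexpr("UniProt:.{7}",x),invert=TRUE) <- ""
# Drop the "UniProt:" prefixes, leaving only the IDs and their semicolons.
gsub("UniProt:","",x)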
#[1] "H2L0A8;Q9TXU2;"
#[2] "O44447;"
#[3] ""
#[4] "O61907;Q76NP4;P34454;"
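The variant above leaves a trailing semicolon on each non-empty result and modifies x in place. A minimal sketch of one way around both, assuming abbreviated stand-ins for the question's header strings (the shortened strings and the name y are illustrative, not from the original post):

```r
# Abbreviated, hypothetical stand-ins for two of the question's headers.
x <- c("H05C05.1c;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;UniProt:Q9TXU2;protein_id:CCD72188.1",
       "ZK1127.4;status:Confirmed;protein_id:CCD73716.1")

# Work on a copy so the original x is left untouched.
y <- x

# Replace every non-matching stretch with "", keeping only the
# "UniProt:" prefix plus the following 7 characters (ID and ";").
regmatches(y, gregexpr("UniProt:.{7}", y), invert = TRUE) <- ""

# Remove the prefixes, then strip the single trailing semicolon, if any.
y <- gsub("UniProt:", "", y)
y <- sub(";$", "", y)
```

This still assumes every identifier is exactly 6 characters long and is followed by a semicolon, as in the sample rows.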
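You can also get there with lookbehind and lookahead assertions by specifying perl=TRUE
for the regular expression (applied to the original, unmodified x):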
sapply(regmatches(x,gregexpr("(?<=UniProt:).+?(?=;)",x,perl=TRUE)),
paste,collapse=";")
#[1] "H2L0A8;Q9TXU2" "O44447"
#[3] "" "O61907;Q76NP4;P34454"
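Since the question asks for the result as a new column in the same data frame, the lookbehind approach can be sketched as follows. This is only an illustration under stated assumptions: the data frame name df and the new column name UniProt are hypothetical, the header strings are abbreviated versions of the question's rows, and the pattern uses [^;]+ so that it does not depend on a semicolon following the last identifier.

```r
# Hypothetical data frame; headers are abbreviated from the question's rows.
df <- data.frame(
  Fasta.headers = c(
    "H05C05.1c;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;UniProt:Q9TXU2;protein_id:CCD72188.1",
    "C02B10.5;UniProt:O44447;protein_id:CCD61167.1",
    "ZK1127.4;status:Confirmed;protein_id:CCD73716.1",
    "T27C4.4a;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;UniProt:Q76NP4;protein_id:CCD74256.1;>F54F2.9;UniProt:P34454;protein_id:CCD71243.1"
  ),
  stringsAsFactors = FALSE
)

# Locate every run of non-semicolon characters preceded by "UniProt:".
m <- gregexpr("(?<=UniProt:)[^;]+", df$Fasta.headers, perl = TRUE)
ids <- regmatches(df$Fasta.headers, m)

# Collapse each row's matches; rows without a match collapse to "".
df$UniProt <- vapply(ids, paste, character(1), collapse = ";")

# Optional: mark rows without an identifier as missing instead of "".
df$UniProt[df$UniProt == ""] <- NA_character_
```

Using vapply instead of sapply guarantees a character vector with one element per row, which is what a data frame column needs. Whether rows without an identifier should be NA or an empty string depends on how the column is used later.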