Subsetting data using regular expressions in R

Time: 2014-05-13 23:36:11

Tags: regex r substring

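I want to extract specific information from a column of a data frame and add it as a new column to the same data frame. The complication is that some rows do not contain the information I want to extract at all (the 6 characters after "UniProt:"), while other rows contain several occurrences. I would like both cases reflected in the new column (empty where there is no match, every identifier where there are several), because this column holds the identifiers in my data frame.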

Here is an example; I copied a few rows of Fasta.headers from my data frame:

Row 1:

H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1

Row 2:

C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1

Row 3:

ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1

Row 4:

T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1

I would like the output to be:

H2L0A8;Q9TXU2
O44447

O61907;Q76NP4;P34454

2 answers:

Answer 0 (score: 6)

strapplyc from the gsubfn package extracts the desired strings from x, and sapply collapses multiple strings into a single string separated by semicolons:

library(gsubfn)
sapply(strapplyc(x, "UniProt:([^;]*)"), paste, collapse = ";")

which gives:

[1] "H2L0A8;Q9TXU2"        "O44447"               ""                    
[4] "O61907;Q76NP4;P34454"

where x is:

x <- c("H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1",
       "C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1",
       "ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1",
       "T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1")

ADDED some explanation.
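Since the question asks for the identifiers as a new column in the same data frame, the extracted strings can be assigned straight to a column. The following is only a minimal sketch: the data frame name df and the new column name UniProt are hypothetical (only Fasta.headers comes from the question), and it uses base R's regmatches and gregexpr instead of gsubfn so that it runs without extra packages.

```r
# Hypothetical data frame; only the column name Fasta.headers is from the question.
df <- data.frame(
  Fasta.headers = c(
    "C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1",
    "ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1",
    "H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1"
  ),
  stringsAsFactors = FALSE
)

# Find every "UniProt:<id>" occurrence; the id runs up to the next semicolon.
hits <- regmatches(df$Fasta.headers,
                   gregexpr("UniProt:[^;]*", df$Fasta.headers))

# Drop the "UniProt:" prefix and join the ids of each row with ";".
# Rows without a match give character(0), which paste() collapses to "".
df$UniProt <- sapply(hits, function(h) {
  paste(sub("^UniProt:", "", h), collapse = ";")
})

print(df$UniProt)
```

Rows without any UniProt entry end up as empty strings; if NA is preferred, they could be recoded afterwards.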

Answer 1 (score: 3)

An alternative using the rarely used replacement function regmatches<-:

regmatches(x,gregexpr("UniProt:.{7}",x),invert=TRUE) <- ""
gsub("UniProt:","",x)
#[1] "H2L0A8;Q9TXU2;"
#[2] "O44447;"
#[3] ""
#[4] "O61907;Q76NP4;P34454;"
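The results above keep a trailing semicolon because .{7} matches the separator along with the six-character identifier. A minimal sketch of the same approach that strips that semicolon and works on a copy, so the input vector is left untouched; the names y and ids are hypothetical, and the sample vector is a shortened version of x:

```r
x <- c(
  "H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1",
  "C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1",
  "ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1"
)

# Work on a copy: the replacement function modifies its first argument.
y <- x

# Blank out everything that is NOT "UniProt:" plus the 7 characters after it.
regmatches(y, gregexpr("UniProt:.{7}", y), invert = TRUE) <- ""

# Remove the "UniProt:" labels, then the single trailing semicolon.
ids <- sub(";$", "", gsub("UniProt:", "", y, fixed = TRUE))

print(ids)
```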

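You could also achieve this with lookahead and lookbehind by specifying perl=TRUE for the regular expression (applied to the original x, since the regmatches<- call above overwrites x):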

sapply(regmatches(x,gregexpr("(?<=UniProt:).+?(?=;)",x,perl=TRUE)),
       paste,collapse=";")

#[1] "H2L0A8;Q9TXU2"        "O44447"              
#[3] ""                     "O61907;Q76NP4;P34454"
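The question describes the identifier as the six characters after "UniProt:". If that fixed length can be relied on (an assumption, although the sample rows satisfy it), the lookbehind can be combined with a fixed-width pattern instead of a lazy match up to the next semicolon. This would also pick up an identifier that is the last field and has no semicolon after it. A sketch under those assumptions; the character class [A-Z0-9] is inferred from the sample identifiers, and the last element of x below is a made-up row added to show the no-semicolon case:

```r
x <- c(
  "C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1",
  "ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1",
  "F54F2.9;CE39158;UniProt:P34454"  # hypothetical row: id is the last field
)

# Lookbehind anchors the match right after "UniProt:";
# {6} takes exactly six upper-case letters or digits.
m <- gregexpr("(?<=UniProt:)[A-Z0-9]{6}", x, perl = TRUE)

# Join the ids of each element with ";" (no match gives "").
ids <- sapply(regmatches(x, m), paste, collapse = ";")

print(ids)
```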