How to remove part of a string using an R regex with a boundary

Time: 2017-11-17 01:11:23

Tags: r regex

I have these 3 example strings:

x <- "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer(0.989)More Information | Similar Motifs Found"
y <- "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer(0.828)More Information | Similar Motifs Found"
z <- "SPIB/MA0081.1/Jaspar(0.753)More Information | Similar Motifs Found"

What I want to do is remove everything that comes after the first word following the last / delimiter, so that the result is:

AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer
NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer
SPIB/MA0081.1/Jaspar

I tried the following, but it doesn't give me what I want:

> sub("\\(.*?\\)More Information | Similar Motifs Found","",x)
[1] "AP-1| Similar Motifs Found"

What is the right way to do it?

1 Answer:

Answer 0: (score: 1)

You can use the greedy pattern .*/\\w+ to match up to the last /word, then use a back reference to extract that group:

v <- c("AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer(0.989)More Information | Similar Motifs Found",
       "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer(0.828)More Information | Similar Motifs Found",
       "SPIB/MA0081.1/Jaspar(0.753)More Information | Similar Motifs Found")

sub("(.*/\\w+).*", "\\1", v)
# [1] "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer" "NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer"
# [3] "SPIB/MA0081.1/Jaspar"

In (.*/\\w+).*, the first .* is greedy and matches as much as possible, with the stop condition being / followed by a word (matched by /\\w+); the second .* matches the rest of the string.
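
As a minimal sketch, the same substitution can be wrapped in a small helper to make its behaviour easier to check. The function name trim_after_last_word is hypothetical and not part of the original answer; the sketch also assumes the word after the last / consists only of letters, digits, or underscores, since \\w would stop at a hyphen or a dot.

```r
# Hypothetical helper (name is an assumption, not from the original answer):
# keep everything up to and including the first word after the last "/".
trim_after_last_word <- function(s) {
  # Greedy .* backtracks to the last "/" that is followed by word characters;
  # the trailing .* consumes the rest, and \\1 keeps only the captured group.
  sub("(.*/\\w+).*", "\\1", s)
}

x <- "AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer(0.989)More Information | Similar Motifs Found"
z <- "SPIB/MA0081.1/Jaspar(0.753)More Information | Similar Motifs Found"

res <- trim_after_last_word(c(x, z))
print(res)

# A string with no "/" does not match the pattern, so sub() returns it unchanged.
no_slash <- trim_after_last_word("no delimiter here")
print(no_slash)
```

Because sub() leaves non-matching input untouched, strings without a / delimiter pass through as-is, which may or may not be the desired behaviour for a given data set.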