How to use stringr::str_match to extract substrings in R

Time: 2018-01-25 02:26:17

Tags: r regex tidyverse stringr

I have the following two strings:

x <- "chr1:625000-635000.BB_162.Adipose"
y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"

With this regex, I can capture the parts of x:

> stringr::str_match(x,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)")
     [,1]                                [,2]   [,3]     [,4]     [,5]     [,6]     
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose"

For y, I would like to get this:

     [,1]                                [,2]   [,3]     [,4]     [,5]     [,6]     
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad"  "chr1" "625000" "635000" "BB_162" "HMSC-ad"

Applying my current regex to y, I get this instead:

   [,1]                                 [,2]   [,3]     [,4]     [,5]     [,6]      
[1,] "chr1:625000-635000.BB_162.combined" "chr1" "625000" "635000" "BB_162" "combined"

How can I generalize my regex so that it handles both x and y?

Update

S.Kalbar, your regex gives this:

> stringr::str_match(y,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)(?:\\.([A-Za-z-]+))?")
     [,1]                                         [,2]   [,3]     [,4]     [,5]     [,6]       [,7]     
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "combined" "HMSC-ad"
> stringr::str_match(x,"(\\w+):(\\d+)-(\\d+)\\.(\\w+)\\.(\\w+)(?:\\.([A-Za-z-]+))?")
     [,1]                                [,2]   [,3]     [,4]     [,5]     [,6]      [,7]
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose" NA 

What I'm hoping to get for y is:

                                          [,1]     [,2]   [,3]     [,4]     [,5]     [,6]        
[1,] "chr1:625000-635000.BB_162.combined.HMSC-ad" "chr1" "625000" "635000" "BB_162" "HMSC-ad"

and this for x:

                                   [,1]  [,2]   [,3]     [,4]     [,5]     [,6]      
[1,] "chr1:625000-635000.BB_162.Adipose" "chr1" "625000" "635000" "BB_162" "Adipose" 

2 Answers:

Answer 0 (score: 1)

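Regex: (\w+):(\d+)-(\d+)\.(\w+)(?:\.\w+)?(?:\.([A-Za-z-]+))

The optional non-capturing group (?:\.\w+)? skips an intermediate field such as .combined, so the final field always lands in the fifth capture group.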
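
As a minimal sketch of how this regex might be used from R (the answer gives only the bare pattern, so the doubled backslashes and the variable names `pattern` and `m` are assumptions made here for illustration):

```r
library(stringr)

x <- "chr1:625000-635000.BB_162.Adipose"
y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"

# Backslashes are doubled because the pattern is written as an R string literal.
pattern <- "(\\w+):(\\d+)-(\\d+)\\.(\\w+)(?:\\.\\w+)?(?:\\.([A-Za-z-]+))"

# str_match() is vectorised, so both strings can be matched in one call.
# Column 1 holds the full match; columns 2 to 6 hold the five capture groups.
m <- str_match(c(x, y), pattern)

m[, 6]
# [1] "Adipose" "HMSC-ad"
```

Because the skipped field sits in a non-capturing group, the result has six columns for both inputs, which is the shape the question asks for.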


Answer 1 (score: 1)

You could give the engine a few tokens to split on:

(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+

Broken down, this says:

(?:(?<=\\d)-(?=\\d))  # a dash between numbers
|                     # or
(?:\\.combined\\.)    # .combined. literally
|                     # or
[.:]+                 # one or more of . or :
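
To see which delimiters each alternative actually consumes, one option is to extract the matches instead of splitting on them. This is only an illustrative sketch; the variable names `pattern` and `delims` are assumptions, not part of the answer:

```r
library(stringr)

y <- "chr1:625000-635000.BB_162.combined.HMSC-ad"
pattern <- "(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+"

# str_extract_all() returns every piece of the string the pattern matches,
# i.e. exactly the text that str_split() would throw away.
delims <- str_extract_all(y, pattern)[[1]]

delims
# [1] ":"          "-"          "."          ".combined."
```

The dash in HMSC-ad is not listed because it does not sit between two digits, so the lookarounds in the first alternative reject it.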

In R, using str_split():

library(stringr)

x <- c("chr1:625000-635000.BB_162.Adipose", "chr1:625000-635000.BB_162.combined.HMSC-ad")
str_split(x, '(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+', simplify = TRUE)

Which yields

     [,1]   [,2]     [,3]     [,4]     [,5]     
[1,] "chr1" "625000" "635000" "BB_162" "Adipose"
[2,] "chr1" "625000" "635000" "BB_162" "HMSC-ad"
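
If named, typed columns are more convenient than a bare character matrix, the result can be converted to a data frame. The sketch below assumes hypothetical column names (`chrom`, `start`, `end`, `sample`, `tissue`) that do not appear in the question:

```r
library(stringr)

x <- c("chr1:625000-635000.BB_162.Adipose",
       "chr1:625000-635000.BB_162.combined.HMSC-ad")

parts <- str_split(x, "(?:(?<=\\d)-(?=\\d))|(?:\\.combined\\.)|[.:]+",
                   simplify = TRUE)

# Hypothetical column names, chosen only for illustration.
colnames(parts) <- c("chrom", "start", "end", "sample", "tissue")

df <- as.data.frame(parts, stringsAsFactors = FALSE)

# The coordinates come back as character strings, so convert them explicitly.
df$start <- as.integer(df$start)
df$end   <- as.integer(df$end)
```

Note that this relies on every input splitting into exactly five pieces; a string with an extra field other than .combined. would produce a wider matrix and the colnames assignment would fail.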