Question

The different formats in df$PlateName are:

    MIPS_AGRE_P01_DIL
    MIPS_SSC_P50_DIL
    MIPS_MtS_P34
    MIPS_AT_P1_DIL
    KORgex.mips.G12
    MIPS_SSC_CL_P32_DIL
    MIPS_SSC_CL_Low_DIL

Using this very clunky regex returns the following types:

str_match(df$PlateName, 
          "MIPS_([:alnum:]+(?:_[:alnum:]+)?)_[Low|P:digit:]+(?:_[DIL])?|(KORgex).*") %>%
  as.tibble %>% 
  count(V2)

All of the NAs in the counts below are of the KORgex.mips.G12 type. How can I make this regex work?

AGRE    1654            
AT      93          
MtS     1324            
SSC     5280            
SSC_CL  288         
NA      529

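Update

I realized that str_extract might be a better fit here, since it returns only the matched part of each element of df$PlateName.

I still can't get the code to return what I need. What am I missing?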

str_extract(data$PlateName, "[[:alnum:]+^(?!(MIPS))]_([[:alnum:]&&[^P]]+(_CL)?)?|(KORgex)") %>% 
as.tibble  %>% 
count(value)

This returns:

KORgex      529         
S_AGRE      1654            
S_AT        93          
S_MtS       1324            
S_SSC       5280            
S_SSC_CL    288

I cannot for the life of me get rid of the S_ that is left over from MIPS_ in the subtypes!

Answer 1

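The best option here would be a branch reset group, (?|...|...), which numbers the capture groups in every alternative from the same index, so you get a single group instead of several.

However, the stringr / stringi functions in R are based on the ICU regex flavor, which does not support branch reset groups.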
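Because ICU has no branch reset, a stringr-only route has to accept two capture groups and merge them afterwards. The following is a minimal sketch of that idea, not a definitive solution: the names plates, m and subtype are illustrative, the input is just the seven sample names from the question, and the pattern is a tidied variant of the asker's regex (POSIX classes inside brackets, a group for Low or P plus digits).

```r
library(stringr)

# Sample plate names copied from the question (illustrative input only).
plates <- c("MIPS_AGRE_P01_DIL", "MIPS_SSC_P50_DIL", "MIPS_MtS_P34",
            "MIPS_AT_P1_DIL", "KORgex.mips.G12", "MIPS_SSC_CL_P32_DIL",
            "MIPS_SSC_CL_Low_DIL")

# Two alternatives, two capture groups: column 2 holds the MIPS subtype,
# column 3 holds "KORgex". Only one of them is non-NA for any given row.
m <- str_match(
  plates,
  "MIPS_([[:alnum:]]+(?:_[[:alnum:]]+)?)_(?:Low|P[[:digit:]]+)(?:_DIL)?|(KORgex).*"
)

# Merge the two columns by hand, taking whichever group took part in the match.
subtype <- ifelse(is.na(m[, 2]), m[, 3], m[, 2])

table(subtype)
```

Counting column V2 alone, as in the question, only ever sees the first group, which is why every KORgex row showed up as NA.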

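The most convenient way to use branch reset here is through base R's regex functions with perl=TRUE (PCRE), for example regexec together with regmatches. Note that POSIX classes have to sit inside a bracket expression, [[:alnum:]], and that "Low, or P followed by digits" needs a group rather than a character class:

m <- regmatches(df$PlateName, regexec(
  "(?|MIPS_([[:alnum:]]+(?:_[[:alnum:]]+)?)_(?:Low|P[[:digit:]]+)(?:_DIL)?|(KORgex).*)",
  df$PlateName, perl=TRUE))
sapply(m, `[`, 2)  # the single capture group of each match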

Answer 2

I think this should work. After messing around with str_match for a while, I decided it is easier to use str_replace to remove everything you don't want.

df$PlateName %>%
  str_replace("([[:alpha:]]+_)?([[:alpha:]]+)(_CL)?(_|\\.)??.*", "\\2\\3") %>%
  as_tibble() %>%
  count(value)
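
To see what each piece of that pattern does, here is a small self-contained check on the seven sample names from the question. The variable names plates and subtype are illustrative only, and this is a sketch on the sample values rather than a test against the full data.

```r
library(stringr)

# Sample plate names copied from the question (illustrative input only).
plates <- c("MIPS_AGRE_P01_DIL", "MIPS_SSC_P50_DIL", "MIPS_MtS_P34",
            "MIPS_AT_P1_DIL", "KORgex.mips.G12", "MIPS_SSC_CL_P32_DIL",
            "MIPS_SSC_CL_Low_DIL")

# Group 1: optional letters-plus-underscore prefix such as "MIPS_" (dropped).
# Group 2: the subtype letters, e.g. "AGRE" or "KORgex" (kept).
# Group 3: optional "_CL" suffix (kept).
# Group 4 and ".*": whatever follows, consumed so that it gets replaced away.
subtype <- str_replace(
  plates,
  "([[:alpha:]]+_)?([[:alpha:]]+)(_CL)?(_|\\.)??.*",
  "\\2\\3"
)

print(subtype)
```

Because the whole string is matched and replaced by groups 2 and 3 only, no alternation is needed and the KORgex names go through the same pattern as the MIPS ones.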

Answer 3

Hope this helps!

library(stringr)
library(dplyr)

#this step places "|" symbol to match either of two regex patterns
#(?:P|Low) is a group; [P|Low] would be a character class matching any one of P, |, L, o, w
str_match(df$PlateName, "MIPS_(\\S+)_(?:P|Low).*|(KORgex).*") %>%
  #convert to dataframe to count its occurrences
  data.frame(stringsAsFactors=FALSE) %>%
  mutate(sub_PlateName = coalesce(X2, X3)) %>%
  group_by(sub_PlateName) %>%
  tally()

Output:

  sub_PlateName     n
1 AGRE              1
2 AT                1
3 KORgex            1
4 MtS               1
5 SSC               1
6 SSC_CL            2

Sample data:

df <- structure(list(PlateName = c("MIPS_AGRE_P01_DIL", "MIPS_SSC_P50_DIL", 
"MIPS_MtS_P34", "MIPS_AT_P1_DIL", "KORgex.mips.G12", "MIPS_SSC_CL_P32_DIL", 
"MIPS_SSC_CL_Low_DIL")), .Names = "PlateName", class = "data.frame", row.names = c(NA, 
-7L))

Update: using str_extract

str_extract(df$PlateName, "(?<=MIPS_)\\S+(?=_P|_Low)|KORgex") %>%
  as.tibble %>%
  count(value)
#  value      n
#1 AGRE       1
#2 AT         1
#3 KORgex     1
#4 MtS        1
#5 SSC        1
#6 SSC_CL     2

How do I place the | symbol to match either of two regex patterns using stringr::str_match?
