R grep regex: finding vector elements that contain an exact string twice

Date: 2018-03-14 06:59:09

Tags: r regex

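Following up on this question, I have another example to which I cannot apply the accepted answer.

This time, for each string in the groups vector, I want to find every element of labs in which that EXACT string occurs twice.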

labs <- c("Beijing T0 - BC-89 + CN --vs-- Zhangjiakou T0 - BC-89 + CN",
"Beijing T24 - BC-89 + CN --vs-- Zhangjiakou T24 - BC-89 + CN",
"Beijing T0 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Zhangjiakou T0 - BC-89 + CN with 2% DD + 1.6% ZC",
"Beijing T24 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Zhangjiakou T24 - BC-89 + CN with 2% DD + 1.6% ZC",
"Beijing T0 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Zhangjiakou T0 - BC-89 with 2% Puricare + 5% Merquat + CN",
"Beijing T24 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Zhangjiakou T24 - BC-89 with 2% Puricare + 5% Merquat + CN",
"Beijing T0 - BC-89 + CN --vs-- Beijing T24 - BC-89 + CN",
"Zhangjiakou T0 - BC-89 + CN --vs-- Zhangjiakou T24 - BC-89 + CN",
"Beijing T0 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Beijing T24 - BC-89 + CN with 2% DD + 1.6% ZC",
"Zhangjiakou T0 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Zhangjiakou T24 - BC-89 + CN with 2% DD + 1.6% ZC",
"Beijing T0 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Beijing T24 - BC-89 with 2% Puricare + 5% Merquat + CN",
"Zhangjiakou T0 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Zhangjiakou T24 - BC-89 with 2% Puricare + 5% Merquat + CN",
"Beijing T0 - BC-89 + CN --vs-- Beijing T0 - BC-89 + CN with 2% DD + 1.6% ZC",
"Beijing T0 - BC-89 + CN --vs-- Beijing T0 - BC-89 with 2% Puricare + 5% Merquat + CN",
"Beijing T0 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Beijing T0 - BC-89 with 2% Puricare + 5% Merquat + CN",
"Beijing T24 - BC-89 + CN --vs-- Beijing T24 - BC-89 + CN with 2% DD + 1.6% ZC",
"Beijing T24 - BC-89 + CN --vs-- Beijing T24 - BC-89 with 2% Puricare + 5% Merquat + CN",
"Beijing T24 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Beijing T24 - BC-89 with 2% Puricare + 5% Merquat + CN",
"Zhangjiakou T0 - BC-89 + CN --vs-- Zhangjiakou T0 - BC-89 + CN with 2% DD + 1.6% ZC",
"Zhangjiakou T0 - BC-89 + CN --vs-- Zhangjiakou T0 - BC-89 with 2% Puricare + 5% Merquat + CN",
"Zhangjiakou T0 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Zhangjiakou T0 - BC-89 with 2% Puricare + 5% Merquat + CN",
"Zhangjiakou T24 - BC-89 + CN --vs-- Zhangjiakou T24 - BC-89 + CN with 2% DD + 1.6% ZC",
"Zhangjiakou T24 - BC-89 + CN --vs-- Zhangjiakou T24 - BC-89 with 2% Puricare + 5% Merquat + CN",
"Zhangjiakou T24 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Zhangjiakou T24 - BC-89 with 2% Puricare + 5% Merquat + CN")
labs
groups <- c("BC-89 + CN", "BC-89 + CN with 2% DD + 1.6% ZC", "BC-89 with 2% Puricare + 5% Merquat + CN")
groups

Here is my attempt, which does not work:

A <- grep(gsub("\\+", "\\\\+", paste0(groups[1], "{2}")), labs, value=TRUE) #only elements with exactly "BC-89 + CN" appearing twice
B <- grep(gsub("\\+", "\\\\+", paste0(groups[2], "{2}")), labs, value=TRUE) #only elements with exactly "BC-89 + CN with 2% DD + 1.6% ZC" appearing twice
C <- grep(gsub("\\+", "\\\\+", paste0(groups[3], "{2}")), labs, value=TRUE) #only elements with exactly "BC-89 with 2% Puricare + 5% Merquat + CN" appearing twice

The desired output is (note that I want exact groups, so "BC-89 + CN" should not match "BC-89 + CN with 2% DD + 1.6% ZC"):

> A
[1] "Beijing T0 - BC-89 + CN --vs-- Zhangjiakou T0 - BC-89 + CN"     
[2] "Beijing T24 - BC-89 + CN --vs-- Zhangjiakou T24 - BC-89 + CN"   
[3] "Beijing T0 - BC-89 + CN --vs-- Beijing T24 - BC-89 + CN"        
[4] "Zhangjiakou T0 - BC-89 + CN --vs-- Zhangjiakou T24 - BC-89 + CN"
> B
[1] "Beijing T0 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Zhangjiakou T0 - BC-89 + CN with 2% DD + 1.6% ZC"     
[2] "Beijing T24 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Zhangjiakou T24 - BC-89 + CN with 2% DD + 1.6% ZC"   
[3] "Beijing T0 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Beijing T24 - BC-89 + CN with 2% DD + 1.6% ZC"        
[4] "Zhangjiakou T0 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Zhangjiakou T24 - BC-89 + CN with 2% DD + 1.6% ZC"
> C
[1] "Beijing T0 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Zhangjiakou T0 - BC-89 with 2% Puricare + 5% Merquat + CN"     
[2] "Beijing T24 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Zhangjiakou T24 - BC-89 with 2% Puricare + 5% Merquat + CN"   
[3] "Beijing T0 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Beijing T24 - BC-89 with 2% Puricare + 5% Merquat + CN"        
[4] "Zhangjiakou T0 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Zhangjiakou T24 - BC-89 with 2% Puricare + 5% Merquat + CN"

1 answer:

Answer 0 (score: 1)

You should use sprintf("(%s.*){2}", groups[1]), which for matching purposes is equivalent to paste0(groups[1], ".*", groups[1]).

a <- grep(gsub("\\+", "\\\\+", sprintf("(%s.*){2}", groups[1])), labs)
b <- grep(gsub("\\+", "\\\\+", sprintf("(%s.*){2}", groups[2])), labs)
c <- grep(gsub("\\+", "\\\\+", sprintf("(%s.*){2}", groups[3])), labs)

Output:

> print(list(a, b, c))
# [[1]]
#  [1]  1  2  3  4  7  8  9 10 13 16 19 22
# 
# [[2]]
# [1]  3  4  9 10
# 
# [[3]]
# [1]  5  6 11 12

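Taking groups[1] ("BC-89 + CN") as an example: even with the whole string grouped, a bare {2} would only find elements containing "BC-89 + CNBC-89 + CN" back to back (and without parentheses, {2} applies only to the final "N"). Other characters can appear between the occurrences of the string you want, hence the .* inside the repeated group.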
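To make the quantifier scoping concrete, here is a minimal sketch on a small made-up vector (x is hypothetical and is not the asker's data). It compares an ungrouped {2}, a grouped {2}, and the grouped form with .* that allows text between the two occurrences.

```r
# Hypothetical toy vector, used only to illustrate quantifier scope
x <- c("ab", "abb", "abab", "ab-ab")

# {2} binds to the preceding character only, so this pattern means "abb"
grepl("ab{2}", x)       # FALSE  TRUE FALSE FALSE

# Grouping makes {2} repeat the whole string, but only back to back
grepl("(ab){2}", x)     # FALSE FALSE  TRUE FALSE

# Adding .* inside the group allows other characters between occurrences
grepl("(ab.*){2}", x)   # FALSE FALSE  TRUE  TRUE
```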

Edit:

Since the "BC-89 + CN" group should not include "BC-89 + CN with 2% DD + 1.6% ZC", one more step is needed:

a <- a[!a %in% b]

Output:

> print(a)
# [1]  1  2  7  8 13 16 19 22
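The grep patterns above escape only the + characters, so the "." in "1.6%" still acts as a regex wildcard. That is harmless for this data, but as a sketch of one way to avoid escaping altogether, occurrences can be counted with literal matching via gregexpr(..., fixed = TRUE). The helper name count_fixed and the toy input are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper: count literal (non-regex) occurrences of `pattern`
# in each element of `x`. gregexpr() returns -1 when there is no match,
# so only positive start positions are counted.
count_fixed <- function(pattern, x) {
  m <- gregexpr(pattern, x, fixed = TRUE)
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

# Toy input: "+" is treated literally, no escaping needed
toy <- c("a+b and a+b", "a+b", "aab")
count_fixed("a+b", toy)             # 2 1 0
which(count_fixed("a+b", toy) == 2) # 1
```

Like the regex version, this counts substrings, so a shorter group name is still counted inside a longer one that contains it.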

Edit 2:

I noticed that you may want to check whether the group string appears both before and after '--vs--' (i.e. twice), so here is another approach.

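check_group <- function(ele, group) {
  # Split the label into the parts before and after the separator
  x <- strsplit(ele, " --vs-- ")[[1]]
  # Escape the regex metacharacters that occur in the group names
  group <- gsub("\\-", "\\\\-", group)
  group <- gsub("\\+", "\\\\+", group)
  # Anchor at the end so "BC-89 + CN" cannot match "BC-89 + CN with ..."
  group <- paste0(group, "$")
  # Both conditions are length-one, so use the scalar operator &&
  if (grepl(group, x[[1]]) && grepl(group, x[[2]])) {
    return(ele)
  } else {
    return(NULL)
  }
}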

remove_null <- function(x) {
  return(unlist(x[!sapply(x, is.null)]))
}


a1 <- remove_null(lapply(labs, check_group, groups[1]))
a2 <- remove_null(lapply(labs, check_group, groups[2]))
a3 <- remove_null(lapply(labs, check_group, groups[3]))

Output:

> print(list(a1, a2, a3))
# [[1]]
# [1] "Beijing T0 - BC-89 + CN --vs-- Zhangjiakou T0 - BC-89 + CN"      "Beijing T24 - BC-89 + CN --vs-- Zhangjiakou T24 - BC-89 + CN"   
# [3] "Beijing T0 - BC-89 + CN --vs-- Beijing T24 - BC-89 + CN"         "Zhangjiakou T0 - BC-89 + CN --vs-- Zhangjiakou T24 - BC-89 + CN"
# 
# [[2]]
# [1] "Beijing T0 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Zhangjiakou T0 - BC-89 + CN with 2% DD + 1.6% ZC"     
# [2] "Beijing T24 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Zhangjiakou T24 - BC-89 + CN with 2% DD + 1.6% ZC"   
# [3] "Beijing T0 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Beijing T24 - BC-89 + CN with 2% DD + 1.6% ZC"        
# [4] "Zhangjiakou T0 - BC-89 + CN with 2% DD + 1.6% ZC --vs-- Zhangjiakou T24 - BC-89 + CN with 2% DD + 1.6% ZC"
# 
# [[3]]
# [1] "Beijing T0 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Zhangjiakou T0 - BC-89 with 2% Puricare + 5% Merquat + CN"     
# [2] "Beijing T24 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Zhangjiakou T24 - BC-89 with 2% Puricare + 5% Merquat + CN"   
# [3] "Beijing T0 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Beijing T24 - BC-89 with 2% Puricare + 5% Merquat + CN"        
# [4] "Zhangjiakou T0 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Zhangjiakou T24 - BC-89 with 2% Puricare + 5% Merquat + CN"
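As a possible variation on the same idea, the check can be done without any regex by splitting on the separator and testing each side with endsWith() (available in base R since 3.3.0). This is only a sketch: the helper name both_sides_end_with and its sep argument are assumptions for illustration, and demo holds three entries copied from labs.

```r
# Hypothetical helper: TRUE for labels where both sides of the separator
# end with the literal string `group`
both_sides_end_with <- function(labels, group, sep = " --vs-- ") {
  parts <- strsplit(labels, sep, fixed = TRUE)
  vapply(parts,
         function(p) length(p) == 2 && all(endsWith(p, group)),
         logical(1))
}

# Three entries copied from labs (elements 1, 13 and 5)
demo <- c(
  "Beijing T0 - BC-89 + CN --vs-- Zhangjiakou T0 - BC-89 + CN",
  "Beijing T0 - BC-89 + CN --vs-- Beijing T0 - BC-89 + CN with 2% DD + 1.6% ZC",
  "Beijing T0 - BC-89 with 2% Puricare + 5% Merquat + CN --vs-- Zhangjiakou T0 - BC-89 with 2% Puricare + 5% Merquat + CN"
)

both_sides_end_with(demo, "BC-89 + CN")   # TRUE FALSE FALSE
demo[both_sides_end_with(demo, "BC-89 + CN")]
```

Because the function is vectorized over labels, the lapply() and NULL-removal steps are not needed; labs[both_sides_end_with(labs, groups[1])] would return the matching strings directly.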