Matching multiple strings that contain characters needing escaping in R

Time: 2018-02-23 12:07:05

Tags: r regex escaping

I have a vector of text strings that contain emoticons, and a dictionary that contains only emoticons.

A <- c("This :/ :/ :) ^^","is :/ ^^", "weird^^ :)")
B <- c(":)",":/","^^")

For each text string I want to extract every emoticon match, including duplicates, so my output should look like this:

[[1]]
[1] ":/" ":/" ":)" "^^"

[[2]]
[1] ":/" "^^"

[[3]]
[1] "^^" ":)"

This is what I have tried so far:

library(stringr)

# does not return duplicates
sapply(A, function(x) B[str_detect(x, fixed(B))], USE.NAMES = FALSE)

[[1]]
[1] ":)" ":/" "^^"

[[2]]
[1] ":/" "^^"

[[3]]
[1] ":)" "^^"

# Only returns first instance
str_extract_all(A,fixed(B))

[[1]]
[1] ":)"

[[2]]
[1] ":/"

[[3]]
[1] "^^"

# returns error because of unescaped characters
rm_default(A,pattern=B,fixed=TRUE,extract=TRUE)
Error in stringi::stri_extract_all_regex(text.var, pattern) : 
  Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
In addition: Warning messages:
1: In if (substring(pattern, 1, 4) == "@rm_") { :
  the condition has length > 1 and only the first element will be used
2: In if (substring(pattern, 1, 1) == "@") { :
  the condition has length > 1 and only the first element will be used

Any help is much appreciated.

2 Answers:

Answer 0: (score: 1)

One option is to split each string with strsplit and then keep only the elements that are contained in "B":

lapply(strsplit(A, "[A-Za-z ]"), function(x) x[x %in% B])
#[[1]]
#[1] ":/" ":/" ":)" "^^"

#[[2]]
#[1] ":/" "^^"

#[[3]]
#[1] "^^" ":)"
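
Note that this approach relies on every emoticon being delimited by letters or spaces. A minimal sketch of that limitation, using a hypothetical input `A2` (not from the question) in which two emoticons sit directly next to each other:

```r
B <- c(":)", ":/", "^^")

# Hypothetical input: the first string has two emoticons with nothing between them
A2 <- c("ok :):/", "fine :) :/")

# Split on letters and spaces, then keep only pieces that are exact dictionary entries
tokens <- strsplit(A2, "[A-Za-z ]")
res <- lapply(tokens, function(x) x[x %in% B])
print(res)
```

In the first string the piece ":):/" is not itself a dictionary entry, so nothing is returned for it; the second string, where the emoticons are separated by a space, yields both matches.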

Answer 1: (score: 1)

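You can build a regex dynamically from the items in B. First sort the items by length in descending order, so that if B contains both :)) and :), the longer :)) can still be extracted. This is required in an unanchored NFA regex, where the first alternative in an alternation group "wins" (see the "Remember That The Regex Engine Is Eager" section). Then escape each item and join the items with |. After that, just call regmatches / stringr::str_extract_all.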

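# Escape regex metacharacters (including ".") so each item is matched literally
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

# Sort a character vector by string length, longest first
sort.by.length.desc <- function(v) v[order(-nchar(v))]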

A <- c("This :/ :/ :) ^^","is :/ ^^", "weird^^ :)")
B <- c(":)",":/","^^")

B <- sort.by.length.desc(B)
pattern <- paste(regex.escape(B), collapse="|")
regmatches(A, gregexpr(pattern, A))

See "Remember That The Regex Engine Is Eager".

In this case the pattern will be :\)|:/|\^\^ and the output will be

[[1]]
[1] ":/" ":/" ":)" "^^"

[[2]]
[1] ":/" "^^"

[[3]]
[1] "^^" ":)"
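
To illustrate why the sorting step matters, here is a minimal sketch with a hypothetical dictionary `B2` and input `text` (neither is from the question) in which one emoticon is a prefix of another. It passes `perl = TRUE` so that matching uses PCRE, an engine in which the first alternative that matches wins; other engines may resolve alternation differently, so treat this as an illustration rather than a statement about every regex flavor.

```r
# Same escaping idea as in the answer, repeated so the sketch is self-contained
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

# Hypothetical dictionary: ":)" is a prefix of ":))"
B2 <- c(":)", ":))")
text <- "great :)) ok :)"

# Unsorted: the shorter alternative is tried first and matches inside ":))"
unsorted <- paste(regex.escape(B2), collapse = "|")
m_unsorted <- regmatches(text, gregexpr(unsorted, text, perl = TRUE))[[1]]

# Sorted longest first: ":))" is tried before ":)"
B2_sorted <- B2[order(-nchar(B2))]
sorted <- paste(regex.escape(B2_sorted), collapse = "|")
m_sorted <- regmatches(text, gregexpr(sorted, text, perl = TRUE))[[1]]

print(m_unsorted)
print(m_sorted)
```

With the unsorted pattern the longer emoticon is never reported as a whole; with the sorted pattern both ":))" and ":)" are extracted as distinct matches.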