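Regex in R: word boundaries and standalone characters

Time: 2016-12-08 13:47:45

Tags: r regex

I am using gsub to replace words in a character vector in R, dictionary-style. That is, every member of a given group of synonyms (syn = c("Cash", "\\$")) should be replaced by a single word (word = "MONEY").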

text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")

So far, I replace the synonyms like this:

syn <- c("Cash", "\\$")
word <- "MONEY"

regex <- paste0("\\b(", paste(syn, collapse = "|"), ")\\b")
# "\\b(Cash|\\$)\\b"

gsub(regex, word, text)
# "I spent 100MONEY"     "MONEY can be used"    "Cashier doesnt count" "a separate $" 

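This works when the $ sign is attached to an alphanumeric character, but fails when the sign stands alone. If I drop the word boundaries (\\b), the standalone $ sign is found, but so is the "Cash" in "Cashier".

Do you know how I can keep a word boundary and still match a standalone $ sign?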
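
The failure on the standalone sign follows from how \b is defined: it matches only between a word character (letter, digit or underscore) and a non-word character, and $ is itself a non-word character. A minimal sketch, assuming perl = TRUE so that the PCRE definition of \b applies (the default engine may differ in edge cases); the variable names are illustrative only:

```r
# "\\b\\$" asks for a word boundary immediately before the dollar sign.
attached   <- grepl("\\b\\$", "I spent 100$", perl = TRUE)  # "0" then "$"
standalone <- grepl("\\b\\$", "a separate $", perl = TRUE)  # " " then "$"

print(attached)    # TRUE: a digit next to "$" forms a boundary
print(standalone)  # FALSE: space and "$" are both non-word characters
```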

2 answers:

Answer 0: (score: 2)

Use custom boundaries with a PCRE regex:

  • (?<!\p{L}) - start of a word (no letter immediately before)
  • (?!\p{L}) - end of a word (no letter immediately after)
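
To see what the two lookarounds do on their own, here is a small sketch on a few made-up inputs (the sample strings are illustrative and not taken from the question). Unlike \b, these custom boundaries treat digits and underscores as non-letters, so "Cash5" still counts as a match.

```r
# Custom "letter boundaries": no letter right before, no letter right after.
pattern <- "(?<!\\p{L})Cash(?!\\p{L})"
samples <- c("Cash", "Cashier", "eCash", "Cash5")

hits <- grepl(pattern, samples, perl = TRUE)
print(setNames(hits, samples))
# Cash: TRUE, Cashier: FALSE, eCash: FALSE, Cash5: TRUE
```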

See the regex demo.

Sample R code:

> text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
> syn <- c("Cash", "\\$")
> word <- "MONEY"
> regex <- paste0("(?<!\\p{L})(?:", paste(syn, collapse = "|"), ")(?!\\p{L})")
> gsub(regex, word, text, perl=TRUE)
[1] "I spent 100MONEY"     "MONEY can be used"    "Cashier doesnt count" "a separate MONEY"
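
If the synonyms come from a larger dictionary, it may be convenient to wrap the approach above in a function that also escapes regex metacharacters, so that "$" can be listed without manual backslashes. This is only a sketch: escape_regex and replace_synonyms are hypothetical helper names, not part of base R or of this answer.

```r
# Hypothetical helper: backslash-escape regex metacharacters in each string.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Hypothetical helper: replace any synonym that has no letter on either side.
replace_synonyms <- function(text, syn, word) {
  alternatives <- paste(escape_regex(syn), collapse = "|")
  pattern <- paste0("(?<!\\p{L})(?:", alternatives, ")(?!\\p{L})")
  gsub(pattern, word, text, perl = TRUE)
}

text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
result <- replace_synonyms(text, syn = c("Cash", "$"), word = "MONEY")
print(result)
# "I spent 100MONEY" "MONEY can be used" "Cashier doesnt count" "a separate MONEY"
```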

Answer 1: (score: 0)

regex <- paste0("\\b", paste(syn, collapse = "\\b|"))
# "\\bCash\\b|\\$"
gsub(regex, word, text)
# [1] "I spent 100MONEY"     "MONEY can be used"    "Cashier doesnt count" "a separate MONEY"
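
One caveat: this construction only wraps \\b around the elements that come before a separator, so it relies on "\\$" being the last synonym. A small sketch with the order reversed (syn_reversed and regex_reversed are illustrative names; perl = TRUE is used only to pin down the regex engine) shows "Cash" losing its boundaries:

```r
syn_reversed <- c("\\$", "Cash")
regex_reversed <- paste0("\\b", paste(syn_reversed, collapse = "\\b|"))
print(regex_reversed)  # the R string "\\b\\$\\b|Cash"

# "Cash" is no longer guarded by \\b, so it is replaced inside "Cashier".
out <- gsub(regex_reversed, "MONEY", "Cashier doesnt count", perl = TRUE)
print(out)  # "MONEYier doesnt count"
```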