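I am using gsub to replace words in a vector in R, following a dictionary approach. That is, a given group of words (synonyms), syn = c("Cash", "\\$"), should be replaced by a single word (word = "MONEY").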
text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
So far, I have been using this to replace the synonyms:
syn <- c("Cash", "\\$")
word <- "MONEY"
regex <- paste0("\\b(", paste(syn, collapse = "|"), ")\\b")
# "\\b(Cash|\\$)\\b"
gsub(regex, word, text)
# "I spent 100MONEY" "MONEY can be used" "Cashier doesnt count" "a separate $"
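This works when the $ sign is attached to an alphanumeric character, but fails when the sign stands on its own. If I drop the word boundaries (\\b), the $ sign is found, but so is "Cash" in "Cashier".
Do you know how I can keep a word boundary and still match a standalone $ sign?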
Answer 0 (score: 2)
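Use custom boundaries with a PCRE regex:
(?<!\p{L}) - the start of a word (no letter immediately before)
(?!\p{L}) - the end of a word (no letter immediately after)
See the regex demo.
Sample R code: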
> text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
> syn <- c("Cash", "\\$")
> word <- "MONEY"
> regex <- paste0("(?<!\\p{L})(?:", paste(syn, collapse = "|"), ")(?!\\p{L})")
> gsub(regex, word, text, perl=TRUE)
[1] "I spent 100MONEY" "MONEY can be used" "Cashier doesnt count" "a separate MONEY"
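As a minimal sketch, the pattern above could be wrapped in a small helper. The function name replace_synonyms and its arguments are hypothetical (they are not part of the original answer), and the sketch assumes every entry in syn is already a valid, escaped regex fragment:

```r
# Hypothetical helper (not from the original answer): replace any synonym
# that is neither preceded nor followed by a letter.
# Assumes each element of `syn` is already regex-escaped (e.g. "\\$").
replace_synonyms <- function(text, syn, word) {
  regex <- paste0("(?<!\\p{L})(?:", paste(syn, collapse = "|"), ")(?!\\p{L})")
  # perl = TRUE is required: lookarounds and \p{L} are PCRE features
  gsub(regex, word, text, perl = TRUE)
}

text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
result <- replace_synonyms(text, c("Cash", "\\$"), "MONEY")
print(result)
```

Because the custom boundaries only look at letters, a synonym next to a digit, such as the $ in 100$, is still replaced, while "Cash" inside "Cashier" is left alone.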
Answer 1 (score: 0)
regex <- paste0("\\b", paste(syn, collapse = "\\b|"))
#"\\bCash\\b|\\$"
gsub(regex,word,text)
#[1] "I spent 100MONEY" "MONEY can be used" "Cashier doesnt count" "a separate MONEY"
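One caveat may be worth checking before relying on this: paste() with collapse = "\\b|" puts a trailing \\b after every synonym except the last and a leading \\b only before the first, so the resulting pattern depends on the order of syn. The sketch below uses a hypothetical reordered vector, syn_reordered, to illustrate the effect:

```r
word <- "MONEY"
syn_reordered <- c("\\$", "Cash")  # hypothetical: same synonyms, different order

regex <- paste0("\\b", paste(syn_reordered, collapse = "\\b|"))
print(regex)  # the "Cash" alternative now has no \b on either side

out <- gsub(regex, word, "Cashier doesnt count")
print(out)    # "Cash" inside "Cashier" is replaced
```

With this ordering the unbounded "Cash" alternative matches inside "Cashier", so the approach seems safe only when the symbol-like synonym happens to come last.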