I have a list of words in R, as shown below:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
I want to remove the words in the list above from a text such as this one:
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
After removing the unwanted myList words, myText should look like this:
This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.
I am using:
stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")
But this does not help. What should I do?
Answer 0 (score: 1)
gsub(paste0(myList, collapse = "|"), "", myText)
This gives:
[1] "This is  Sample  Text, which  is  better and cleaned , where  is not equal to . This is messy text ."
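As a side note on the plain alternation above, here is a minimal sketch of two things to watch for: without boundaries, short items such as "at" also match inside longer words, and the unescaped "." in "-1.00" matches any character. The two input strings are hypothetical and not taken from the question.

```r
myList <- c("at", "ax", "CL", "OZ", "Gm", "Kg", "C100", "-1.00")
pat <- paste0(myList, collapse = "|")

# Hypothetical inputs, not taken from the question
substring_hit <- gsub(pat, "", "That")   # "at" is matched inside a longer word
dot_hit <- gsub(pat, "", "-1x00")        # the unescaped "." matches any character

print(substring_hit)
print(dot_hit)
```

For the sample text in the question neither case occurs, which is why the result above only differs from the desired one in the leftover spaces.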
Answer 1 (score: 1)
You can use a PCRE regex with the gsub base R function (it will also work with the ICU regex engine in str_replace_all):
\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)
Details
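\s* - zero or more whitespace characters
(?<!\w) - a negative lookbehind that ensures there is no word character immediately before the current position
(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items of the character vector, i.e. the words to remove
(?!\w) - a negative lookahead that ensures there is no word character immediately after the current position.
NOTE: We cannot use the \b word boundary here because items in the myList character vector may start or end with non-word characters, and the meaning of \b is context-dependent.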
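To illustrate the note about \b, here is a small sketch using the "-1.00" item. The input strings are hypothetical fragments chosen for illustration. A \b before "-" only matches when the preceding character is a word character, so whether the item is found depends on what surrounds it, while the lookarounds behave the same way regardless of the item's first and last characters.

```r
# Hypothetical fragments, chosen for illustration only
after_space <- "equal to -1.00."
after_letter <- "x-1.00 y"

# \b between " " and "-" is not a boundary (both are non-word characters),
# so nothing is removed here
with_b_space <- gsub("\\b-1\\.00\\b", "", after_space, perl = TRUE)

# \b between "x" and "-" is a boundary, so the same pattern matches here
with_b_letter <- gsub("\\b-1\\.00\\b", "", after_letter, perl = TRUE)

# The lookarounds only require that no word character is adjacent to the item
with_lookarounds <- gsub("(?<!\\w)-1\\.00(?!\\w)", "", after_space, perl = TRUE)

print(with_b_space)
print(with_b_letter)
print(with_lookarounds)
```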
R demo:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")  # print the assembled pattern (cat has no collapse argument)
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."
Details
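escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes the special characters in each item so that they are matched literally in the PCRE pattern
paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated list of alternatives from the search term vector.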
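The two steps above can be wrapped into one reusable function. The following is only a sketch: the names escape_regex and remove_words are hypothetical, and the escaping step is a variant that runs with perl = TRUE rather than the answer's exact escape_for_pcre function.

```r
# Hypothetical helper: backslash-escape regex metacharacters so each item
# is matched literally (a perl = TRUE variant of the escaping step)
escape_regex <- function(s) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", s, perl = TRUE)
}

# Hypothetical helper: remove whole items, plus any whitespace before them,
# when they are not adjacent to a word character
remove_words <- function(text, words) {
  alternatives <- paste(escape_regex(words), collapse = "|")
  pat <- paste0("\\s*(?<!\\w)(?:", alternatives, ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

myList <- c("at", "ax", "CL", "OZ", "Gm", "Kg", "C100", "-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

cleaned <- remove_words(myText, myList)
print(cleaned)
```

Because gsub is vectorized over its text argument, remove_words should also accept a character vector of several texts.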