I am working with sentences like the following:
Has no anorexia
She denies anorexia
Has anorexia
Positive for Anorexia
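My goal is to exclude sentences containing words such as denies, denied, or no, and to keep only the positive indications of anorexia.
The final result should be: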
Has anorexia
Positive for Anorexia
I tried this option with the grepl function:
negation <- c("no","denies","denied")
if (grepl(paste(negation,collapse="|"), Anorexia_sentences[j]) == TRUE){
Anorexia_sentences[j] <- NA
}
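This does not work. I think the "no" inside the word "anorexia" (A-no-rexia) is causing the problem. Any suggestions on how to solve this would be much appreciated.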
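The suspected cause can be checked directly. The following is a minimal sketch using only base R and the example sentences above (the variable names `sentences` and `flagged` are illustrative). It shows that the unanchored pattern matches the letters "no" inside "anorexia", so every sentence is flagged:

```r
negation <- c("no", "denies", "denied")
sentences <- c("Has no anorexia", "She denies anorexia",
               "Has anorexia", "Positive for Anorexia")

# Collapsing with "|" gives a plain substring alternation: "no|denies|denied".
pattern <- paste(negation, collapse = "|")

# grepl() looks for the pattern anywhere in each string, so the "no"
# inside "anorexia" / "Anorexia" counts as a match too.
flagged <- grepl(pattern, sentences)
names(flagged) <- sentences
print(flagged)
```

Because all four sentences come back as matches, a filter built on this pattern would discard the positive sentences as well.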
Answer 0 (score: 4)
The corpus library has functions similar to their stringr equivalents, but they work at the term level rather than the character level. This works:
library(corpus)
negation <- c("no", "denies", "denied")
text <- c("Has no anorexia", "She denies anorexia", "Has anorexia",
"Positive for Anorexia", "Denies anorexia")
text[!text_detect(text, negation)]
## [1] "Has anorexia" "Positive for Anorexia"
If you want a solution that uses only base R, use the following code instead:
pattern <- paste0("\\b(", paste(negation, collapse = "|"), ")\\b")
text[!grepl(pattern, text, ignore.case = TRUE)]
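For reuse, the base R approach can be wrapped in a small function. This is only a sketch: `keep_affirmed` is a hypothetical helper name, not part of any package, and it assumes the negation terms are plain words with no regular expression metacharacters in them.

```r
# Hypothetical helper: keep only the sentences that contain none of the
# negation terms as whole words (case-insensitive).
keep_affirmed <- function(sentences, negation) {
  # \\b marks a word boundary, so "no" cannot match inside "anorexia".
  pattern <- paste0("\\b(", paste(negation, collapse = "|"), ")\\b")
  sentences[!grepl(pattern, sentences, ignore.case = TRUE)]
}

text <- c("Has no anorexia", "She denies anorexia", "Has anorexia",
          "Positive for Anorexia", "Denies anorexia")
result <- keep_affirmed(text, c("no", "denies", "denied"))
print(result)
```

With `ignore.case = TRUE`, the capitalized "Denies anorexia" is dropped along with the lowercase forms.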
Answer 1 (score: 0)
You can also do this easily using the quanteda package. To get the character object to register as sentences, you need either punctuation or to segment the lines into elements of a character vector. Then char_trimsentences() can be used to remove the sentences that match a particular pattern when tokenized.
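library("quanteda")
# txt is assumed to hold the example sentences from the question, one per line
txt <- "Has no anorexia\nShe denies anorexia\nHas anorexia\nPositive for Anorexia"
readLines(textConnection(txt)) %>%
    char_trimsentences(exclude_pattern = c("\\bden\\w+\\b|\\bno\\b"))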
## text3 text4
## "Has anorexia" "Positive for Anorexia"
The regular expression ensures that you match words fitting the glob pattern "den*", and "no" only as a whole word (not as part of "anorexia").
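To see what that pattern does and does not match, here is a small check in base R rather than quanteda. The word list is made up for illustration, and `ignore.case = TRUE` is an assumption added here so that capitalized forms are covered.

```r
rx <- "\\bden\\w+\\b|\\bno\\b"
words <- c("no", "denies", "denied", "anorexia", "Anorexia", "dentist")

# \\bno\\b matches "no" only as a standalone word;
# \\bden\\w+\\b matches any word that starts with "den".
matched <- grepl(rx, words, ignore.case = TRUE)
names(matched) <- words
print(matched)
```

Note that "dentist" matches as well: the "den*" part is broader than the list of negation words, so it can exclude sentences that merely contain another word beginning with "den".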