Question

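I want to join a specific word with the word that follows it. I tried a bigram approach, which is too slow, and I also tried gregexpr, but I did not arrive at a good solution. For example: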

library(RWeka)  # provides NGramTokenizer and Weka_control
text <- "This approach isnt good enough."
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
BigramTokenizer(text)
[1] "This approach" "approach isnt" "isnt good"     "good enough"

What I really want is isnt_good as a single word in the text, that is, isnt joined with the word that follows it.

text
"This approach isnt_good enough."

Is there an efficient way to do this, so that the result can then be treated as unigrams? Thanks.

Answer 1

To extract all occurrences of the word "isnt" together with the word that follows it, you can do the following:

library(stringr)
pattern <- "isnt \\w+"
str_extract_all(text, pattern)

[[1]]
[1] "isnt good"

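It does basically the same thing as the example below (from the base package), at least when each string contains a single match, but I find the stringr solution more elegant and readable.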

> regmatches(text, regexpr(pattern, text))
[1] "isnt good"
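
Because regexpr stops at the first match in each string, a closer base R counterpart to str_extract_all would use gregexpr instead. A minimal sketch, where the sample string text2 is made up for illustration and is not from the question:

```r
# Hypothetical input containing two occurrences of "isnt"
text2 <- "This isnt good and that isnt bad."
pattern <- "isnt \\w+"

# gregexpr locates every match; regmatches then returns a list
# with one character vector per input string
all_matches <- regmatches(text2, gregexpr(pattern, text2))
all_matches[[1]]
# [1] "isnt good" "isnt bad"
```

Like str_extract_all, this returns a list, so the matches for the first string are in the first list element.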

Update

To replace isnt x with isnt_x, all you need is gsub from the base package.

gsub("isnt (\\w+)", "isnt_\\1", text)
[1] "This approach isnt_good enough."

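What this does is use a capture group: whatever matches inside the parentheses is copied into \\1 in the replacement. For a good introduction, see this page: http://www.regular-expressions.info/brackets.html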
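
The same capture-group idea could be wrapped in a small helper that works for any word. This is only a sketch: the name join_next_word is made up here, and it assumes that word contains no regex metacharacters. The \\b word boundary is an addition to the pattern above, so that the word is only matched when it starts a whole word.

```r
# Hypothetical helper: join `word` to the word that follows it with "_".
# Assumes `word` is a plain word without regex metacharacters.
join_next_word <- function(x, word) {
  # Group 1 is the word itself, group 2 is the word after it;
  # \\b stops the pattern from matching in the middle of a longer word
  pattern <- paste0("\\b(", word, ") (\\w+)")
  gsub(pattern, "\\1_\\2", x, perl = TRUE)
}

join_next_word("This approach isnt good enough.", "isnt")
# [1] "This approach isnt_good enough."

# gsub is vectorized, so a character vector works as well
join_next_word(c("It isnt bad", "That isnt so"), "isnt")
# [1] "It isnt_bad"  "That isnt_so"
```

Since gsub replaces every match, all occurrences in each string are joined in one pass, with no tokenizing step.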

Answer 2

How about this function?

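joinWords <- function(string, word){
  # Split on the literal text "word " (fixed = TRUE: no regex interpretation)
  y <- paste0(word, " ")
  x <- unlist(strsplit(string, y, fixed = TRUE))
  # Rejoin the pieces with "word_", so every occurrence is joined
  # and a string without the word is returned unchanged
  paste(x, collapse = paste0(word, "_"))
}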

> text <- "This approach isnt good enough."
> joinWords(text, "isnt")
# [1] "This approach isnt_good enough."
> joinWords("This approach might work for you", "might")
# [1] "This approach might_work for you"

Combine any word following a specific word
