Question

text = c('the nurse was extremely helpful', 'she was truly a gem', 'helping', 'no issue', 'not bad')

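I want to extract 1-gram tokens for most words, and 2-gram tokens for words such as extremely, no, and not.

For example, the tokens I get should look like this: the, nurse, was, extremely helpful, she, truly, gem, helping, no issue, not bad

These are the terms that should show up in the term-document matrix.

Thanks for your help!!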

Answer 1

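Here is a possible solution (assuming you don't want to split only on c("extremely", "no", "not"), but also want to include words similar to them). The package qdapDictionaries has dictionaries for amplification.words (like "extremely"), negation.words (like "no" and "not"), and more.

Below is an example of how to split on a space, except when the space follows a word in a predefined vector (here the vector is built from amplification.words, negation.words, and deamplification.words in qdapDictionaries). You can change the definition of no_split_words if you want to use a more customized list of words.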

Performing the split

    library(stringr)
    library(qdapDictionaries)

    text <- c('the nurse was extremely helpful', 'she was truly a gem',
              'helping', 'no issue', 'not bad')

    # define list of words after which we don't want to split on a space
    no_split_words <- c(amplification.words, negation.words, deamplification.words)

    # collapse words into the form "word1|word2|...|wordn"
    regex_or <- paste(no_split_words, collapse = "|")

    # regex that splits on a space unless the previous word is in no_split_words
    # (paste0 rather than paste, so no stray spaces end up inside the pattern)
    split_regex <- regex(paste0("((?<!", regex_or, "))\\s"))

    # perform split
    str_split(text, split_regex)

    # output
    # [[1]]
    # [1] "the"               "nurse"             "was"               "extremely helpful"
    # [[2]]
    # [1] "she"     "was"     "truly a" "gem"
    # [[3]]
    # [1] "helping"
    # [[4]]
    # [1] "no issue"
    # [[5]]
    # [1] "not bad"
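For a customized word list, the same idea can be sketched in base R without any packages. This is only an illustration under assumptions: the four-word no_split_words vector below is hypothetical, the words are assumed to contain no regex metacharacters, and base strsplit with perl = TRUE is swapped in for str_split. Each word gets its own \\b so that a word merely ending in "no" (such as "piano") is still split normally.

```r
# Hypothetical custom list: the space after any of these words is kept
no_split_words <- c("extremely", "truly", "no", "not")

text <- c("the nurse was extremely helpful", "she was truly a gem",
          "helping", "no issue", "not bad")

# Build "\\bextremely|\\btruly|\\bno|\\bnot". Every branch is a literal word,
# so each top-level lookbehind branch has a fixed length, as PCRE requires.
lookbehind <- paste0("\\b", no_split_words, collapse = "|")

# Split on whitespace unless it directly follows one of the listed words
split_regex <- paste0("(?<!", lookbehind, ")\\s")

tokens <- strsplit(text, split_regex, perl = TRUE)
print(tokens)
```

With this list, "truly a" stays together just as in the output above; removing "truly" from the vector would split it into "truly" and "a".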

Creating the dtm

(Assuming the code block above has already been run; this step uses tidytext.)
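A minimal base R sketch of this step, not the answer's own code: it assumes the tokens are held in a list with one character vector per document (copied here from the split output shown earlier) and simply counts each term per document. The doc1..doc5 row names are made up for illustration.

```r
# Tokens per document, as produced by the split step
tokens <- list(
  c("the", "nurse", "was", "extremely helpful"),
  c("she", "was", "truly a", "gem"),
  "helping",
  "no issue",
  "not bad"
)

# Vocabulary: every distinct term, including the two-word ones
terms <- sort(unique(unlist(tokens)))

# For each document, count how often each vocabulary term occurs
counts <- vapply(
  tokens,
  function(tok) tabulate(match(tok, terms), nbins = length(terms)),
  integer(length(terms))
)

# vapply returns terms x documents; transpose to documents x terms
dtm <- t(counts)
dimnames(dtm) <- list(paste0("doc", seq_along(tokens)), terms)
print(dtm)
```

The result is a plain integer matrix with one row per document and one column per term, so "extremely helpful", "no issue", and "not bad" each appear as a single column.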

TextMining in R - extract 2-grams only for certain words, 1-grams for the rest
