Question

I am trying to do some topic modeling, but I want to use phrases where they exist rather than single words, i.e.

library(topicmodels)
library(tm)
my.docs = c('the sky is blue, hot sun', 'flowers,hot sun', 'black cats, bees, rats and mice')
my.corpus = Corpus(VectorSource(my.docs))
my.dtm = DocumentTermMatrix(my.corpus)
inspect(my.dtm)

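When I inspect my dtm, it splits everything into single words, but I want the phrases kept together, i.e. there should be one column for each of: "the sky is blue", "hot sun", "flowers", "black cats", "bees", "rats and mice".

How can I make the document-term matrix recognize phrases as well as words? They are comma-separated.

The solution needs to be efficient, because I want to run it on a large amount of data.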

Answer 1

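You could try an approach that uses a custom tokenizer. You define the multi-word terms you want as phrases (I am not aware of algorithmic code that does this step for you):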

tokenizing.phrases <- c("sky is blue", "hot sun", "black cats")

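Note that no stemming is done, so if you want both "black cats" and "black cat", you need to enter both variants. Case is ignored.

Then you need to create a function: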

    library(stringr)  # str_trim, str_detect, str_split, regex
    library(tm)       # MC_tokenizer

    phraseTokenizer <- function(x) {
      # Extract the plain text from the tm TextDocument object
      x <- as.character(x)
      if (length(x) == 0 || anyNA(x)) return("")
      x <- str_trim(paste(x, collapse = " "))

      # Case-insensitive check for each known phrase
      # (regex(..., ignore_case = TRUE) replaces the deprecated ignore.case())
      phrase.hits <- str_detect(x, regex(tokenizing.phrases, ignore_case = TRUE))

      if (any(phrase.hits)) {
        # Split only once, on the first phrase that matched; the recursive
        # calls below handle further occurrences in the two remaining pieces
        split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
        temp <- unlist(str_split(x, regex(split.phrase, ignore_case = TRUE), n = 2))
        out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
      } else {
        # No phrase left: fall back to tm's ordinary word tokenizer
        out <- MC_tokenizer(x)
      }

      # Drop the empty strings produced by the splits
      out[out != ""]
    }

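Then you can create the term-document matrix as usual, but this time you include the phrases by passing the tokenizer through the control argument.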

tdm <- TermDocumentMatrix(my.corpus, control = list(tokenize = phraseTokenizer))
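
Since the phrases in the question are already comma-separated, a much simpler tokenizer may be enough and no phrase list is needed. The following is only a minimal sketch in base R: `comma_tokenizer` and `phrase.counts` are names made up for illustration, not part of tm, and the dense `table` at the end is only there to show the result on the three sample documents.

```r
# Hypothetical helper (not part of tm): split one document on commas
# and treat every comma-separated piece as a single token.
comma_tokenizer <- function(x) {
  text <- paste(as.character(x), collapse = " ")
  phrases <- trimws(unlist(strsplit(text, ",", fixed = TRUE)))
  phrases[nzchar(phrases)]  # drop empty pieces, e.g. from trailing commas
}

my.docs <- c("the sky is blue, hot sun", "flowers,hot sun",
             "black cats, bees, rats and mice")
tokens <- lapply(my.docs, comma_tokenizer)

# Document-by-phrase counts, built with base R only
phrase.counts <- table(
  doc = rep(seq_along(tokens), lengths(tokens)),
  phrase = unlist(tokens)
)
print(phrase.counts)
```

The same function should also work as a tm tokenizer, e.g. `DocumentTermMatrix(VCorpus(VectorSource(my.docs)), control = list(tokenize = comma_tokenizer))`. As far as I know, recent tm versions ignore custom tokenizers for the `SimpleCorpus` that `Corpus(VectorSource(...))` returns, which is why this uses `VCorpus`. For large data you would want tm's sparse matrix rather than a dense `table`.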

Answer 2

Maybe have a look at this relatively recent publication on the topic:

http://web.engr.illinois.edu/~hanj/pdf/kdd13_cwang.pdf

They provide an algorithm for identifying phrases and for partitioning/tokenizing a document into those phrases.
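
To give a rough idea of what "identifying phrases" can mean, here is a far simpler stand-in: counting adjacent word pairs and keeping the frequent ones as phrase candidates. This is not the algorithm from the paper, just a naive sketch in base R; `count_bigrams` and the threshold of 2 are assumptions chosen for illustration.

```r
# Hypothetical helper: count adjacent word pairs (bigrams) across documents.
# Note that it ignores punctuation, so pairs can span a comma ("blue hot").
count_bigrams <- function(docs) {
  bigrams <- unlist(lapply(docs, function(doc) {
    words <- unlist(strsplit(tolower(doc), "[^a-z]+"))
    words <- words[nzchar(words)]
    if (length(words) < 2) return(character(0))
    paste(head(words, -1), tail(words, -1))  # word i joined with word i + 1
  }))
  sort(table(bigrams), decreasing = TRUE)
}

docs <- c("the sky is blue, hot sun", "flowers,hot sun",
          "black cats, bees, rats and mice")
counts <- count_bigrams(docs)

# Keep bigrams that occur at least twice as phrase candidates
candidates <- names(counts)[as.vector(counts) >= 2]
print(candidates)
```

On the three sample documents only "hot sun" occurs twice, so it is the only candidate. Candidates found this way could then be used as the phrase list for a custom tokenizer.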

Topic modeling in R using phrases rather than single words
