Question

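I have spent a few days working on topic models in R, and I am wondering whether I can do the following:

I would like R to build topics based on a predefined term list containing specific terms. I have already used this list to identify n-grams (RWeka) in the documents and to count only those terms that occur in my term list, using the following code: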

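library(tm)     # TermDocumentMatrix
library(RWeka)  # NGramTokenizer, Weka_control

terms=read.delim("TermList.csv", header=FALSE, stringsAsFactors=FALSE)

# tokenizer producing all n-grams of length 1 to 4
biTok=function(x) NGramTokenizer(x, Weka_control(min=1, max=4))

tdm=TermDocumentMatrix(data.corpus, control=list(tokenize=biTok))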

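Now I would like to use this list again to search for topics in the documents based only on the terms in my term list.

Example: take the sentence "These arrangements lead to higher team performance and better user satisfaction." I want the compound terms "team performance" and "user satisfaction" to appear within the topics, instead of having "team", "performance", "user" and "satisfaction" handled as single terms with topics built on those. This is why I need to use my predefined list.

Is there any way to define such a condition in R?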

Answer 1

Maybe something like this?

tokenizing.phrases <- c("team performance", "user satisfaction") # plus your others you have identified

Then load this function:

phraseTokenizer <- function(x) {
  require(stringr) # str_trim, str_detect, str_split; MC_tokenizer below comes from tm

  x <- as.character(x) # extract the plain text from the tm TextDocument object
  x <- str_trim(x)
  if (is.na(x)) return("")
  #warning(paste("doing:", x))
  # regex(..., ignore_case = TRUE) replaces the older ignore.case() helper
  phrase.hits <- str_detect(x, regex(tokenizing.phrases, ignore_case = TRUE))

  if (any(phrase.hits)) {
    # only split once on the first hit, so we don't have to worry about multiple occurrences of the same phrase
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    # warning(paste("split phrase:", split.phrase))
    temp <- unlist(str_split(x, regex(split.phrase, ignore_case = TRUE), 2))
    # tokenize the text on either side of the phrase recursively
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
  } else {
    out <- MC_tokenizer(x)
  }

  # get rid of any extraneous empty strings, which can happen if a phrase occurs just before a punctuation mark
  out[out != ""]
}
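To see what phrase-preserving tokenization should yield for the example sentence from the question, here is a minimal base-R sketch. It is not the function above but a simpler stand-in technique: it glues each listed phrase together with underscores before splitting on non-word characters. The helper name `protect_phrases` is hypothetical, and the sketch assumes the input text itself contains no underscores.

```r
# Hypothetical helper (not part of the answer's code): tokenize text while
# keeping the listed multi-word phrases intact. Uses base R only.
protect_phrases <- function(text, phrases) {
  text <- tolower(text)
  phrases <- tolower(phrases)
  # Handle longer phrases first so a short phrase cannot break up a longer one.
  phrases <- phrases[order(nchar(phrases), decreasing = TRUE)]
  for (p in phrases) {
    # Replace the spaces inside each literal occurrence with underscores.
    glued <- gsub(" ", "_", p, fixed = TRUE)
    text <- gsub(p, glued, text, fixed = TRUE)
  }
  # Split on runs of characters that are neither alphanumeric nor underscore.
  tokens <- unlist(strsplit(text, "[^[:alnum:]_]+"))
  # Restore the spaces inside the protected phrases and drop empty tokens.
  tokens <- gsub("_", " ", tokens, fixed = TRUE)
  tokens[tokens != ""]
}

example <- "These arrangements lead to higher team performance and better user satisfaction."
tokens <- protect_phrases(example, c("team performance", "user satisfaction"))
print(tokens)
```

Like the function above, this matches phrases as plain substrings, so a listed phrase would also be found inside a longer word; adding word-boundary checks would be a reasonable refinement.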

Then, with tokenizing.phrases predefined, create the term-document matrix as usual:

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))

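When you run your topic model function, it should work with the bigrams you have identified as part of the model (albeit with a longer list, depending on what you have identified).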
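If the topics should be built only from the listed terms, as the question asks, the counts handed to the topic model need to be restricted to that list. The following base-R sketch illustrates the idea on made-up documents; `docs`, `term.list` and `count_phrase` are hypothetical names used for illustration only.

```r
# Hypothetical toy documents and term list, for illustration only.
docs <- c(
  d1 = "These arrangements lead to higher team performance and better user satisfaction.",
  d2 = "User satisfaction rose, and user satisfaction surveys confirmed it.",
  d3 = "Team performance depends on clear goals."
)
term.list <- c("team performance", "user satisfaction")

# Count non-overlapping, case-insensitive occurrences of one phrase in one text.
count_phrase <- function(text, phrase) {
  hits <- gregexpr(tolower(phrase), tolower(text), fixed = TRUE)[[1]]
  sum(hits > 0) # gregexpr returns -1 when there is no match
}

# Documents in rows, listed terms in columns; all other words are ignored.
counts <- sapply(term.list, function(p) sapply(docs, count_phrase, phrase = p))
print(counts)
```

With tm itself, the `dictionary` entry of the `control` list is the usual way to restrict a term-document matrix to a given character vector of terms, and it can be combined with the custom tokenizer. Which topic model function then consumes the matrix is left open in this thread; `LDA()` from the topicmodels package would be one candidate.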

Topic modeling in R: building topics based on a predefined term list
