Labeled LDA learning in the Stanford Topic Modeling Toolbox

Date: 2013-05-31 16:00:11

Tags: stanford-nlp topic-modeling

When I run example-6-llda-learn.scala as follows, everything works fine:

val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);

val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // take terms with >=3 characters
}

val text = {
  source ~>                              // read from the source file
  Column(4) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~>   // filter terms in <4 docs
  TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}

// define fields from the dataset we are going to slice against
val labels = {
  source ~>                              // read from the source file
  Column(2) ~>                           // take column two, the year
  TokenizeWith(WhitespaceTokenizer()) ~> // turns label field into an array
  TermCounter() ~>                       // collect label counts
  TermMinimumDocumentCountFilter(10)     // filter labels in < 10 docs
}

val dataset = LabeledLDADataset(text, labels);

// define the model parameters
val modelParams = LabeledLDAModelParams(dataset);

// Name of the output model folder to generate
val modelPath = file("llda-cvb0-"+dataset.signature+"-"+modelParams.signature);

// Trains the model, writing to the given output path
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// or could use TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);

But when I change the last line from:

    TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);

to:

    TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);

it no longer works properly.

Meanwhile, the CVB0 method consumes a huge amount of memory: training on a corpus of 10,000 documents, each with about 10 labels, takes around 30 GB.

1 answer:

Answer 0 (score: 0)

I ran into the same situation, and I do believe it is a bug. Look at GibbsLabeledLDA.scala under src/main/scala, in the package edu.stanford.nlp.tmt.model.llda, starting at line 204:

val z = doc.labels(zI);

val pZ = (doc.theta(z)+topicSmoothing(z)) *
         (countTopicTerm(z)(term)+termSmooth) /
         (countTopic(z)+termSmoothDenom);

doc.labels is self-explanatory, and doc.theta records the distribution (actually the counts) over that document's labels; it has the same size as doc.labels.

zI is the index variable iterating over doc.labels, while z holds the actual label number. Here is the problem: a document may have only one label, say 1000, so zI is 0 while z is 1000, and doc.theta(z) is then out of range.

I think the fix is to change doc.theta(z) to doc.theta(zI).
(I have tried to check whether the results make sense; in any case, this bug makes me less confident in this toolbox.)
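The indexing mistake can be reproduced in isolation. The sketch below is a minimal, hypothetical model of the sampler's data (the `Doc` case class and the label value 1000 are assumptions for illustration, not the toolbox's actual types): `theta` has one entry per label *position*, so indexing it by the label value `z` instead of the position `zI` throws as soon as a label's numeric id exceeds the number of labels on the document.

```scala
// Minimal sketch with hypothetical types (not the toolbox's API).
// theta is parallel to labels: one count per label position.
case class Doc(labels: Array[Int], theta: Array[Double])

object LabelIndexBug {
  def main(args: Array[String]): Unit = {
    // A document with a single label whose numeric id is 1000.
    val doc = Doc(labels = Array(1000), theta = Array(7.0))

    val zI = 0                 // position within doc.labels
    val z  = doc.labels(zI)    // actual label number: 1000

    // Buggy access, as on line 204: theta has length 1, so index 1000 fails.
    val buggy = scala.util.Try(doc.theta(z))
    println(buggy.isFailure)   // true

    // Fixed access: index theta by the position zI, not the label value z.
    println(doc.theta(zI))     // 7.0
  }
}
```

This also explains why the CVB0 trainer does not crash: only the Gibbs sampler takes this code path, so the bug surfaces only with TrainGibbsLabeledLDA.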