When I run example-6-llda-learn.scala as follows, everything works fine:
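(The script also needs the standard TMT imports at the top; these are reproduced from memory of the stock Stanford TMT 0.4 example scripts for completeness, so double-check them against your copy of the toolbox:)

// tells Scala where to find the TMT classes
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;

import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;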
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);
val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // take terms with >= 3 characters
}

val text = {
  source ~>                              // read from the source file
  Column(4) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~>   // filter terms in < 4 docs
  TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >= 5 terms
}

// define fields from the dataset we are going to slice against
val labels = {
  source ~>                              // read from the source file
  Column(2) ~>                           // take column two, the year
  TokenizeWith(WhitespaceTokenizer()) ~> // turns label field into an array
  TermCounter() ~>                       // collect label counts
  TermMinimumDocumentCountFilter(10)     // filter labels in < 10 docs
}
val dataset = LabeledLDADataset(text, labels);
// define the model parameters
val modelParams = LabeledLDAModelParams(dataset);
// Name of the output model folder to generate
val modelPath = file("llda-cvb0-"+dataset.signature+"-"+modelParams.signature);
// Trains the model, writing to the given output path
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// or could use TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
But when I change the last line from:

TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);

to:

TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);

it no longer runs properly. The CVB0 method also uses a huge amount of memory: training on a corpus of 10,000 documents, each with roughly 10 labels, takes about 30 GB of RAM.
Answer 0 (score: 0)
I ran into the same situation, and I do think it is a bug. Look at GibbsLabeledLDA.scala in the edu.stanford.nlp.tmt.model.llda package under src/main/scala, starting at line 204:
val z = doc.labels(zI);
val pZ = (doc.theta(z)+topicSmoothing(z)) *
  (countTopicTerm(z)(term)+termSmooth) /
  (countTopic(z)+termSmoothDenom);
doc.labels is self-explanatory; doc.theta records the distribution (actually, the counts) over that document's labels, and it has the same length as doc.labels. zI is the index variable iterating over doc.labels, while the value z holds the actual label id. Here is the problem: a document may have only one label, say label 1000, so zI is 0 but z is 1000, and doc.theta(z) is then out of bounds.
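To make the mismatch concrete, here is a tiny self-contained Scala sketch; the names docLabels and docTheta are illustrative stand-ins, not the toolkit's actual fields:

object ThetaIndexingSketch {
  def main(args: Array[String]): Unit = {
    val docLabels = Array(1000)   // a document with a single label whose id is 1000
    val docTheta  = Array(7.0)    // per-document label counts: one entry per label, so length 1

    val zI = 0                    // position of the label inside docLabels
    val z  = docLabels(zI)        // the label id itself, i.e. 1000

    println(docTheta(zI))         // fine, prints 7.0
    println(docTheta(z))          // throws ArrayIndexOutOfBoundsException: index 1000
  }
}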
I think the fix is to change doc.theta(z) to doc.theta(zI).
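Under that reading, the snippet around line 204 would become something like the following (a sketch of the proposed fix, not a tested patch):

val z = doc.labels(zI);                       // actual label id
val pZ = (doc.theta(zI)+topicSmoothing(z)) *  // index theta by position zI, not by label id z
  (countTopicTerm(z)(term)+termSmooth) /
  (countTopic(z)+termSmoothDenom);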
(I have tried to check whether the results make sense; either way, this bug makes me less confident in this toolbox.)