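I'm trying to use the Stanford Topic Modeling Toolbox. I downloaded the "tmt-0.4.0.jar" file from http://nlp.stanford.edu/software/tmt/tmt-0.4/ and tried some of the examples. Examples 0 and 1 work fine, but when I run example 2 (with no code changes) I get the following exception: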
[cell] loading pubmed-oa-subset.csv.term-counts.cache.70108071.gz
[Concurrent] 32 permits
Exception in thread "Thread-3" java.lang.ArrayIndexOutOfBoundsException: -1
    at scalanlp.stage.text.TermCounts$class.getDF(TermFilters.scala:64)
    at scalanlp.stage.text.TermCounts$$anon$2.getDF(TermFilters.scala:84)
    at scalanlp.stage.text.TermMinimumDocumentCountFilter$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$7.apply(TermFilters.scala:172)
    at scalanlp.stage.text.TermMinimumDocumentCountFilter$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$7.apply(TermFilters.scala:172)
    at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:390)
    at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:388)
    at scala.collection.Iterator$class.foreach(Iterator.scala:660)
    at scala.collection.Iterator$$anon$22.foreach(Iterator.scala:382)
    at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:41)
    at scala.collection.IterableViewLike$$anon$5.foreach(IterableViewLike.scala:82)
    at scala.collection.TraversableOnce$class.size(TraversableOnce.scala:104)
    at scala.collection.IterableViewLike$$anon$5.size(IterableViewLike.scala:82)
    at scalanlp.stage.text.DocumentMinimumLengthFilter.filter(DocumentFilters.scala:31)
    at scalanlp.stage.text.DocumentMinimumLengthFilter.filter(DocumentFilters.scala:28)
    at scalanlp.stage.generic.Filter$$anonfun$apply$1.apply(Filter.scala:38)
    at scalanlp.stage.generic.Filter$$anonfun$apply$1.apply(Filter.scala:38)
    at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:390)
    at edu.stanford.nlp.tmt.data.concurrent.Concurrent$$anonfun$map$2.apply(Concurrent.scala:100)
    at edu.stanford.nlp.tmt.data.concurrent.Concurrent$$anonfun$map$2.apply(Concurrent.scala:88)
    at edu.stanford.nlp.tmt.data.concurrent.Concurrent$$anon$4.run(Concurrent.scala:45)
Why am I getting this exception, and how can I fix it? Thanks a lot for your help!
PS: The code is identical to example 2 on the website:
// Stanford TMT Example 2 - Learning an LDA model
// http://nlp.stanford.edu/software/tmt/0.4/
// tells Scala where to find the TMT classes
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;
import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);
val tokenizer = {
  SimpleEnglishTokenizer() ~> // tokenize on space and punctuation
  CaseFolder() ~> // lowercase everything
  WordsAndNumbersOnlyFilter() ~> // ignore non-words and non-numbers
  MinimumLengthFilter(3) // take terms with >=3 characters
}
val text = {
  source ~> // read from the source file
  Column(4) ~> // select column containing text
  TokenizeWith(tokenizer) ~> // tokenize with tokenizer above
  TermCounter() ~> // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
  TermDynamicStopListFilter(30) ~> // filter out 30 most common terms
  DocumentMinimumLengthFilter(5) // take only docs with >=5 terms
}
// turn the text into a dataset ready to be used with LDA
val dataset = LDADataset(text);
// define the model parameters
val params = LDAModelParams(numTopics = 30, dataset = dataset,
  topicSmoothing = 0.01, termSmoothing = 0.01);
// Name of the output model folder to generate
val modelPath = file("lda-"+dataset.signature+"-"+params.signature);
// Trains the model: the model (and intermediate models) are written to the
// output folder. If a partially trained model with the same dataset and
// parameters exists in that folder, training will be resumed.
TrainCVB0LDA(params, dataset, output=modelPath, maxIterations=1000);
// To use the Gibbs sampler for inference, instead use
// TrainGibbsLDA(params, dataset, output=modelPath, maxIterations=1500);
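Answer 0 (score: 1)
The tool's author has already posted an answer to this; see the link below.
This usually happens when you have a stale .cache file. Unfortunately, the error message is not particularly helpful. Try deleting the cache in the run folder and running again.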
https://lists.cs.princeton.edu/pipermail/topic-models/2012-July/001979.html
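If you would rather not hunt for the cache files by hand, here is a minimal sketch of the cleanup in Scala. It is not part of TMT: `CleanTmtCache`, `isCacheFor` and `deleteStaleCaches` are hypothetical names of my own. It also assumes the cache files live in the folder you run from and are named like the one in the log above (`pubmed-oa-subset.csv.term-counts.cache.70108071.gz`), that is, the data file name followed by a stage name, `.cache.`, a hash and `.gz`. Check the file names in your own folder before deleting anything.

```scala
import java.io.File

object CleanTmtCache {
  // Assumption: cache files are named "<data file>.<stage>.cache.<hash>.gz",
  // as in the "[cell] loading ..." log line from the question.
  def isCacheFor(dataFileName: String, candidate: String): Boolean =
    candidate.startsWith(dataFileName + ".") && candidate.contains(".cache.")

  // Deletes the matching cache files in `dir` and returns the names
  // of the files that were actually removed.
  def deleteStaleCaches(dir: File, dataFileName: String): Seq[String] = {
    val files = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    files
      .filter(f => f.isFile && isCacheFor(dataFileName, f.getName))
      .filter(_.delete())
      .map(_.getName)
  }

  def main(args: Array[String]): Unit = {
    // Run from the folder that contains pubmed-oa-subset.csv.
    val removed = deleteStaleCaches(new File("."), "pubmed-oa-subset.csv")
    removed.foreach(name => println("deleted " + name))
  }
}
```

After the stale files are gone, running example 2 again should make TMT rebuild the term counts from the CSV instead of loading the old cache. The data file itself is left alone because its name does not contain `.cache.`.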