I am using the Stanford Topic Modeling Toolbox 0.4.0 for LDA, and I noticed that if I want to remove the standard English stop words, I can use StopWordFilter("en") as the last step of the tokenizer. But how do I actually use it?
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;
import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);
val tokenizer = {
SimpleEnglishTokenizer() ~> // tokenize on space and punctuation
CaseFolder() ~> // lowercase everything
WordsAndNumbersOnlyFilter() ~> // ignore non-words and non-numbers
MinimumLengthFilter(3) // take terms with >=3 characters
StopWordFilter("en") // how to use it? it's not working.
}
val text = {
source ~> // read from the source file
Column(4) ~> // select column containing text
TokenizeWith(tokenizer) ~> // tokenize with tokenizer above
TermCounter() ~> // collect counts (needed below)
TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
TermDynamicStopListFilter(30) ~> // filter out 30 most common terms
DocumentMinimumLengthFilter(5) // take only docs with >=5 terms
}
// turn the text into a dataset ready to be used with LDA
val dataset = LDADataset(text);
// define the model parameters
val params = LDAModelParams(numTopics = 30, dataset = dataset,
topicSmoothing = 0.01, termSmoothing = 0.01);
// Name of the output model folder to generate
val modelPath = file("lda-"+dataset.signature+"-"+params.signature);
// Trains the model: the model (and intermediate models) are written to the
// output folder. If a partially trained model with the same dataset and
// parameters exists in that folder, training will be resumed.
TrainCVB0LDA(params, dataset, output=modelPath, maxIterations=1000);
// To use the Gibbs sampler for inference, instead use
// TrainGibbsLDA(params, dataset, output=modelPath, maxIterations=1500);
Answer 0 (score: 0)
I think it should go inside val text instead:
val text = {
source ~> // read from the source file
Column(4) ~> // select column containing text
TokenizeWith(tokenizer) ~> // tokenize with tokenizer above
TermCounter() ~> // collect counts (needed below)
TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
StopWordFilter("en") ~> // filter out standard English stop words
TermDynamicStopListFilter(30) ~> // filter out 30 most common terms
DocumentMinimumLengthFilter(5) // take only docs with >=5 terms
}
Instead of StopWordFilter("en") ~>, you can also use your own stop list:
TermStopListFilter(List("a", "and", "but", "if")) ~>
Just add all of your chosen stop words to this list.
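Putting it together, a minimal sketch of the full pipeline with an inline stop list (same stages as above; the words in the list are placeholders to replace with your own):
val text = {
  source ~>                                             // read from the source file
  Column(4) ~>                                          // select column containing text
  TokenizeWith(tokenizer) ~>                            // tokenize with tokenizer above
  TermCounter() ~>                                      // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~>                  // filter terms in <4 docs
  TermStopListFilter(List("a", "and", "but", "if")) ~>  // drop terms on the custom stop list
  TermDynamicStopListFilter(30) ~>                      // filter out 30 most common terms
  DocumentMinimumLengthFilter(5)                        // take only docs with >=5 terms
}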
Answer 1 (score: 0)
If stopwords.txt contains one stop word per line (for example, the SMART stop list downloadable from http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop), you can use:
val text = {
source ~> // read from the source file
Column(3) ~> // select column containing text
TokenizeWith(tokenizer) ~> // tokenize with tokenizer above
TermCounter() ~> // collect counts (needed below)
TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
TermStopListFilter(scala.io.Source.fromFile("stopwords.txt").getLines().toList) ~>
TermDynamicStopListFilter(30) ~> // filter out 30 most common terms
DocumentMinimumLengthFilter(5) // take only docs with >=5 terms
}
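One small caveat with this approach: scala.io.Source.fromFile("stopwords.txt").getLines().toList never closes the underlying file handle. A sketch that reads the list up front and closes the file explicitly (stopWords is a name introduced here for illustration):
val stopWords = {
  val src = scala.io.Source.fromFile("stopwords.txt")
  try src.getLines().map(_.trim).filter(_.nonEmpty).toList  // one stop word per line
  finally src.close()                                       // release the file handle
}
// then, in the pipeline above:
// TermStopListFilter(stopWords) ~>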
Answer 2 (score: -1)
I am not an expert, but I think you should do it like this:
val tokenizer = {
SimpleEnglishTokenizer() ~> // tokenize on space and punctuation
CaseFolder() ~> // lowercase everything
WordsAndNumbersOnlyFilter() ~> // ignore non-words and non-numbers
MinimumLengthFilter(3) ~> // take terms with >=3 characters
new StopWordFilter("en") // remove stop words for the english language
}
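For what it's worth, the likely culprit in the original snippet is the missing ~> after MinimumLengthFilter(3): a Scala block evaluates to its last expression, so without the connector the tokenizer was effectively just the stop word filter on its own, and the earlier stages were silently discarded.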