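I am using the MALLET library for topic modeling. My dataset is at the path filePath, and the CsvIterator seems to read it correctly, because model.getData() has about 27,000 rows, which matches my dataset. I wrote a loop that prints the instances and topic sequences of the first 10 documents, but the length of tokens is 0. Where did I go wrong?
Below that, I want to show the top 5 words of each topic together with its proportion for the first 10 documents, but the output is the same for every document.
Example console output: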
---- document 0
0 0.200 com (1723) twitter (1225) http (871) cbr (688) canberra (626)
1 0.200 com (981) twitter (901) day (205) may (159) wed (156)
2 0.200 twitter (1068) com (947) act (433) actvcc (317) canberra (302)
3 0.200 http (1039) canberra (841) work (378) dlvr (313) com (228)
4 0.200 com (1185) www (1074) http (831) news (708) canberratimes (560)
---- document 1
0 0.200 com (1723) twitter (1225) http (871) cbr (688) canberra (626)
1 0.200 com (981) twitter (901) day (205) may (159) wed (156)
2 0.200 twitter (1068) com (947) act (433) actvcc (317) canberra (302)
3 0.200 http (1039) canberra (841) work (378) dlvr (313) com (228)
4 0.200 com (1185) www (1074) http (831) news (708) canberratimes (560)
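As far as I know, an LDA model describes how each document is generated and assigns the words of each document to topics. So why is the result the same for every document?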
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new CharSequenceLowercase());
pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
//stoplists/en.txt
pipeList.add(new TokenSequenceRemoveStopwords(new File(pathStopWords), "UTF-8", false, false, false));
pipeList.add(new TokenSequence2FeatureSequence());
InstanceList instances = new InstanceList(new SerialPipes(pipeList));
Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
//header of my data set
// row,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
CsvIterator csvIterator = new CsvIterator(fileReader,
        Pattern.compile("^(\\d+)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*([^,]*)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*[^,]*$"),
        2, 0, 1); // data, label, name fields
instances.addThruPipe(csvIterator);
int numTopics = 5;
ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);
model.addInstances(instances);
model.setNumThreads(2);
model.setNumIterations(50);
model.estimate();
Alphabet dataAlphabet = instances.getDataAlphabet();
ArrayList<TopicAssignment> arrayTopics = model.getData();
for (int i = 0; i < 10; i++) {
    System.out.println("---- document " + i);
    FeatureSequence tokens = (FeatureSequence) model.getData().get(i).instance.getData();
    LabelSequence topics = model.getData().get(i).topicSequence;
    Formatter out = new Formatter(new StringBuilder(), Locale.US);
    for (int position = 0; position < tokens.getLength(); position++) {
        out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)),
                topics.getIndexAtPosition(position));
    }
    System.out.println(out);
    double[] topicDistribution = model.getTopicProbabilities(i);
    ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();
    for (int topic = 0; topic < numTopics; topic++) {
        Iterator<IDSorter> iterator = topicSortedWords.get(topic).iterator();
        out = new Formatter(new StringBuilder(), Locale.US);
        out.format("%d\t%.3f\t", topic, topicDistribution[topic]);
        int rank = 0;
        while (iterator.hasNext() && rank < 5) {
            IDSorter idCountPair = iterator.next();
            out.format("%s (%.0f) ", dataAlphabet.lookupObject(idCountPair.getID()), idCountPair.getWeight());
            rank++;
        }
        System.out.println(out);
    }
    StringBuilder topicZeroText = new StringBuilder();
    Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();
    int rank = 0;
    while (iterator.hasNext() && rank < 5) {
        IDSorter idCountPair = iterator.next();
        topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID()) + " ");
        rank++;
    }
}
Answer 0 (score: 2)
Topics are defined at the model level, not at the document level. They should be the same for all documents.
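To illustrate the point: the top-word lists come from model-wide counts, so only the proportion column can differ between documents. The sketch below is not MALLET code. It assumes the per-document proportion is the usual smoothed estimate (n_t + alpha_t) / (N + alphaSum), with the symmetric prior alphaSum = 1.0 from the question split evenly over 5 topics; the class and method names are made up for the illustration.

```java
import java.util.Arrays;
import java.util.Locale;

class SmoothedTopicProportions {
    // Assumed formula: (count_t + alpha_t) / (docLength + alphaSum),
    // where alpha_t = alphaSum / numTopics (symmetric prior).
    static double[] proportions(int[] topicCounts, double alphaSum) {
        int numTopics = topicCounts.length;
        double alpha = alphaSum / numTopics;
        int docLength = Arrays.stream(topicCounts).sum();
        double[] result = new double[numTopics];
        for (int t = 0; t < numTopics; t++) {
            result[t] = (topicCounts[t] + alpha) / (docLength + alphaSum);
        }
        return result;
    }

    // Formats the proportions the same way the question's loop does (%.3f).
    static String format(double[] p) {
        StringBuilder sb = new StringBuilder();
        for (double v : p) {
            sb.append(String.format(Locale.US, "%.3f ", v));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // A document with no tokens: every topic count is zero.
        System.out.println(format(proportions(new int[5], 1.0)));
        // A hypothetical document with 4 tokens: 3 in topic 0, 1 in topic 3.
        System.out.println(format(proportions(new int[] {3, 0, 0, 1, 0}, 1.0)));
    }
}
```

Under that assumption, a document whose token sequence is empty gets exactly 1/5 per topic, which is consistent with both the 0.200 rows and the token length of 0 reported in the question.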
It looks like all of your text is URLs. Adding a PrintInputPipe to the import sequence might help with debugging.
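One way to check what actually reaches the tokenizer, without MALLET on the classpath, is to run the two regular expressions from the question by hand. The sketch below uses only java.util.regex. The two sample rows are invented for illustration (the real data is not shown in the question), and the class and method names are hypothetical. Inside the pipeline itself, MALLET's PrintInputAndTarget pipe may be what the answer means by PrintInputPipe, though that is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImportRegexCheck {
    // Row pattern copied from the question's CsvIterator call.
    static final Pattern ROW = Pattern.compile(
            "^(\\d+)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*([^,]*)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*[^,]*$");
    // Token pattern copied from the question's CharSequence2TokenSequence pipe.
    static final Pattern TOKEN = Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");

    // Returns capture group 2 (the field used as data), or null if the row does not match.
    static String dataField(String row) {
        Matcher m = ROW.matcher(row);
        return m.find() ? m.group(2) : null;
    }

    // Lowercases the text and collects every TOKEN match, mimicking the first two pipes.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical rows following the header
        // row,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
        String[] rows = {
            "1,Canberra,alice,#cbr,good morning canberra,3,2015-05-06,2,0",
            "2,,bob,,check http://t.co/abc,0,2015-05-06,0,0" // empty location and hashtags
        };
        for (String row : rows) {
            String data = dataField(row);
            System.out.println("data field: " + data);
            System.out.println("tokens:     " + (data == null ? "(no match)" : tokens(data)));
        }
    }
}
```

With these invented rows, the second one has empty location and hashtag fields. Because `[,]*` swallows both commas around an empty field, every later field shifts and group 2 lands on the date, which contains no letters and so yields no tokens. Whether the real dataset contains such rows is something only a check like this against actual lines can confirm.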