Input values for Labeled Latent Dirichlet Allocation

Time: 2014-03-24 12:39:00

Tags: java machine-learning text-analysis topic-modeling

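I am doing tag prediction and keyword extraction on StackExchange posts. I have ~36,000 posts, each with a title, a body, and tags. I preprocess them to filter out noisy elements. After that, I run Labeled Latent Dirichlet Allocation (LLDA), using the implementation obtained here.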

Looking at the output, the topic-keyword assignments in the first part of the file are mostly very good, for example:

Topic 0: Hardware
 hardware 0.01417490938078998
 apple  0.007714736647543383
 macbook    0.004179344296774437
 mac    0.003794235182959134

Topic 1: Mac
 mac    0.09533364420104305
 os 0.02075003721054881
 mini   0.00682593613383348
 macs   0.00435445224274711

Topic 2: PowerPC
 powerpc    0.010548590021130589
 ppc    0.007893573342376935
 mac    0.0039821054483700795
 ibook  0.003731934198917873
 os 0.003471650527888505

However, the closer I get to the end of the output file, the stranger the topic-keyword assignments become:

Topic 976: Shopping-recommendation
difference  7.5409094336777E-5
intel   7.5409094336777E-5
ppc 7.5409094336777E-5
turn    7.5409094336777E-5

Topic 977: PCI-Card
difference  7.5409094336777E-5
intel   7.5409094336777E-5
ppc 7.5409094336777E-5
turn    7.5409094336777E-5

Topic 978: Tmux
difference  7.5409094336777E-5
intel   7.5409094336777E-5
ppc 7.5409094336777E-5
turn    7.5409094336777E-5

Topic 979:
difference  7.5409094336777E-5
intel   7.5409094336777E-5
ppc 7.5409094336777E-5
turn    7.5409094336777E-5

Can someone explain why I end up with such wrong assignments? And why are the values so extremely low?
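A quick arithmetic check on the repeated value may help here. The sketch below assumes the implementation uses the usual smoothed estimate from collapsed Gibbs sampling, phi = (n_wk + beta) / (n_k + V * beta). That is an assumption about the library and is not confirmed anywhere in this post, and the class and method names are hypothetical. Under that assumption, a topic with no words assigned to it (n_wk = n_k = 0) gives every word the same probability 1/V. The reciprocal of 7.5409094336777E-5 is almost exactly 13261, which would be consistent with a vocabulary of 13,261 words and with topics 976 to 979 never having received any tokens.

```java
// Hypothetical helper for illustration; not part of any LLDA library.
class EmptyTopicCheck {

    // Smoothed topic-word probability: (n_wk + beta) / (n_k + V * beta).
    // Assumed estimator; the library's actual formula is not shown in the post.
    static double phi(long wordCountInTopic, long topicTotal, long vocabSize, double beta) {
        return (wordCountInTopic + beta) / (topicTotal + vocabSize * beta);
    }

    // Vocabulary size implied by a uniform probability p = 1/V.
    static long impliedVocabSize(double uniformProbability) {
        return Math.round(1.0 / uniformProbability);
    }

    public static void main(String[] args) {
        double observed = 7.5409094336777E-5; // value repeated in topics 976-979
        long vocabSize = impliedVocabSize(observed);
        System.out.println("Implied vocabulary size: " + vocabSize);
        // With zero counts, beta cancels out and the result is 1 / vocabSize.
        System.out.println("phi for an empty topic:  " + phi(0, 0, vocabSize, 0.1));
    }
}
```

If this reading is right, the identical word lists would just be the first words of a uniform distribution, not learned keywords. Comparing the implied vocabulary size with the actual one would confirm or rule this out.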

As mentioned, I have ~36,000 posts, and these are the values I use to run LLDA:

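option.est = true;           // estimate a new model
option.alpha = 50.0 / 920;   // document-topic prior; 920 is the number of topics (50/920 with int literals would be 0 in Java)
option.beta = 0.1;           // topic-word prior
option.niters = 3000;        // number of Gibbs sampling iterations
option.twords = 15;          // number of top words printed per topic
option.nburnin = 350;        // burn-in iterations
option.samplingLag = 256;    // iterations between samples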
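One thing worth checking, assuming these assignments are Java source and `option.alpha` is a `double` (the post does not show the declaration, so this is an assumption): an expression written with two integer literals, such as `50/920`, is evaluated with integer division and yields 0 before it is widened to `double`. A minimal sketch of the difference, with hypothetical class and method names:

```java
// Hypothetical demonstration class; not part of any LLDA library.
class AlphaDivisionCheck {

    // int / int: the quotient is truncated to 0 before widening to double.
    static double integerDivisionAlpha(int numTopics) {
        return 50 / numTopics;
    }

    // double / int: the int operand is promoted, so the fraction is kept.
    static double floatingPointAlpha(int numTopics) {
        return 50.0 / numTopics;
    }

    public static void main(String[] args) {
        int numTopics = 920;
        System.out.println("50 / K   = " + integerDivisionAlpha(numTopics));
        System.out.println("50.0 / K = " + floatingPointAlpha(numTopics));
    }
}
```

Whether the library accepts, rejects, or silently replaces an alpha of 0 is not established by anything in this post, so printing the value it actually uses is the safest check.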

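I found little to no documentation on these parameters, so I arrived at these values by trial and error; they gave the best results I have managed to get. Could someone with a better understanding explain what they do and/or suggest which values would work best?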

0 answers:

No answers