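I am doing tag prediction and keyword extraction on StackExchange posts. I have ~36,000 posts, each with a title, body, and tags. I preprocess them to filter out noisy elements. After that, I run Labeled Latent Dirichlet Allocation (LLDA), using the implementation obtained here.
Looking at the output, the topic-keyword assignments in the first half are mostly very good, for example: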
Topic 0: Hardware
hardware 0.01417490938078998
apple 0.007714736647543383
macbook 0.004179344296774437
mac 0.003794235182959134
Topic 1: Mac
mac 0.09533364420104305
os 0.02075003721054881
mini 0.00682593613383348
macs 0.00435445224274711
Topic 2: PowerPC
powerpc 0.010548590021130589
ppc 0.007893573342376935
mac 0.0039821054483700795
ibook 0.003731934198917873
os 0.003471650527888505
However, the closer I get to the end of the output file, the stranger the topic-keyword assignments become:
Topic 976: Shopping-recommendation
difference 7.5409094336777E-5
intel 7.5409094336777E-5
ppc 7.5409094336777E-5
turn 7.5409094336777E-5
Topic 977: PCI-Card
difference 7.5409094336777E-5
intel 7.5409094336777E-5
ppc 7.5409094336777E-5
turn 7.5409094336777E-5
Topic 978: Tmux
difference 7.5409094336777E-5
intel 7.5409094336777E-5
ppc 7.5409094336777E-5
turn 7.5409094336777E-5
Topic 979:
difference 7.5409094336777E-5
intel 7.5409094336777E-5
ppc 7.5409094336777E-5
turn 7.5409094336777E-5
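Can someone explain why I end up with such wrong assignments? And why are the values so extremely low?
As mentioned, I have ~36,000 posts, and these are the values I used to run LLDA: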
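One sanity check on the repeated value, offered as a sketch rather than a diagnosis: assuming the tool reports the usual smoothed LDA estimate phi = (n_kw + beta) / (n_k + V * beta), a topic with no word assignments at all would get exactly 1/V for every word. The helper name `phi` and the vocabulary size inferred here are assumptions for illustration, not values read from the tool.

```python
def phi(n_kw, n_k, beta, vocab_size):
    """Smoothed topic-word probability (standard collapsed-Gibbs LDA estimate)."""
    return (n_kw + beta) / (n_k + vocab_size * beta)


observed = 7.5409094336777e-5  # value repeated in topics 976-979

# If a topic has no tokens assigned, phi collapses to 1 / vocab_size,
# so the reciprocal of the observed value hints at the vocabulary size.
implied_vocab = round(1 / observed)

# Probability of any word in a topic with zero counts, using beta = 0.1.
empty_topic = phi(n_kw=0, n_k=0, beta=0.1, vocab_size=implied_vocab)

print(implied_vocab, empty_topic)
```

The reciprocal comes out at 13261, and the zero-count probability matches the repeated value. That is consistent with those trailing topics never receiving any tokens, although it does not prove it.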
option.est = true;
option.alpha = 50.0 / 920; // 920 is the number of topics; 50.0 avoids integer division
option.beta = 0.1;
option.niters = 3000;
option.twords = 15;
option.nburnin = 350;
option.samplingLag = 256;
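For reference, the alpha setting follows the common 50/K heuristic. Below is a minimal sketch of that arithmetic, with K = 920 taken from the comment on `option.alpha`. The variable names are illustrative. The `//` line only mimics what an expression with two integer literals would give in a language that truncates integer division, and whether that applies here is an assumption.

```python
num_topics = 920  # K, from the comment on option.alpha

alpha = 50 / num_topics             # true (floating-point) division
alpha_truncated = 50 // num_topics  # result under integer division

print(alpha, alpha_truncated)
```

If the second number is what the sampler actually receives, the document-topic prior is effectively switched off, so it is worth confirming which value is in use.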
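I found little to no documentation on these values, so through trial and error I settled on the ones that gave the best results I managed to get. But perhaps someone with a better understanding could explain them to me and/or suggest which values would work best?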