"热门词汇"实际上意味着mahout clusterdump的输出?

Time: 2014-03-28 08:43:14

Tags: cluster-computing mahout

I am new to the Mahout environment. I ran the following command and got this output:

/opt/hadoop/mahout-distribution-0.9/bin$ mahout clusterdump \
>    -d /app/hadoop/dmacs/training_set1_sparseout/dictionary.file-0 \
>    -dt sequencefile \
>    -i /app/hadoop/dmacs/training_set1_sparseout/kmeans-clusters/clusters-2-final \
>    -n 20 \
>    -b 100 \
>    -o /app/hadoop/dmacs/kmeans_final_output/cdump.txt \
>    -dm org.apache.mahout.common.distance.CosineDistanceMeasure   

:VL-1480{n=150 c=[1000062,3,2005:0.098, 1000079,1,2002:0.080, 1000079,2,2002:0.078, 1000079,3,2002:0.
    Top Terms:
            25                                      =>  10.670724073251089
            31                                      =>   7.999464999039968
            1664010,5,2005                          =>  1.2396535428365072
            2439493,1,2003                          =>   1.184131249586741
            507603,1,2005                           =>  0.9944797229766845
            199257,3,2005                           =>  0.9928587055206299
            2602249,3,2004                          =>  0.9890585215886434
            184705,3,2004                           =>  0.9728035926818848
            447759,5,2005                           =>  0.9652122163772583
            1152594,3,2004                          =>  0.9619592666625977
            104237,5,2005                           =>  0.9515269517898559
            1473980,3,2005                          =>  0.9478832610448201
            2118461,4,2005                          =>  0.9315701317787171
            1037245,3,2005                          =>  0.9236405754089355
            1639792,1,2002                          =>  0.9183504740397136
            1227322,1,2003                          =>  0.9121313015619914
            2019240,3,2004                          =>   0.909924259185791
            1117152,5,2005                          =>  0.9050878302256267
            2040853,3,2004                          =>  0.9025738382339478
            1309838,5,2005                          =>  0.8964522886276245

What does "Top Terms" actually mean in this output? Thanks in advance!

1 answer:

Answer 0: (score: 1)

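Top Terms are the highest-weighted terms among the documents that belong to the cluster; the number to the right of each => is that term's weight. You can control how many top terms are printed with the -n / --numWords flag of the clusterdump command.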
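
To make this concrete, here is a minimal Python sketch of how such a list can be derived. It assumes, as far as I understand clusterdump, that the tool ranks the entries of the cluster's centroid vector by weight and looks each index up in the dictionary file. The function name top_terms, the index numbers, and the rounded weights are made up for illustration; this is not Mahout's actual code.

```python
import heapq


def top_terms(centroid, dictionary, num_words=10):
    """Return the num_words highest-weighted (term, weight) pairs of a centroid.

    centroid   -- dict mapping a term index to its weight in the cluster center
    dictionary -- dict mapping a term index to the term string
    """
    best = heapq.nlargest(num_words, centroid.items(), key=lambda kv: kv[1])
    # Fall back to the raw index if the dictionary has no entry for it.
    return [(dictionary.get(idx, str(idx)), weight) for idx, weight in best]


# Hypothetical data: the indices are invented, the terms and (rounded)
# weights are borrowed from the output in the question.
dictionary = {0: "25", 1: "31", 2: "1664010,5,2005", 3: "2439493,1,2003"}
centroid = {2: 1.2397, 0: 10.6707, 3: 1.1841, 1: 7.9995}

for term, weight in top_terms(centroid, dictionary, num_words=3):
    print(f"{term:<40}=> {weight}")
```

In the question the command was run with -n 20, which matches the 20 entries listed under Top Terms.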

For details on the flags, see the help:

mahout-distribution-0.9$ bin/mahout clusterdump -h

Also take a look at this similar question: Interpreting output from mahout clusterdumper