Weka document clustering: document IDs in the output

Time: 2015-12-07 03:42:07

Tags: weka k-means hierarchical-clustering

I have to crawl Wikipedia to get the HTML pages for countries, and the crawl succeeded. Now I need to build clusters with K-means, and I am using Weka.

I converted my directory to ARFF format with this code: https://weka.wikispaces.com/file/view/TextDirectoryToArff.java. Its output was attached as an image (not available here).

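I then opened that file in Weka and applied the StringToWordVector filter (its parameters appear in the Relation line of the output below), then ran K-means. This is the output I got: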

    === Run information ===

    Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 5000 -S 10
    Relation:     text_files_in_files-weka.filters.unsupervised.attribute.StringToWordVector-R1,2-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"-weka.filters.unsupervised.attribute.StringToWordVector-R-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"
    Instances:    28
    Attributes:   1040
    [list of attributes omitted]
    Test mode:evaluate on training data

    === Model and evaluation on training set ===

    KMEANS

    Number of iterations: 2
    Within cluster sum of squared errors: 1915.0448503841326
    Missing values globally replaced with mean/mode

    Cluster centroids:

                                                                  Cluster#
    Attribute                                            Full Data          0          1
                                                              (28)       (22)        (6)
    ====================================================================================
    .
    .
    .
    .
    .
    bolsheviks                                              0.3652     0.3044     0.5878
    book                                                    0.3229     0.3051     0.3883
    border                                                  0.4329     0.5509          0
    border-left-style                                       0.4329     0.5509          0
    border-left-width                                       0.3375     0.4295          0
    border-spacing                                          0.3124     0.3304     0.2461
    border-width                                            0.5128     0.2785      1.372
    boundary                                                 0.309     0.3007     0.3392
    brazil                                                   0.381     0.3744     0.4048
    british                                                 0.4387     0.2232     1.2288
    brown                                                   0.2645     0.2945     0.1545
    cache-control=max-age=87840                             0.4913     0.4866     0.5083
    california                                              0.5383     0.5085     0.6478
    called                                                  0.4853     0.6177          0
    camp                                                    0.4591     0.5451     0.1437
    canada                                                  0.3176     0.3358      0.251
    canadian                                                0.2976     0.1691     0.7688
    capable                                                 0.2475      0.315          0
    capita                                                   0.388     0.1188      1.375
    carbon                                                  0.3889      0.445     0.1834
    caribbean                                               0.4275     0.5441          0
    carlsbad                                                 0.548     0.5339     0.5998
    caspian                                                 0.4737     0.5345     0.2507
    category                                                0.2216     0.2821          0
    censorship                                              0.2225     0.0761     0.7596
    center                                                  0.4829     0.4074     0.7598
    central                                                  0.211     0.0805     0.6898
    century                                                 0.2645     0.2041     0.4862
    chad                                                    0.3636     0.0979     1.3382
    challenger                                              0.5008     0.6374          0
    championship                                            0.6834     0.8697          0
    championships                                           0.2891     0.1171     0.9197
    characteristics                                          0.237          0     1.1062
    charon                                                  0.5643     0.4745     0.8934
    china
    .
    .
    .
    .
    .


    Time taken to build model (full training data) : 0.05 seconds

    === Model and evaluation on training set ===

    Clustered Instances

    0      22 ( 79%)
    1       6 ( 21%)

How can I check which DocId ended up in which cluster? I searched a lot but could not find anything.
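To make the goal concrete, here is a minimal sketch of K-means in plain Java that keeps each document ID next to its cluster label. It does not use the Weka API. The file names and the two-dimensional vectors are made-up toy data, and seeding the centroids with the first k rows is an assumption chosen for reproducibility (Weka seeds randomly via `-S`).

```java
import java.util.Arrays;

/** Toy k-means that reports which document landed in which cluster. */
public class DocClusterSketch {

    /** Squared Euclidean distance between two equal-length vectors. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given vector. */
    static int nearest(double[] vector, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(vector, centroids[c])
                    < squaredDistance(vector, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Lloyd's k-means. Row i of vectors belongs to document i, so the
     * returned array maps document index to cluster index.
     * Assumption: the first k rows are used as the initial centroids.
     */
    static int[] cluster(double[][] vectors, int k, int maxIterations) {
        int dims = vectors[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = vectors[c].clone();
        }
        int[] assignment = new int[vectors.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: move every document to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < vectors.length; i++) {
                int c = nearest(vectors[i], centroids);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: each centroid becomes the mean of its members.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int i = 0; i < vectors.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[i]][d] += vectors[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // leave an empty cluster's centroid unchanged
                }
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical document IDs and toy word-weight vectors.
        String[] docIds = {"france.html", "brazil.html", "germany.html", "chile.html"};
        double[][] vectors = {{1.0, 0.1}, {0.1, 1.0}, {0.9, 0.0}, {0.0, 0.8}};
        int[] assignment = cluster(vectors, 2, 100);
        for (int i = 0; i < docIds.length; i++) {
            System.out.println(docIds[i] + " -> cluster " + assignment[i]);
        }
    }
}
```

The point of the sketch is that the ID array and the vector array share row indices, so the mapping falls out of the assignment array. In Weka the same idea should apply: a built clusterer exposes `clusterInstance(Instance)`, which returns the cluster index for one instance, and the `AddCluster` filter appends that index as an extra attribute, provided the document ID is still kept alongside the data.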

Also, are there any other good Java libraries for K-means and agglomerative clustering?
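For reference, agglomerative clustering starts with one cluster per document and repeatedly merges the two closest clusters. The following is a rough plain-Java sketch under stated assumptions: single linkage, Euclidean distance, and stopping once a target number of clusters remains. The class and method names are hypothetical and do not come from any library; Weka itself also ships a `HierarchicalClusterer`.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy single-linkage agglomerative clustering over document vectors. */
public class AgglomerativeSketch {

    /** Euclidean distance between two equal-length vectors. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Single linkage: the smallest distance between any two members. */
    static double linkage(List<Integer> a, List<Integer> b, double[][] vectors) {
        double best = Double.POSITIVE_INFINITY;
        for (int i : a) {
            for (int j : b) {
                best = Math.min(best, distance(vectors[i], vectors[j]));
            }
        }
        return best;
    }

    /**
     * Merges the closest pair of clusters until targetClusters remain.
     * Each returned cluster is a list of document (row) indices.
     */
    static List<List<Integer>> cluster(double[][] vectors, int targetClusters) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        int target = Math.max(1, targetClusters);
        while (clusters.size() > target) {
            int bestA = 0;
            int bestB = 1;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = linkage(clusters.get(a), clusters.get(b), vectors);
                    if (d < bestDist) {
                        bestDist = d;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            // bestB > bestA, so removing bestB does not shift bestA.
            List<Integer> merged = clusters.remove(bestB);
            clusters.get(bestA).addAll(merged);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Toy vectors; row index stands in for the document ID.
        double[][] vectors = {{1.0, 0.1}, {0.1, 1.0}, {0.9, 0.0}, {0.0, 0.8}};
        System.out.println(cluster(vectors, 2));
    }
}
```

This naive version recomputes every pairwise linkage on each merge, which is fine for a few dozen documents such as the 28 instances here but would be slow on large collections.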

0 answers:

No answers yet