我必须抓取维基百科以获取国家/地区的HTML页面。我已经成功爬行了。现在要建立集群,我必须做KMeans。我正在使用Weka。
我已使用此代码将我的目录转换为arff格式: https://weka.wikispaces.com/file/view/TextDirectoryToArff.java 这是它的输出: enter image description here
然后我在Weka中打开该文件并使用以下参数执行StringToWordVector转换: 然后我表演了Kmeans。我得到的输出是:
=== Run information ===
Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 5000 -S 10
Relation: text_files_in_files-weka.filters.unsupervised.attribute.StringToWordVector-R1,2-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"-weka.filters.unsupervised.attribute.StringToWordVector-R-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"
Instances: 28
Attributes: 1040
[list of attributes omitted]
Test mode:evaluate on training data
=== Model and evaluation on training set ===
迭代次数:2 在群集误差平方内:1915.0448503841326 缺失值全局替换为mean / mode
群集质心:
Cluster#
Attribute Full Data 0 1
(28) (22) (6)
====================================================================================
.
.
.
.
.
bolsheviks 0.3652 0.3044 0.5878
book 0.3229 0.3051 0.3883
border 0.4329 0.5509 0
border-left-style 0.4329 0.5509 0
border-left-width 0.3375 0.4295 0
border-spacing 0.3124 0.3304 0.2461
border-width 0.5128 0.2785 1.372
boundary 0.309 0.3007 0.3392
brazil 0.381 0.3744 0.4048
british 0.4387 0.2232 1.2288
brown 0.2645 0.2945 0.1545
cache-control=max-age=87840 0.4913 0.4866 0.5083
california 0.5383 0.5085 0.6478
called 0.4853 0.6177 0
camp 0.4591 0.5451 0.1437
canada 0.3176 0.3358 0.251
canadian 0.2976 0.1691 0.7688
capable 0.2475 0.315 0
capita 0.388 0.1188 1.375
carbon 0.3889 0.445 0.1834
caribbean 0.4275 0.5441 0
carlsbad 0.548 0.5339 0.5998
caspian 0.4737 0.5345 0.2507
category 0.2216 0.2821 0
censorship 0.2225 0.0761 0.7596
center 0.4829 0.4074 0.7598
central 0.211 0.0805 0.6898
century 0.2645 0.2041 0.4862
chad 0.3636 0.0979 1.3382
challenger 0.5008 0.6374 0
championship 0.6834 0.8697 0
championships 0.2891 0.1171 0.9197
characteristics 0.237 0 1.1062
charon 0.5643 0.4745 0.8934
china
.
.
.
.
.
Time taken to build model (full training data) : 0.05 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 22 ( 79%)
1 6 ( 21%)
如何检查哪个DocId在哪个群集中?我搜索了很多但没找到任何东西。
此外,是否还有其他优秀的Kmeans Java库和聚集群集?