Here is my data:
1. 45,000 lines (each under 100 words) in a single file.
2. Key: line ID
3. Value: line (String)
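I converted these documents to vectors using the standard Mahout CLI (everything went fine). Parameters:
Number of clusters: 6, Iterations: 10
Result (ClusterDump): only 155 key:value entries
Can anyone help me solve this problem?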
Edit
Sample data:
No. data.
1 The MapReduce implementation of fuzzy k-means looks similar to that of the k-means.
2 Each entry in the sequence file has a key, which is the identifier of the vector.
...
45900 Fuzzy k-means has a parameter, m, called the fuzziness factor
Converted to a sequence file (verified with seqdumper)
<key:No.> <value:data>
...
45900
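A minimal sketch of how the entry count could be checked from a text dump, so the same check can be repeated at later stages. It assumes seqdumper writes one record per line beginning with "Key: "; the helper name count_dump_entries and the dump file name are hypothetical.

```shell
# Hypothetical helper: count the records in a text dump written by seqdumper.
# Assumption: every record is printed on its own line starting with "Key: "
# (header lines such as "Key class: ..." do not match this pattern).
count_dump_entries() {
  grep -c '^Key: ' "$1"
}

# Possible usage (input path is from the post; book-seq-dump.txt is a made-up name):
#   mahout-distribution-0.8/bin/mahout seqdumper -i /user/hadoop/book-seq -o book-seq-dump.txt
#   count_dump_entries book-seq-dump.txt
```

Counting records this way, rather than reading the last key, avoids relying on the keys being consecutive.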
Vector conversion
mahout-distribution-0.8/bin/mahout seq2sparse -i /user/hadoop/book-seq -o /user/hadoop/book-vector -ow -chunk 100 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2 --namedVector
K-means clustering
mahout-distribution-0.8/bin/mahout kmeans -i /user/hadoop/book-vector/tfidf-vectors -c /user/hadoop/book-initial-cluster -o /user/hadoop/book-kmeans-cluster -cd 0.1 -k 6 -x 10 -cl -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure
ClusterDump
Directory Structure
clusteredPoints
clusters-0
clusters-1
clusters-2-final
mahout-distribution-0.8/bin/mahout clusterdump -i /user/hadoop/book-kmeans-cluster/clusters-2-final -p /user/hadoop/book-kmeans-cluster/clusteredPoints -of TEXT -o clusterdump.txt -dm org.apache.mahout.common.distance.CosineDistanceMeasure
cat clusterdump.txt
155 Entries
Update
After vectorization, tfidf-vectors contains only 155 documents instead of ~45,000.
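One way to narrow this down would be to list which document IDs are present in the sequence file but absent from the vectors. The sketch below works on two seqdumper text dumps and assumes each record line looks like "Key: <id>: Value: <...>" and that the vector keys are the original document IDs. The helper names dump_keys and missing_keys, and the dump file names, are hypothetical.

```shell
# Hypothetical helper: print the sorted, unique keys found in a seqdumper text dump.
# Assumption: record lines have the form "Key: <id>: Value: <...>".
dump_keys() {
  awk -F': Value: ' '/^Key: / { sub(/^Key: /, "", $1); print $1 }' "$1" | sort -u
}

# Hypothetical helper: print keys that appear in the first dump but not in the second.
missing_keys() {
  mk_before=$(mktemp) || return 1
  mk_after=$(mktemp) || return 1
  dump_keys "$1" > "$mk_before"
  dump_keys "$2" > "$mk_after"
  # comm -23 keeps only the lines unique to the first (sorted) file.
  comm -23 "$mk_before" "$mk_after"
  rm -f "$mk_before" "$mk_after"
}

# Possible usage (input paths are from the post; the dump file names are made up):
#   mahout-distribution-0.8/bin/mahout seqdumper -i /user/hadoop/book-seq -o seq-dump.txt
#   mahout-distribution-0.8/bin/mahout seqdumper -i /user/hadoop/book-vector/tfidf-vectors -o tfidf-dump.txt
#   missing_keys seq-dump.txt tfidf-dump.txt | head
```

Looking at a few of the missing IDs in the sample data may show what the dropped lines have in common.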