Question

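I followed the Reuters dataset clustering example in "Mahout in Action" and tested it successfully. To learn more about clustering, I tried the same sequence of steps to cluster some tweet data.

Here is the sequence of commands I used: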

mahout seqdirectory -c UTF-8 -i hdfs://-----:8020/user/hdfs/tweet/tweet.txt -o hdfs://-----:8020/user/hdfs/tweet/seqfiles

mahout seq2sparse -i hdfs://-----:8020/user/hdfs/tweet/seqfiles -o hdfs://----:8020/user/hdfs/tweet/vectors/ -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -nv

mahout kmeans -i hdfs://---:8020/user/hdfs/tweet/vectors/tfidf-vectors/ -c kmeans-centroids -cl -o hdfs://-----:8020/user/hdfs/tweet/kmeans-clusters -k 3 -ow -x 3 -dm org.apache.mahout.common.distance.CosineDistanceMeasure

mahout clusterdump -i hdfs://----:8020/user/hdfs/tweet/kmeans-clusters/clusters-3-final -d hdfs://----:8020/user/hdfs/tweet/vectors/dictionary.file-0 -dt sequencefile -b 100 -n 10 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir hdfs://-----:8020/user/hdfs/tweet/kmeans-clusters/clusteredPoints -o tweet_outdump.txt

The tweet_outdump.txt file contains the following data:

CL-0{n=1 c=[] r=[]}
Top Terms: 
Weight : [props - optional]: Point:
1.0: /tweet.txt =]
Inter-Cluster Density: NaN
Intra-Cluster Density: 0.0
CDbw Inter-Cluster Density: 0.0
CDbw Intra-Cluster Density: NaN
CDbw Separation: 0.0

I also tried this command:

mahout seqdumper -i hdfs://----:8020/user/hdfs/tweet/kmeans-clusters/clusteredPoints/part-m-00000

Key: 0: Value: 1.0: /tweet.txt =]
Count: 1

I would really appreciate some feedback here. Thanks in advance.

Answer 1

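The dataset you created contains only a single document.

Clearly, the clustering result is meaningless. There is no inter-cluster distance (because there is only one cluster), and the intra-cluster distance is 0 because there is only one object, and its distance to itself is 0.

So you already went wrong at the seqdirectory command: you passed it a single file, not a directory with one file per document.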
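
One way to get a directory with one file per document is to split the tweet file locally before uploading it. The following is only a minimal sketch, assuming the input holds one tweet per line; the class name SplitTweets and the tweet-NNNNNN.txt naming scheme are hypothetical and not part of Mahout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class SplitTweets {

    // Writes each non-blank line of `input` to its own file in `outDir`
    // and returns the number of files written.
    static int split(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
        int count = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // Hypothetical naming scheme: tweet-000000.txt, tweet-000001.txt, ...
            Path out = outDir.resolve(String.format("tweet-%06d.txt", count));
            Files.write(out, line.getBytes(StandardCharsets.UTF_8));
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: local tweet file, args[1]: local output directory
        int n = split(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Wrote " + n + " files");
    }
}
```

The resulting directory can then be copied to HDFS and passed to seqdirectory with -i. Many tiny files are a poor fit for HDFS, so this is probably only reasonable for small datasets.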

Answer 2

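As for your case, your dataset appears to consist of one large file in which each line represents a document. In that case, the seqdirectory command produces a sequence file containing only a single key/value pair, which is not what you want. So you should first write a simple MapReduce job that takes the dataset and assigns an ID to each line. Here you can use the line offset as the ID (the key), and the value is the line itself. You must also set the output format to SequenceFileOutputFormat. One more thing: the output key must be a Text, and the value is the UTF-8 encoded string wrapped in a Text object. Here is a simple MapReduce job: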

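import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;

public class TexToHadoopSeq {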

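    // Mapper: emits (line offset, line text) so that each input line
    // becomes its own document in the sequence file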
    public static class mapper extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, Text> {

        Text cle = new Text();
        Text valeur = new Text();

        @Override
        public void map(LongWritable key, Text values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {

            String record = values.toString();

            byte[] b = record.getBytes("UTF-8");

            valeur.set(b);
            cle.set(key.toString());
            output.collect(cle, valeur);

        }
    }

    // Identity reducer: passes every (key, value) pair through unchanged
    public static class Reduce1 extends MapReduceBase implements
            Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            while (values.hasNext()) {

                output.collect(key, values.next());

            }

        }

    }

    public static void main(String[] args) throws IOException {

        String inputdata = args[0];

        // Configure the job
        JobConf conf1 = new JobConf(TexToHadoopSeq.class);

        FileInputFormat.setInputPaths(conf1, new Path(inputdata)); // input text file
        FileOutputFormat.setOutputPath(conf1, new Path("output")); // job output directory

        conf1.setJarByClass(TexToHadoopSeq.class);
        conf1.setMapperClass(mapper.class);
        conf1.setReducerClass(Reduce1.class);

        conf1.setNumReduceTasks(1);

        conf1.setMapOutputKeyClass(Text.class);
        conf1.setMapOutputValueClass(Text.class);

        conf1.setOutputKeyClass(Text.class);
        conf1.setOutputValueClass(Text.class);
        conf1.setInputFormat(TextInputFormat.class); // key = byte offset, value = line
        conf1.setOutputFormat(SequenceFileOutputFormat.class); // write a SequenceFile

        // JobClient.runJob submits the job and blocks until it completes,
        // so no separate waitForCompletion() call is needed.
        RunningJob job = JobClient.runJob(conf1);

        if (job.isSuccessful()) {
            System.out.println("*****Conversion is Done*****");
        }

    }

}
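
To make the keys concrete: with TextInputFormat, the key handed to the mapper is the byte offset at which each line starts in the file. The sketch below imitates that in plain Java, without Hadoop, purely as an illustration. The class name LineOffsets is hypothetical, and the code assumes lines are separated by a single '\n' and encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LineOffsets {

    // Maps the UTF-8 byte offset of each line start to the line text,
    // assuming lines are separated by a single '\n'.
    static Map<Long, String> offsets(String content) {
        Map<Long, String> result = new LinkedHashMap<>();
        if (content.isEmpty()) {
            return result;
        }
        long offset = 0;
        for (String line : content.split("\n")) {
            result.put(offset, line);
            // Advance past the line's bytes plus the newline byte.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // Each entry corresponds to one (key, value) pair the mapper would emit.
        for (Map.Entry<Long, String> e : offsets("first tweet\nsecond tweet\n").entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Because offsets are counted in bytes rather than characters, a line containing multi-byte characters shifts the keys of all following lines accordingly. The keys are still unique within a single file, which is all that is needed for a document ID.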

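The next step is to create vectors from the sequence file (generated by the code above), so use: ./mahout seq2sparse -i "Directory of your sequential file in HDFS" -o "output" --maxDFPercent 85 --namedVector

You will then get the TFIDF directory. From there, continue with k-means or any other Mahout clustering algorithm. That's it.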
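
The --maxDFPercent option also suggests why a single-document corpus produces an empty vector: as I understand it, terms that occur in more than the given percentage of documents are dropped, and with one document every term occurs in 100% of them. This is likely why the centroid in the question prints as c=[]. The sketch below only illustrates that idea in plain Java. The class name MaxDfPruning is hypothetical, and the exact comparison Mahout uses internally is an assumption here.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class MaxDfPruning {

    // Returns the terms whose document frequency, as a percentage of all
    // documents, does not exceed maxDfPercent (assumed pruning rule).
    static Set<String> keptTerms(List<List<String>> docs, int maxDfPercent) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            // Count each term at most once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            double percent = 100.0 * e.getValue() / docs.size();
            if (percent <= maxDfPercent) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

With one document, keptTerms returns an empty set for any threshold below 100. With several documents, only terms shared by nearly all of them are removed.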

Need suggestions on Mahout clustering
