I am a Chinese student and a newcomer to Mahout. (Please forgive my poor English :-P) I have a few thousand formatted Chinese articles in files, and I want to cluster them.
(Mahout 1.0, Hadoop 2.5.1)
First, I used:
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        Writer.file(outPath), Writer.keyClass(Text.class),
        Writer.valueClass(Text.class));
File[] news = input.listFiles();
BufferedReader br;
String data;
Pattern pattern = Pattern.compile("^.*?@(?!http).*?@.*?(?=\\t)");
Matcher matcher;
for (int i = 0; i < news.length; i++) {
    br = new BufferedReader(new FileReader(news[i]));
    while ((data = br.readLine()) != null) {
        matcher = pattern.matcher(data);
        if (matcher.find()) {
            // matcher.group() returns the title; the value is the article content
            writer.append(new Text(matcher.group()), new Text(data
                    .replaceAll("^.*?@.*?@.*?\\t|http.*?$", "")
                    .replaceAll("@|\\s*", " ")));
        }
    }
    br.close();
}
writer.sync();
writer.close();
Then I got a sequence file containing all the articles.
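To sanity-check that file, something like the following should print the first few key/value pairs back out (just a rough sketch; it reuses the conf and outPath objects from above):

SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(outPath));
Text key = new Text();
Text value = new Text();
int shown = 0;
while (reader.next(key, value) && shown < 5) {
    // print the title and the beginning of the content
    String content = value.toString();
    System.out.println(key + " => "
            + content.substring(0, Math.min(50, content.length())));
    shown++;
}
reader.close();

Each key should be a title and each value the article content; if the values already look empty or mangled here, the problem is in this step rather than in the vectorization.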
Then the next piece of code:
int minSupport = 2;
int minDf = 1;
int maxDFPercent = 96;
int maxNGramSize = 1;
float minLLRValue = LLRReducer.DEFAULT_MIN_LLR;
int reduceTasks = 1;
int chunkSize = 200;
float norm = 2;
boolean sequentialAccessOutput = false;
boolean namedVector = false;
boolean logNormalize = false;
// here I neglect something inessential
Class<? extends Analyzer> analyzerClass = IKAnalyzer.class;

DocumentProcessor.tokenizeDocuments(new Path(inputDir),
        analyzerClass.asSubclass(Analyzer.class), tokenizedPath, conf);
DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
        new Path(outputDir), tfDirName, conf, minSupport, maxNGramSize,
        minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
        sequentialAccessOutput, namedVector);
Pair<Long[], List<Path>> docFrequenciesFeatures = TFIDFConverter
        .calculateDF(new Path(tfDirName), new Path(outputDir), conf,
                chunkSize);
TFIDFConverter.processTfIdf(new Path(tfDirName), new Path(outputDir),
        conf, docFrequenciesFeatures, minDf, maxDFPercent, norm,
        logNormalize, sequentialAccessOutput, namedVector, reduceTasks);

Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
Path canopyCentroids = new Path(outputDir, "canopy-centroids");
Path clusterOutput = new Path(outputDir, "clusters");
CanopyDriver.run(conf, vectorsFolder, canopyCentroids,
        new CosineDistanceMeasure(), 0.7, 0.3, true, 0.1, false);
KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids,
        "clusters-0-final"), clusterOutput, 0.01, 20, true, 0.1, false);
After a few minutes the program reaches TFIDFConverter.processTfIdf(...) and processTfIdf finishes, but the resulting part-r-00000 file is only 90 B. The next call, the canopy step, then throws java.lang.IndexOutOfBoundsException: Index: 0, Size: 0.
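If it helps, this is roughly how I try to check whether tfidf-vectors is really empty, assuming I am using Mahout's SequenceFileDirIterable correctly (the path matches the vectorsFolder above):

// sketch: count the vectors actually written under tfidf-vectors
Path tfidfVectors = new Path(outputDir, "tfidf-vectors");
int count = 0;
for (Pair<Writable, VectorWritable> record :
        new SequenceFileDirIterable<Writable, VectorWritable>(
                tfidfVectors, PathType.LIST, PathFilters.partFilter(), conf)) {
    count++;
}
System.out.println("number of tf-idf vectors: " + count);

If this counts 0 vectors, then nothing survives the tf-idf step, which would explain why CanopyDriver fails on an empty input.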
Does anyone know what mistake I have made? Thank you very much :))