I am a Chinese student and a newcomer to Mahout. (Please forgive my poor English :-P) I have a few thousand formatted Chinese articles in files, and I want to cluster them.
(Mahout 1.0, Hadoop 2.5.1)
First, I used:
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        Writer.file(outPath), Writer.keyClass(Text.class),
        Writer.valueClass(Text.class));
File[] news = input.listFiles();
BufferedReader br;
String data;
Pattern pattern = Pattern.compile("^.*?@(?!http).*?@.*?(?=\\t)");
Matcher matcher;
for (int i = 0; i < news.length; i++) {
    br = new BufferedReader(new FileReader(news[i]));
    while ((data = br.readLine()) != null) {
        matcher = pattern.matcher(data);
        if (matcher.find()) {
            // matcher.group() returns the title; the value is the article content
            writer.append(new Text(matcher.group()), new Text(data
                    .replaceAll("^.*?@.*?@.*?\\t|http.*?$", "")
                    .replaceAll("@|\\s*", " ")));
        }
    }
    br.close();
}
writer.sync();
writer.close();
Then I got a sequence file containing all the articles.
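To sanity-check that file, something like the following should print the first few key/value pairs back out (just a rough sketch; it reuses the conf and outPath objects from above):

SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(outPath));
Text key = new Text();
Text value = new Text();
int shown = 0;
while (reader.next(key, value) && shown < 5) {
    // print the title and the beginning of the content
    String content = value.toString();
    System.out.println(key + " => "
            + content.substring(0, Math.min(50, content.length())));
    shown++;
}
reader.close();

Each key should be a title and each value the article content; if the values already look empty or mangled here, the problem is in this step rather than in the vectorization.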
Then the next piece of code:
int minSupport = 2;
int minDf = 1;
int maxDFPercent = 96;
int maxNGramSize = 1;
float minLLRValue = LLRReducer.DEFAULT_MIN_LLR;
int reduceTasks = 1;
int chunkSize = 200;
float norm = 2;
boolean sequentialAccessOutput = false;
boolean namedVector = false;
boolean logNormalize = false;
// here I neglect something inessential
Class<? extends Analyzer> analyzerClass = IKAnalyzer.class;

DocumentProcessor.tokenizeDocuments(new Path(inputDir),
        analyzerClass.asSubclass(Analyzer.class), tokenizedPath, conf);
DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
        new Path(outputDir), tfDirName, conf, minSupport, maxNGramSize,
        minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
        sequentialAccessOutput, namedVector);
Pair<Long[], List<Path>> docFrequenciesFeatures = TFIDFConverter
        .calculateDF(new Path(tfDirName), new Path(outputDir), conf,
                chunkSize);
TFIDFConverter.processTfIdf(new Path(tfDirName), new Path(outputDir),
        conf, docFrequenciesFeatures, minDf, maxDFPercent, norm,
        logNormalize, sequentialAccessOutput, namedVector, reduceTasks);

Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
Path canopyCentroids = new Path(outputDir, "canopy-centroids");
Path clusterOutput = new Path(outputDir, "clusters");
CanopyDriver.run(conf, vectorsFolder, canopyCentroids,
        new CosineDistanceMeasure(), 0.7, 0.3, true, 0.1, false);
KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids,
        "clusters-0-final"), clusterOutput, 0.01, 20, true, 0.1, false);
After a few minutes the program reaches TFIDFConverter.processTfIdf(...) and processTfIdf finishes, but the resulting part-r-00000 file is only 90 B. The next call, the canopy step, then throws java.lang.IndexOutOfBoundsException: Index: 0, Size: 0.
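If it helps, this is roughly how I try to check whether tfidf-vectors is really empty, assuming I am using Mahout's SequenceFileDirIterable correctly (the path matches the vectorsFolder above):

// sketch: count the vectors actually written under tfidf-vectors
Path tfidfVectors = new Path(outputDir, "tfidf-vectors");
int count = 0;
for (Pair<Writable, VectorWritable> record :
        new SequenceFileDirIterable<Writable, VectorWritable>(
                tfidfVectors, PathType.LIST, PathFilters.partFilter(), conf)) {
    count++;
}
System.out.println("number of tf-idf vectors: " + count);

If this counts 0 vectors, then nothing survives the tf-idf step, which would explain why CanopyDriver fails on an empty input.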
Does anyone know what mistake I have made? Thank you very much :))