Question

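I am trying to use Mahout to classify a CSV file. My understanding is that I first need to convert the data in the CSV into vectors that one of Mahout's classification algorithms can consume. My CSV file contains text and word-like values, and there are multiple classes.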


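I have searched here and found some vague explanations of how to do this, but could not find any examples. Can anyone provide a simple example of how to achieve this? Or is there a utility that does it for you?

I feel this would be a very common task, but I cannot find any clear examples.

Any help is much appreciated.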

Answer 1

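You have some text and word-like values, so you should take the 20 newsgroups example as inspiration. It is a good example, and you can easily reproduce the code with a CSV file.

Here is a working link to the 20 newsgroups example for the latest Mahout version: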

https://github.com/jpatanooga/MahoutExamples/blob/master/src/main/java/com/cloudera/mahout/classification/sgd/TwentyNewsgroups.java

Adapt it with a countWords method changed to use the TokenStream object. Here is code that works with the latest version of Mahout:

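import java.io.IOException;
import java.io.Reader;
import java.util.Collection;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

private static void countWords(Analyzer analyzer, Collection<String> words, Reader in) throws IOException {

        // use the provided analyzer to tokenize the input stream
        TokenStream ts = analyzer.tokenStream("text", in);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset(); // must be called before the first incrementToken()

        // for each word in the stream, minus non-word stuff, add word to collection
        while (ts.incrementToken()) {
            words.add(term.toString());
        }
        ts.end();
        ts.close();

        /*overallCounts.addAll(words);*/
    }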
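
The words collected by countWords still have to become a numeric vector before a classifier can use them. As far as I understand it, the 20 newsgroups example does this with Mahout's hashed feature encoders (for example StaticWordValueEncoder writing into a RandomAccessSparseVector). The block below is only a plain-Java sketch of that hashing idea, with no Mahout classes involved; the class name HashedVectorSketch, the method encode, and the vector size of 16 are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a hashed feature encoder; this is NOT Mahout's API.
class HashedVectorSketch {

    /** Adds 1.0 to the bucket each word hashes to, giving a fixed-length vector. */
    static double[] encode(Iterable<String> words, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String word : words) {
            // floorMod keeps the index non-negative even for negative hash codes
            int bucket = Math.floorMod(word.hashCode(), numFeatures);
            vector[bucket] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        // Assumed input: the tokens that countWords collected for one CSV row.
        List<String> words = Arrays.asList("mahout", "csv", "mahout");
        System.out.println(Arrays.toString(encode(words, 16)));
    }
}
```

Different words can land in the same bucket, which is the usual trade-off of feature hashing; a larger numFeatures makes such collisions less likely.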

I hope this helps. I adapted this example to a CSV file and it worked.
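
For the CSV adaptation itself, the main difference from the newsgroups layout is that each row has to be split into a class label and its text before tokenizing. A minimal sketch of that step follows, assuming the label sits in the first column and the text in the second; the column order, the class name CsvLineSketch, and the sample row are assumptions, and the splitter does not handle quoted fields that span several lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: splits one CSV line into fields.
class CsvLineSketch {

    /** Splits a single CSV line on commas, honouring double-quoted fields. */
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote inside a quoted field
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Assumed layout: label in column 0, free text in column 1.
        List<String> row = splitCsvLine("sports,\"Great game, great crowd\"");
        System.out.println("label=" + row.get(0) + " text=" + row.get(1));
    }
}
```

The text field can then be wrapped in a java.io.StringReader and passed to countWords, while the label is mapped to the target class index.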
