Question

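I have a dataset as a CSV file. It has around 50 columns, most of which are categorical. I plan to train a RandomForest multiclass classifier and apply it to a new test dataset.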

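The pain point is handling the categorical variables. What is the best way to handle them? I read the Pipeline guide on the Spark website, http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline, which creates a DataFrame from a hard-coded sequence whose features are space-delimited strings. That looks very specific to that example. I would like to do the same thing it does with HashingTF to get features, but using the CSV file I have.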

In short, I want to achieve the same thing as in the link, but with a CSV file.

Any suggestions?

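EDIT: Data -> 50 features, 100k rows, most of them alphanumeric categorical. I am pretty new to MLlib, so I am struggling to find the right pipeline for my CSV data. I tried creating a DataFrame from the file, but I am confused about how to encode the categorical columns. My doubts are as follows: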

1. The example in the link above tokenizes the data and uses it, but I have a DataFrame.
2. Even if I use a StringIndexer, should I write an indexer for every column? Shouldn't there be one method that accepts multiple columns?
3. How do I get the original label back from the StringIndexer when showing the prediction?
4. For new test data, how do I keep the encoding consistent for every column?

Answer 1

I suggest having a look at the feature transformers, http://spark.apache.org/docs/ml-features.html, in particular StringIndexer and VectorAssembler.
