Question

I have a CSV file whose rows look like this:

0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.

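My goal is to use a decision tree to predict the last column (normal or something else).

As you can see, the fields in my CSV file are not all of the same type: there are strings, ints and doubles.

At first I wanted to create an RDD and use it like this: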

def load_part1(file: String): RDD[(Int, String, String,String,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int, Int, Int, Double, Double, Double, Double, Double, Double, Double, Int, Int, Double, Double, Double, Double, Double, Double, Double, Double, String)] = {
        val data = context.textFile(file)
        val res = data.map(x => {
            val s = x.split(",")
            (s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
})
        .persist(StorageLevel.MEMORY_AND_DISK)
    return res
    }

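But the compiler rejects this, because tuples in Scala cannot have more than 22 fields.

Now I am stuck, because I don't know how to load and parse my CSV file so that I can use it to train and test a decision tree.

When I look at the decision tree examples in the Spark docs, they use the LIBSVM format: is that the only format I can use? The thing is:

Not all of my features have the same type: do I need to convert all of them to the same type?
My label is a string, not an integer: do I need to convert the labels to integers in order to use the decision tree classifier?

I tried looking at some topics like this one or this one, but they are quite different: in the first link all the features have the same format (double), and for the second I tried loading and parsing my data like this: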

 val csv = context.textFile("/home/hvfd8529/Datasets/KDDCup99/kddcup.data_10_percent_corrected")  // original file
 val data = csv.map(line => line.split(",").map(elem => elem.trim))

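But it took my computer almost 2 minutes to finish, and on top of that it made it crash?!

I am considering writing a small Python script to convert all the string fields to integers, so that I can apply the CSV2LibSVM Python code and then use the decision tree classifier as in the example from the Spark documentation. But is that really necessary? Can't I use my CSV file directly?

I am new to Scala and Spark :) Thank you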

Answer 1

Here is how to do this in Spark 2.1. First, define the schema of the CSV:

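    import org.apache.spark.ml.Pipeline;
    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.ml.PipelineStage;
    import org.apache.spark.ml.feature.StringIndexer;
    import org.apache.spark.ml.feature.StringIndexerModel;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.ml.regression.DecisionTreeRegressor;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // Assumes an existing SparkSession named `spark`.
    StructType schema = new StructType(new StructField[]{
            new StructField("col1", DataTypes.StringType, true, Metadata.empty()),
            new StructField("col2", DataTypes.DoubleType, true, Metadata.empty()),
            // Label column used by the regressor below; it has to be part of the schema.
            new StructField("commission", DataTypes.DoubleType, true, Metadata.empty())});

    // Apply the schema when reading; without it every column is read as a string.
    Dataset<Row> data = spark.read().format("csv").schema(schema).load("data.csv");

    // Map the string column to numeric indices; rows with unseen values are skipped.
    StringIndexerModel indexer = new StringIndexer()
            .setInputCol("col1")
            .setOutputCol("col1Indexed").setHandleInvalid("skip").fit(data);

    // Merge the indexed column and the numeric column into one feature vector.
    VectorAssembler assembler = new VectorAssembler()
            .setInputCols(new String[]{"col1Indexed", "col2"})
            .setOutputCol("features");

    // Prepare data
    Dataset<Row>[] splits = data.randomSplit(new double[]{0.7, 0.3});
    Dataset<Row> trainingData = splits[0];
    Dataset<Row> testData = splits[1];

    // This example is a regression; for a categorical label, DecisionTreeClassifier has the same setters.
    DecisionTreeRegressor dt = new DecisionTreeRegressor()
            .setFeaturesCol("features").setLabelCol("commission").setPredictionCol("prediction");
    Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{indexer, assembler, dt});

    // Train model. This also runs the indexer.
    PipelineModel model = pipeline.fit(trainingData);

    // Make predictions.
    Dataset<Row> predictions = model.transform(testData);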

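Basically, you have to index the string features with StringIndexer and merge the resulting columns into a single feature vector with VectorAssembler. (The code is in Java, but I think it is pretty straightforward.)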
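
To make the two steps more concrete, here is a rough plain-Java sketch (no Spark involved) of the idea behind them: give each distinct string an index ordered by descending frequency, which is StringIndexer's default ordering, then put the index next to a numeric value in one feature array. The class and method names (StringIndexSketch, fitIndex, assemble) and the sample values "udp" and "icmp" are made up for illustration. This only mimics the concept and is not how Spark implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
final class StringIndexSketch {

    // Assign each distinct value an index: most frequent -> 0.0, next -> 1.0, ...
    // Ties are broken alphabetically here so the result is deterministic.
    static Map<String, Double> fitIndex(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        List<String> ordered = new ArrayList<>(counts.keySet());
        ordered.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        Map<String, Double> index = new HashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            index.put(ordered.get(i), (double) i);
        }
        return index;
    }

    // Combine an indexed categorical value and a numeric value into one feature array.
    // Throws NullPointerException if the category was not seen by fitIndex.
    static double[] assemble(Map<String, Double> index, String category, double numeric) {
        return new double[]{index.get(category), numeric};
    }

    public static void main(String[] args) {
        // Sample protocol values; only "tcp" appears in the question's row.
        List<String> protocols = Arrays.asList("tcp", "udp", "tcp", "icmp", "tcp", "udp");
        Map<String, Double> index = fitIndex(protocols);
        double[] features = assemble(index, "udp", 181.0);
        System.out.println(index + " -> " + Arrays.toString(features));
    }
}
```

In the real pipeline the same mapping is learned by fit on the StringIndexer, and the assembled values end up in the "features" column.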

Answer 2

You can use List[Any]:

def load_part1(file: String): RDD[List[Any]] = {
        val data = context.textFile(file)
        val res = data.map(x => {
            val s = x.split(",")
            List(s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
})
        .persist(StorageLevel.MEMORY_AND_DISK)
    return res
    }

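If you know in advance that your text fields have low cardinality (if you see what I mean), you can encode them numerically with something like one-hot encoding and convert your integers to doubles, so that what you return contains only doubles (for example an RDD[List[Double]]).

Here is some information on one-hot encoding and similar ways of representing categorical data for machine learning models: http://www.kdnuggets.com/2015/12/beyond-one-hot-exploration-categorical-variables.html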
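
As a rough illustration of that suggestion, here is a minimal plain-Java sketch (no Spark) that one-hot encodes a low-cardinality text field and appends the integer fields as doubles, so the whole row ends up as doubles. The names OneHotSketch, categories, oneHot and encodeRow are hypothetical, and the category values other than "tcp" are assumed sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper class, for illustration only.
final class OneHotSketch {

    // Collect the distinct categories in sorted order so each one gets a stable position.
    static List<String> categories(List<String> values) {
        return new ArrayList<>(new TreeSet<>(values));
    }

    // One slot per category: 1.0 at the position of the value, 0.0 elsewhere.
    // A value that is not in the category list yields an all-zero vector.
    static double[] oneHot(List<String> categories, String value) {
        double[] encoded = new double[categories.size()];
        int position = categories.indexOf(value);
        if (position >= 0) {
            encoded[position] = 1.0;
        }
        return encoded;
    }

    // Build an all-double row: one-hot slots for the text field, then the integer fields as doubles.
    static double[] encodeRow(List<String> categories, String textField, int... intFields) {
        double[] hot = oneHot(categories, textField);
        double[] row = Arrays.copyOf(hot, hot.length + intFields.length);
        for (int i = 0; i < intFields.length; i++) {
            row[hot.length + i] = intFields[i];
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> cats = categories(Arrays.asList("tcp", "udp", "tcp", "icmp"));
        // 181 and 5450 are the two integer fields that follow the text fields in the question's row.
        System.out.println(cats + " -> " + Arrays.toString(encodeRow(cats, "tcp", 181, 5450)));
    }
}
```

Note that each text field adds as many columns as it has distinct values, which is why this only makes sense when the cardinality is low.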
