如何将数据帧转换为标签特征向量?

时间:2017-07-18 08:05:44

标签: scala apache-spark machine-learning

我在scala中运行逻辑回归模型,我有一个如下所示的数据框:

DF

+-----------+------------+
|x          |y           |
+-----------+------------+
|          0|           0|
|          0|          33|
|          0|          58|
|          0|          96|
|          0|           1|
|          1|          21|
|          0|          10|
|          0|          65|
|          1|           7|
|          1|          28|
+-----------+------------+

我需要将其转换为类似这样的内容

+-----+------------------+
|label|      features    | 
+-----+------------------+
|  0.0|(1,[1],[0])       |
|  0.0|(1,[1],[33])      |
|  0.0|(1,[1],[58])      |
|  0.0|(1,[1],[96])      |
|  0.0|(1,[1],[1])       |
|  1.0|(1,[1],[21])      |
|  0.0|(1,[1],[10])      |
|  0.0|(1,[1],[65])      |
|  1.0|(1,[1],[7])       |
|  1.0|(1,[1],[28])      | 
+-----------+------------+

我试过

 val lr = new LogisticRegression()
           .setMaxIter(10)
           .setRegParam(0.3)
           .setElasticNetParam(0.8)

      val assembler = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("Feature")
  var lrModel=  lr.fit(daf.withColumnRenamed("x","label").withColumnRenamed("y","features"))

感谢任何帮助。

1 个答案:

答案 0 :(得分:2)

鉴于dataframe

+---+---+
|x  |y  |
+---+---+
|0  |0  |
|0  |33 |
|0  |58 |
|0  |96 |
|0  |1  |
|1  |21 |
|0  |10 |
|0  |65 |
|1  |7  |
|1  |28 |
+---+---+

如下所示

val assembler =  new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")

  val output = assembler.transform(df).select($"x".cast(DoubleType).as("label"), $"features")
output.show(false)

会给你结果

+-----+----------+
|label|features  |
+-----+----------+
|0.0  |(2,[],[]) |
|0.0  |[0.0,33.0]|
|0.0  |[0.0,58.0]|
|0.0  |[0.0,96.0]|
|0.0  |[0.0,1.0] |
|1.0  |[1.0,21.0]|
|0.0  |[0.0,10.0]|
|0.0  |[0.0,65.0]|
|1.0  |[1.0,7.0] |
|1.0  |[1.0,28.0]|
+-----+----------+

现在使用LogisticRegression很容易

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(output)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

您的输出为

Coefficients: [1.5672602877378823,0.0] Intercept: -1.4055020984891717