How can I make VectorAssembler not compress the data?

Time: 2018-01-30 08:44:24

Tags: scala apache-spark apache-spark-sql spark-dataframe apache-spark-mllib

I want to use VectorAssembler to convert multiple columns into a single column, but by default it compresses the data, and there is no option to turn that off.

The code I run is:

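import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

// Run in spark-shell, where sc and the implicits needed for toDF are already in scope.
val arr2 = Array((1,2,0,0,0),(1,2,3,0,0),(1,2,4,5,0),(1,2,2,5,6))
val df = sc.parallelize(arr2).toDF("a","b","c","e","f")
val colNames = Array("a","b","c","e","f")
// Merge the five integer columns into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(colNames)
  .setOutputCol("newCol")
val transDF = assembler.transform(df).select(col("newCol"))
transDF.show(false)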

The input DataFrame df is:

  +---+---+---+---+---+
  |  a|  b|  c|  e|  f|
  +---+---+---+---+---+
  |  1|  2|  0|  0|  0|
  |  1|  2|  3|  0|  0|
  |  1|  2|  4|  5|  0|
  |  1|  2|  2|  5|  6|
  +---+---+---+---+---+

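The result I get is shown below. The first row comes out as a sparse vector, but I expect every row to be a dense vector, for example [1.0,2.0,0.0,0.0,0.0] for the first row: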

+---------------------+
|newCol               |
+---------------------+
|(5,[0,1],[1.0,2.0])  |
|[1.0,2.0,3.0,0.0,0.0]|
|[1.0,2.0,4.0,5.0,0.0]|
|[1.0,2.0,2.0,5.0,6.0]|
+---------------------+

What should I do to get the result I expect?

1 answer:

Answer 0: (score: 1)

If you really want to force all vectors into their dense representation, you can do it with a user-defined function:

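import org.apache.spark.sql.functions.udf
// Convert any ML vector (sparse or dense) to its dense representation.
val toDense = udf((v: org.apache.spark.ml.linalg.Vector) => v.toDense)
transDF.select(toDense($"newCol")).show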

+--------------------+
|         UDF(newCol)|
+--------------------+
|[1.0,2.0,0.0,0.0,...|
|[1.0,2.0,3.0,0.0,...|
|[1.0,2.0,4.0,5.0,...|
|[1.0,2.0,2.0,5.0,...|
+--------------------+
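
For context on why only the first row is affected: as far as I can tell, VectorAssembler returns the `compressed` form of each assembled vector, which is whichever of the sparse or dense representations needs less storage. The sketch below is not part of the original answer. It reproduces the behaviour with the `org.apache.spark.ml.linalg` classes alone, using the first two rows from the question; the names `row1`, `row2`, `c1` and `c2` are just illustrative.

```scala
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}

// Row (1,2,0,0,0) from the question: only 2 of 5 entries are non-zero.
val row1: Vector = Vectors.dense(1.0, 2.0, 0.0, 0.0, 0.0)
// Row (1,2,3,0,0) from the question: 3 of 5 entries are non-zero.
val row2: Vector = Vectors.dense(1.0, 2.0, 3.0, 0.0, 0.0)

// compressed picks the representation that takes less storage.
val c1: Vector = row1.compressed
val c2: Vector = row2.compressed

println(c1)          // (5,[0,1],[1.0,2.0])   -> sparse
println(c2)          // [1.0,2.0,3.0,0.0,0.0] -> dense
println(c1.toDense)  // [1.0,2.0,0.0,0.0,0.0] -> what the UDF produces
```

Both forms hold the same values, so most ML stages treat them identically; the UDF is only needed if something downstream really requires a dense vector. If the column should keep its original name, `toDense($"newCol").as("newCol")` should do that.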