
Asked: 2017-11-30 06:36:15

Tags: scala apache-spark

I have the table below, and in order to perform a join I generate a sequence of numbers as a rowId column, but this throws the error shown further down. What am I doing wrong? Please help.

fListVec: org.apache.spark.sql.DataFrame = [features: vector]
+-----------------------------------------------------------------------------+
|features                                                                     |
+-----------------------------------------------------------------------------+
|[2.5046410000000003,2.1487149999999997,1.0884870000000002,3.5877090000000003]|
|[0.9558040000000001,0.9843780000000002,0.545025,0.9979860000000002]          |
+-----------------------------------------------------------------------------+

Code:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

val fListrdd = fListVec.rdd
    .map{case Row(features: Vector) => features}
    .zipWithIndex()
    .toDF("features","rowId")    

fListrdd.createOrReplaceTempView("featuresTable")
val f = spark.sql("SELECT features, rowId from featuresTable")
f.show(false)

Output:

  

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 206.0 failed 1 times, most recent failure: Lost task 0.0 in stage 206.0 (TID 1718, localhost, executor driver): scala.MatchError: [[2.5046410000000003,2.1487149999999997,1.0884870000000002,3.5877090000000003]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
  at $$$$4896e3e877b134a87d9ee46b238e22$$$$$anonfun$1.apply(<console>:193)
  at $$$$4896e3e877b134a87d9ee46b238e22$$$$$anonfun$1.apply(<console>:193)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1762)
  at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
  at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
  at org.apache.spark.rdd.ZippedWithIndexRDD.<init>(ZippedWithIndexRDD.scala:50)
  at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
  at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.zipWithIndex(RDD.scala:1292)
  ... 101 elided
Caused by: scala.MatchError: [[2.5046410000000003,2.1487149999999997,1.0884870000000002,3.5877090000000003]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
  at $$$$4896e3e877b134a87d9ee46b238e22$$$$$anonfun$1.apply(<console>:193)
  at $$$$4896e3e877b134a87d9ee46b238e22$$$$$anonfun$1.apply(<console>:193)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1762)
  at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
  at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  ... 3 more

Expected output:

features                 | rowId
-------------------------+------
[2.5046410000000003,...] |     0
[0.9558040000000001,...] |     1

2 Answers:

Answer 0 (score: 2):

You have to write a map function in between so that the dataType is defined for the new dataframe you are creating:

// DenseVector must be in scope for the cast below; ml.linalg is assumed here, matching the question's import
import org.apache.spark.ml.linalg.DenseVector

val fListrdd = fListVec.rdd
  .map{case Row(features) => features}
  .zipWithIndex()
  .map(x => (x._1.asInstanceOf[DenseVector], x._2.toInt))  // cast each element so toDF can infer the schema
  .toDF("features","rowId")

Only the .map(x => (x._1.asInstanceOf[DenseVector], x._2.toInt)) line has been added.

You can go one step further and create a dataset. I would personally recommend a dataset, since datasets are a type-safe and optimized form of dataframes.

For that you need a case class:

case class features(features: DenseVector, rowId: Int)

and plug the features case class into the solution above so that you can call the .toDS API and create a type-safe dataset:

val fListDS = fListVec.rdd
  .map{case Row(features: DenseVector) => features}
  .zipWithIndex()
  .map(x => features(x._1.asInstanceOf[DenseVector], x._2.toInt))  // wrap each (vector, index) pair in the case class
  .toDS
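
Since the stated goal was a join, here is a minimal sketch of how the generated rowId column can then be used; otherDF is a hypothetical second table, assumed to have been indexed with zipWithIndex in the same way:

// Hypothetical second table carrying the rows to join with, keyed by the same rowId
val otherDF = Seq(("a", 0), ("b", 1)).toDF("label", "rowId")

// Equi-join the vector table (fListrdd from the first snippet) with it on the generated rowId
val joined = fListrdd.join(otherDF, Seq("rowId"))
joined.show(false)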

Answer 1 (score: 1):

You're almost there; you just need to specify the correct vector type, DenseVector:

import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.Row

val fList = Seq(
  (Seq(2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003)),
  (Seq(0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002))
).toDF("features")

def seqToVec = udf(
  (s: Seq[Double]) => new DenseVector(s.toArray)
)

val fListVec = fList.withColumn("features", seqToVec($"features"))
// fListVec: org.apache.spark.sql.DataFrame = [features: vector]

val fListrdd = fListVec.rdd.
  map{ case Row(features: DenseVector) => features }.
  zipWithIndex.
  toDF("features", "rowId")  

fListrdd.show
// +--------------------+-----+
// |            features|rowId|
// +--------------------+-----+
// |[2.50464100000000...|    0|
// |[0.95580400000000...|    1|
// +--------------------+-----+
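
As a side note, a quick way to check which Vector class a vector column actually holds (and therefore which type to pattern-match on) is to inspect the column's user-defined type; a small sketch, assuming the same spark-shell session with fListVec defined as above:

// Prints the UDT backing the column: org.apache.spark.mllib.linalg.VectorUDT for the old
// mllib vectors, org.apache.spark.ml.linalg.VectorUDT for the new ml ones
fListVec.schema("features").dataType.getClass.getName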