
Asked: 2017-11-30 06:36:15

Tags: scala apache-spark

I have the table below, and in order to perform a join I generate a sequence of numbers as a rowId column, but this throws the error shown further down. What am I doing wrong? Please help.

fListVec: org.apache.spark.sql.DataFrame = [features: vector]
+-----------------------------------------------------------------------------+
|features                                                                     |
+-----------------------------------------------------------------------------+
|[2.5046410000000003,2.1487149999999997,1.0884870000000002,3.5877090000000003]|
|[0.9558040000000001,0.9843780000000002,0.545025,0.9979860000000002]          |
+-----------------------------------------------------------------------------+

Code:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

val fListrdd = fListVec.rdd
    .map{case Row(features: Vector) => features}
    .zipWithIndex()
    .toDF("features","rowId")    

fListrdd.createOrReplaceTempView("featuresTable")
val f = spark.sql("SELECT features, rowId from featuresTable")
f.show(false)

Output:

  

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 206.0 failed 1 times, most recent failure: Lost task 0.0 in stage 206.0 (TID 1718, localhost, executor driver): scala.MatchError: [[2.5046410000000003,2.1487149999999997,1.0884870000000002,3.5877090000000003]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
  at $$$$4896e3e877b134a87d9ee46b238e22$$$$$anonfun$1.apply(<console>:193)
  at $$$$4896e3e877b134a87d9ee46b238e22$$$$$anonfun$1.apply(<console>:193)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1762)
  at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
  at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
  at org.apache.spark.rdd.ZippedWithIndexRDD.<init>(ZippedWithIndexRDD.scala:50)
  at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
  at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.zipWithIndex(RDD.scala:1292)
  ... 101 elided
Caused by: scala.MatchError: [[2.5046410000000003,2.1487149999999997,1.0884870000000002,3.5877090000000003]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
  at $$$$4896e3e877b134a87d9ee46b238e22$$$$$anonfun$1.apply(<console>:193)
  at $$$$4896e3e877b134a87d9ee46b238e22$$$$$anonfun$1.apply(<console>:193)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1762)
  at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
  at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  ... 3 more

Expected output:

features                 | rowId
-------------------------+------
[2.5046410000000003,...] |     0
[0.9558040000000001,...] |     1

2 Answers:

Answer 0 (score: 2):

You have to write a map function in between so that the dataType is defined for the new dataframe you are creating:

// DenseVector must be in scope for the cast below; ml.linalg is assumed here, matching the question's import
import org.apache.spark.ml.linalg.DenseVector

val fListrdd = fListVec.rdd
  .map{case Row(features) => features}
  .zipWithIndex()
  .map(x => (x._1.asInstanceOf[DenseVector], x._2.toInt))  // cast each element so toDF can infer the schema
  .toDF("features","rowId")

Only the .map(x => (x._1.asInstanceOf[DenseVector], x._2.toInt)) line has been added.

You can go one step further and create a dataset. I would personally recommend a dataset, since datasets are a type-safe and optimized form of dataframes.

For that you need a case class:

case class features(features: DenseVector, rowId: Int)

and plug the features case class into the solution above so that you can call the .toDS API and create a type-safe dataset:

val fListDS = fListVec.rdd
  .map{case Row(features: DenseVector) => features}
  .zipWithIndex()
  .map(x => features(x._1.asInstanceOf[DenseVector], x._2.toInt))  // wrap each (vector, index) pair in the case class
  .toDS
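
Since the stated goal was a join, here is a minimal sketch of how the generated rowId column can then be used; otherDF is a hypothetical second table, assumed to have been indexed with zipWithIndex in the same way:

// Hypothetical second table carrying the rows to join with, keyed by the same rowId
val otherDF = Seq(("a", 0), ("b", 1)).toDF("label", "rowId")

// Equi-join the vector table (fListrdd from the first snippet) with it on the generated rowId
val joined = fListrdd.join(otherDF, Seq("rowId"))
joined.show(false)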

Answer 1 (score: 1):

You're almost there; you just need to specify the correct vector type, DenseVector:

import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.Row

val fList = Seq(
  (Seq(2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003)),
  (Seq(0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002))
).toDF("features")

def seqToVec = udf(
  (s: Seq[Double]) => new DenseVector(s.toArray)
)

val fListVec = fList.withColumn("features", seqToVec($"features"))
// fListVec: org.apache.spark.sql.DataFrame = [features: vector]

val fListrdd = fListVec.rdd.
  map{ case Row(features: DenseVector) => features }.
  zipWithIndex.
  toDF("features", "rowId")  

fListrdd.show
// +--------------------+-----+
// |            features|rowId|
// +--------------------+-----+
// |[2.50464100000000...|    0|
// |[0.95580400000000...|    1|
// +--------------------+-----+
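
As a side note, a quick way to check which Vector class a vector column actually holds (and therefore which type to pattern-match on) is to inspect the column's user-defined type; a small sketch, assuming the same spark-shell session with fListVec defined as above:

// Prints the UDT backing the column: org.apache.spark.mllib.linalg.VectorUDT for the old
// mllib vectors, org.apache.spark.ml.linalg.VectorUDT for the new ml ones
fListVec.schema("features").dataType.getClass.getName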