I have the table below. In order to perform a join, I generate a sequence of numbers as a rowId column, but it throws the error below. What exactly am I doing wrong? Please help.
fListVec: org.apache.spark.sql.DataFrame = [features: vector]
+-----------------------------------------------------------------------------+
|features |
+-----------------------------------------------------------------------------+
|[2.5046410000000003,2.1487149999999997,1.0884870000000002,3.5877090000000003]|
|[0.9558040000000001,0.9843780000000002,0.545025,0.9979860000000002] |
+-----------------------------------------------------------------------------+
Code:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
val fListrdd = fListVec.rdd
.map{case Row(features: Vector) => features}
.zipWithIndex()
.toDF("features","rowId")
fListrdd.createOrReplaceTempView("featuresTable")
val f = spark.sql("SELECT features, rowId from featuresTable")
f.show(false)
Output:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 206.0 failed 1 times, most recent failure: Lost task 0.0 in stage 206.0 (TID 1718, localhost, executor driver): scala.MatchError: [[2.5046410000000003,2.1487149999999997,1.0884870000000002,3.5877090000000003]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
    at $$$$4896e3e877b134a87d9ee46b238e22$$$$$anonfun$1.apply(<console>:193)
    at $$$$4896e3e877b134a87d9ee46b238e22$$$$$anonfun$1.apply(<console>:193)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1762)
    at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
    at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    at org.apache.spark.rdd.ZippedWithIndexRDD.<init>(ZippedWithIndexRDD.scala:50)
    at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
    at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.zipWithIndex(RDD.scala:1292)
    ... 101 elided
Caused by: scala.MatchError: [[2.5046410000000003,2.1487149999999997,1.0884870000000002,3.5877090000000003]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
    at $$$$4896e3e877b134a87d9ee46b238e22$$$$$anonfun$1.apply(<console>:193)
    at $$$$4896e3e877b134a87d9ee46b238e22$$$$$anonfun$1.apply(<console>:193)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1762)
    at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
    at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    ... 3 more
Expected output:
features | rowId
[2.5046410000000003,...] 0
[0.9558040000000001,...] 1
Answer 0 (score: 2)
You have to write a map function in between, so that you define the dataType for the new dataframe that is to be created:
// Assumption: the vectors were created with the older mllib API (as in Answer 1 below);
// if they are org.apache.spark.ml.linalg vectors, import DenseVector from there instead.
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.Row

val fListrdd = fListVec.rdd
  .map{case Row(features) => features}
  .zipWithIndex()
  .map(x => (x._1.asInstanceOf[DenseVector], x._2.toInt))
  .toDF("features","rowId")
The only addition is the .map(x => (x._1.asInstanceOf[DenseVector], x._2.toInt)) line.
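Since the rowId column was generated for a join in the first place, here is a minimal sketch of how it could be used. The second table labelsDF and its label column are hypothetical, not part of the question; it is indexed with the same zipWithIndex trick so that the two rowId columns line up.

// Hypothetical second table (assumption), indexed the same way so that
// row n of labelsDF pairs with row n of fListrdd.
val labelsDF = Seq("a", "b").toDF("label")
  .rdd
  .map(_.getString(0))
  .zipWithIndex()
  .map(x => (x._1, x._2.toInt))
  .toDF("label", "rowId")

// Equi-join on the generated index.
val joined = fListrdd.join(labelsDF, Seq("rowId"))
joined.show(false)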
You can go one step further and create a dataset. I would personally recommend a dataset, since datasets are the type-safe and optimized form of dataframes. For that you need a case class:

case class features(features: DenseVector, rowId: Int)

and just add the word features in the solution above (wrapping the tuple in the case class), so that you can call the .toDS api and create a type-safe dataset:
val fListDS = fListVec.rdd
.map{case Row(features: DenseVector) => features}
.zipWithIndex()
.map(x => features(x._1.asInstanceOf[DenseVector], x._2.toInt))
.toDS
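As a small illustration of that type safety (my addition, not part of the answer), fields of the Dataset can be accessed with plain Scala lambdas instead of column-name strings:

// A typo in rowId here fails at compile time, whereas a misspelled
// column name in a SQL string would only fail at runtime.
fListDS.filter(_.rowId == 0).show(false)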
Answer 1 (score: 1)
You're almost there: you just need to specify the correct vector type, DenseVector:
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.Row
val fList = Seq(
(Seq(2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003)),
(Seq(0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002))
).toDF("features")
def seqToVec = udf(
(s: Seq[Double]) => new DenseVector(s.toArray)
)
val fListVec = fList.withColumn("features", seqToVec($"features"))
// fListVec: org.apache.spark.sql.DataFrame = [features: vector]
val fListrdd = fListVec.rdd.
map{ case Row(features: DenseVector) => features }.
zipWithIndex.
toDF("features", "rowId")
fListrdd.show
// +--------------------+-----+
// | features|rowId|
// +--------------------+-----+
// |[2.50464100000000...| 0|
// |[0.95580400000000...| 1|
// +--------------------+-----+
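As a closing note (my addition, not from either answer): the scala.MatchError in the question is the usual symptom of the pattern type not matching the runtime class of the column values, for example matching on org.apache.spark.ml.linalg.Vector while the column actually holds the older org.apache.spark.mllib.linalg vectors, or matching on DenseVector when some rows are sparse. A slightly more forgiving sketch, assuming the column was built with the mllib vectors as in this answer, is to match on the Vector supertype:

// Matching on the Vector trait covers both DenseVector and SparseVector rows;
// swap the import for org.apache.spark.ml.linalg if the column uses the new API.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

val fListIndexed = fListVec.rdd.
  map{ case Row(features: Vector) => features }.
  zipWithIndex.
  toDF("features", "rowId")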