Error converting a Vector to a DataFrame
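The code mentioned in section 1 works fine, but it is a non-intuitive way to convert vector data into a DataFrame.
I would like to solve this the way I know, i.e. with the code mentioned in section 2. Could you help?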
val rdd = sc.parallelize(data).map(a => Row(a))
rdd.take(1)
val fields = "features".split(" ").map(fields => StructField(fields,DoubleType, nullable =true))
val df = spark.createDataFrame(rdd, StructType(fields))
df.count()
Can't we do it as in the code above?
df: org.apache.spark.sql.DataFrame = [features: double]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 357.0 failed 4 times, most recent failure: Lost task 1.3 in stage 357.0 (TID 1243, datacouch, executor 3): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: org.apache.spark.ml.linalg.DenseVector is not a valid external type for schema of double
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, features), DoubleType) AS features#6583
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:586)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:586)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
But I get the error shown above.
Answer 0 (score: 1)
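As clearly explained in "VectorUDT usage", the correct DataType for a Vector is org.apache.spark.ml.linalg.SQLDataTypes.VectorType. The schema in the question declares the column as DoubleType, which is why the encoder rejects the DenseVector values:
spark.createDataFrame(
  rdd,
  StructType(Seq(
    StructField("features", org.apache.spark.ml.linalg.SQLDataTypes.VectorType)
  ))
)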
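For reference, here is a minimal end-to-end sketch of declaring the column as VectorType rather than DoubleType. It assumes Spark 2.x or later with spark-mllib on the classpath and is written to be pasted into spark-shell or run as a Scala script. The sample vectors are hypothetical stand-ins for the question's `data`, whose contents are not shown, and the local session settings are illustrative only.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType}

// Assumption: a local session for illustration. In spark-shell,
// getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("vector-to-dataframe")
  .getOrCreate()

// Hypothetical sample data standing in for the question's `data`.
val data: Seq[Vector] = Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)
)

// Wrap each vector in a Row, as the question does.
val rdd = spark.sparkContext.parallelize(data).map(v => Row(v))

// Declare the single column as VectorType so the schema matches
// the DenseVector values actually stored in each Row.
val schema = StructType(Seq(
  StructField("features", VectorType, nullable = true)
))

val df = spark.createDataFrame(rdd, schema)
df.printSchema()
println(df.count())
```

With the schema matching the row contents, actions such as count() should no longer hit the "not a valid external type for schema of double" encoding error.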