I was migrating my code from Spark 2.0 to 2.1 when I stumbled upon a problem related to saving a DataFrame.
Here is the code:
import org.apache.spark.sql.types._
import org.apache.spark.ml.linalg.VectorUDT
// Build a one-column DataFrame and assemble it into an ML vector column.
val df = spark.createDataFrame(Seq(Tuple1(1))).toDF("values")
val toSave = new org.apache.spark.ml.feature.VectorAssembler().setInputCols(Array("values")).transform(df)
toSave.write.csv(path)
With Spark 2.0.0, this code succeeds.
With Spark 2.1.0.cloudera1, I get the following error:
java.lang.UnsupportedOperationException: CSV data source does not support struct<type:tinyint,size:int,indices:array<int>,values:array<double>> data type.
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.org$apache$spark$sql$execution$datasources$csv$CSVFileFormat$$verifyType$1(CSVFileFormat.scala:233)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$verifySchema$1.apply(CSVFileFormat.scala:237)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$verifySchema$1.apply(CSVFileFormat.scala:237)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:96)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.verifySchema(CSVFileFormat.scala:237)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.prepareWrite(CSVFileFormat.scala:121)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:484)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:520)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:579)
... 50 elided
Is this just on my side?
Is this related to the Cloudera build of Spark 2.1? (Judging from their repo, it seems they did not mess with spark.sql, so maybe not.)
Thanks!
Answer 0 (score: 3)
The following answer is compiled from @zero323's comments.
The CSV source does not support complex objects. As you can see from the exception — CSV data source does not support struct<type:tinyint,size:int,indices:array<int>,values:array<double>> data type — this is expected behavior. It does not work with Spark 2.x, although it used to work in 1.x with spark-csv, where vectors were converted to strings.
This behavior is correct per the following JIRA: SPARK-16216.
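For reference, a minimal workaround sketch that keeps the CSV target: reproduce the old 1.x behavior by stringifying the vector column with a UDF before writing. This assumes the assembler output column was named "features" via setOutputCol (the question's code leaves it auto-generated), and the vectorToString helper is purely illustrative, not part of the Spark API:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Illustrative UDF: serialize the ML vector to its string form, e.g. "[1.0]".
val vectorToString = udf((v: Vector) => v.toString)

// Replace the struct-typed vector column with a plain string column, then write.
val writable = toSave.withColumn("features", vectorToString(col("features")))
writable.write.csv(path)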
Answer 1 (score: -1)
As a workaround, you can use the VectorDisassembler class from this fork, or adopt the solution described here.
I used VectorDisassembler to store the resulting DataFrame of the ml.feature.StandardScaler.fit method into a CSV.
The code looks roughly like this:
// VectorDisassembler comes from the fork linked above, not the standard Spark ML API;
// it expands a vector column into one scalar column per element.
val disassembler = new org.apache.spark.ml.feature.VectorDisassembler()
val disassembledDF = disassembler.setInputCol("scaledFeatures").transform(df)
disassembledDF.show()
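Assuming the disassembled columns are all primitive types, the result can then be saved with the regular CSV writer, which is the goal of the original question:
// All columns are now primitive types, so the CSV source accepts them.
disassembledDF.write.csv(path)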