Adding a column of random values to a Spark DataFrame

Date: 2018-01-04 11:12:29

Tags: scala apache-spark apache-spark-sql

When I rename columns of a DataFrame in Spark 2.2 and then print its contents with show(), I get the following errors:

18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'cluster' is backed by an array but the associated Spark Schema does not reflect this;
              (use es.read.field.as.array.include/exclude) 
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'project' is backed by an array but the associated Spark Schema does not reflect this;
              (use es.read.field.as.array.include/exclude) 
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'client' is backed by an array but the associated Spark Schema does not reflect this;
              (use es.read.field.as.array.include/exclude) 
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'twitter_mentioned_user' is backed by an array but the associated Spark Schema does not reflect this;
              (use es.read.field.as.array.include/exclude) 
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'author' is backed by an array but the associated Spark Schema does not reflect this;
              (use es.read.field.as.array.include/exclude) 
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'cluster' is backed by an array but the associated Spark Schema does not reflect this;
              (use es.read.field.as.array.include/exclude) 

18/01/04 12:05:37 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 7)
scala.MatchError: Buffer(13145439) (of class scala.collection.convert.Wrappers$JListWrapper)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
    at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:61)
    at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:58)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
Caused by: scala.MatchError: Buffer(13145439) (of class scala.collection.convert.Wrappers$JListWrapper)

I printed the schema, and it looks like this:

df_processed
  .withColumn("srcId", toInt(df_processed("srcId")))
  .withColumn("dstId", toInt(df_processed("dstId")))
  .withColumn("attr", rand).printSchema()

Output:

root
 |-- srcId: integer (nullable = true)
 |-- dstId: integer (nullable = true)
 |-- attr: double (nullable = false)

Running this code produces the error:

df_processed
  .withColumn("srcId", toInt(df_processed("srcId")))
  .withColumn("dstId", toInt(df_processed("dstId")))
  .withColumn("attr", rand).show()

This happens when I add .withColumn("attr", rand), but it works when I use .withColumn("attr2", lit(0)) instead.

Update

df_processed.printSchema()
root
 |-- srcId: double (nullable = true)
 |-- dstId: double (nullable = true)

df_processed.show() runs without errors.
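The warnings above point at the es.read.field.as.array.include setting of the elasticsearch-spark connector. A minimal sketch of how the read could be configured with it, assuming the data is loaded from Elasticsearch (the resource name "my-index/doc" and the variable name raw are placeholders; the field list comes from the warnings):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("es-read").getOrCreate()

// Declare the array-backed fields so the inferred schema matches the data.
val raw = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.read.field.as.array.include",
    "cluster,project,client,twitter_mentioned_user,author")
  .load("my-index/doc")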

2 Answers:

Answer 0 (score: 0)

Here is an example similar to what you are trying to do; to convert the data types you can use the cast function:

import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.types.IntegerType

// Assumes a SparkSession named `spark` is in scope.
import spark.implicits._

val ds = Seq(
  (1.2, 3.5),
  (1.2, 3.5),
  (1.2, 3.5)
).toDF("srcId", "dstId")

ds.withColumn("srcId", $"srcId".cast(IntegerType))
  .withColumn("dstId", $"dstId".cast(IntegerType))
  .withColumn("attr", rand)

Hope this helps!

Answer 1 (score: 0)

You can add a UDF function to do the conversion instead of cast.

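A minimal sketch of what such a toInt UDF could look like, assuming the source columns are doubles as in the question's schema (the UDF body here is an assumption; the question does not show its toInt definition):

import org.apache.spark.sql.functions.udf

// Hypothetical UDF: truncate a double value to an integer.
val toInt = udf((x: Double) => x.toInt)

df_processed
  .withColumn("srcId", toInt(df_processed("srcId")))
  .withColumn("dstId", toInt(df_processed("dstId")))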