Adding a column of random values to a Spark DataFrame

Date: 2018-01-04 11:12:29

Tags: scala apache-spark apache-spark-sql

When I rename columns of a DataFrame in Spark 2.2 and then print its contents with show(), I get the following errors:

18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'cluster' is backed by an array but the associated Spark Schema does not reflect this;
              (use es.read.field.as.array.include/exclude) 
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'project' is backed by an array but the associated Spark Schema does not reflect this;
              (use es.read.field.as.array.include/exclude) 
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'client' is backed by an array but the associated Spark Schema does not reflect this;
              (use es.read.field.as.array.include/exclude) 
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'twitter_mentioned_user' is backed by an array but the associated Spark Schema does not reflect this;
              (use es.read.field.as.array.include/exclude) 
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'author' is backed by an array but the associated Spark Schema does not reflect this;
              (use es.read.field.as.array.include/exclude) 
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'cluster' is backed by an array but the associated Spark Schema does not reflect this;
              (use es.read.field.as.array.include/exclude) 

18/01/04 12:05:37 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 7)
scala.MatchError: Buffer(13145439) (of class scala.collection.convert.Wrappers$JListWrapper)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
    at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:61)
    at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:58)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
Caused by: scala.MatchError: Buffer(13145439) (of class scala.collection.convert.Wrappers$JListWrapper)

I printed the schema, and it looks like this:

df_processed
  .withColumn("srcId", toInt(df_processed("srcId")))
  .withColumn("dstId", toInt(df_processed("dstId")))
  .withColumn("attr", rand).printSchema()

Output:

root
 |-- srcId: integer (nullable = true)
 |-- dstId: integer (nullable = true)
 |-- attr: double (nullable = false)

Running this code produces the error:

df_processed
  .withColumn("srcId", toInt(df_processed("srcId")))
  .withColumn("dstId", toInt(df_processed("dstId")))
  .withColumn("attr", rand).show()

This happens when I add .withColumn("attr", rand), but it works when I use .withColumn("attr2", lit(0)) instead.

Update

df_processed.printSchema()
root
 |-- srcId: double (nullable = true)
 |-- dstId: double (nullable = true)

df_processed.show() runs without errors.
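The warnings above point at the es.read.field.as.array.include setting of the elasticsearch-spark connector. A minimal sketch of how the read could be configured with it, assuming the data is loaded from Elasticsearch (the resource name "my-index/doc" and the variable name raw are placeholders; the field list comes from the warnings):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("es-read").getOrCreate()

// Declare the array-backed fields so the inferred schema matches the data.
val raw = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.read.field.as.array.include",
    "cluster,project,client,twitter_mentioned_user,author")
  .load("my-index/doc")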

2 Answers:

Answer 0 (score: 0)

Here is an example similar to what you are trying to do; to convert the data types you can use the cast function:

import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.types.IntegerType

// Assumes a SparkSession named `spark` is in scope.
import spark.implicits._

val ds = Seq(
  (1.2, 3.5),
  (1.2, 3.5),
  (1.2, 3.5)
).toDF("srcId", "dstId")

ds.withColumn("srcId", $"srcId".cast(IntegerType))
  .withColumn("dstId", $"dstId".cast(IntegerType))
  .withColumn("attr", rand)

Hope this helps!

Answer 1 (score: 0)

You can add a UDF function to do the conversion instead of cast.

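A minimal sketch of what such a toInt UDF could look like, assuming the source columns are doubles as in the question's schema (the UDF body here is an assumption; the question does not show its toInt definition):

import org.apache.spark.sql.functions.udf

// Hypothetical UDF: truncate a double value to an integer.
val toInt = udf((x: Double) => x.toInt)

df_processed
  .withColumn("srcId", toInt(df_processed("srcId")))
  .withColumn("dstId", toInt(df_processed("dstId")))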