Spark 1.6: explode with null values

Time: 2017-10-20 15:49:45

Tags: scala apache-spark spark-dataframe

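I have a DataFrame that I am trying to flatten. As part of the process I want to explode it, so that if I have a column of arrays, each value of the array is used to create a separate row. I know I can use the explode function. However, my problem is that the column contains null values and I am using Spark 1.6. Here is an example of the data and of what I want:
My data: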

id | ListOfRficAction| RficActionAttachment
_______________________________
1  | Luke            | [baseball, soccer]
2  | Lucy            | null

What I want:

id | ListOfRficAction| RficActionAttachment
_______________________________
1  | Luke            | baseball
1  | Luke            | soccer
2  | Lucy            | null

I am using Spark 1.6 (so I cannot use the explode_outer function). I tried explode, but I get the following error:

 scala.MatchError: [null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

I also tried:

df.withColumn("likes", explode(
  when(col("likes").isNotNull, col("likes"))
    // If null explode an array<string> with a single null
    .otherwise(array(lit(null).cast("string")))))

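But my DataFrame schema is complex (the array elements are structs with string and long fields), so the cast to string does not work. Here is part of my schema and the error I get: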

 |-- RficActionAttachment: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ActivityFileAutoUpdFlg: string (nullable = true)
 |    |    |-- ActivityFileDate: string (nullable = true)
 |    |    |-- ActivityFileDeferFlg: string (nullable = true)
 |    |    |-- ActivityFileDockReqFlg: string (nullable = true)
 |    |    |-- ActivityFileDockStatFlg: string (nullable = true)
 |    |    |-- ActivityFileExt: string (nullable = true)
 |    |    |-- ActivityFileName: string (nullable = true)
 |    |    |-- ActivityFileRev: string (nullable = true)
 |    |    |-- ActivityFileSize: long (nullable = true)
 |    |    |-- ActivityFileSrcPath: string (nullable = true)
 |    |    |-- ActivityFileSrcType: string (nullable = true)
 |    |    |-- ActivityId: string (nullable = true)
 |    |    |-- AttachmentId: string (nullable = true)
 |    |    |-- Comment: string (nullable = true)

User class threw exception:

org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN isnotnull(ListOfRficAction.RficAction.ListOfRficActionAttachment.RficActionAttachment) THEN ListOfRficAction.RficAction.ListOfRficActionAttachment.RficActionAttachment ELSE array(ListOfRficAction.RficAction.ListOfRficActionAttachment.RficActionAttachment)' 

due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;

Any idea what I can do?

1 Answer:

Answer 0: (score: 1)

First replace all null values in the column with array(null), then use explode. Using the example DataFrame from the question:

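import org.apache.spark.sql.functions.{array, explode, lit, when}
import sqlContext.implicits._  // for toDF and $"..." (already in scope in spark-shell)

val df = Seq((1, "Luke", Array("baseball", "soccer")), (2, "Lucy", null))
  .toDF("id", "ListOfRficAction", "RficActionAttachment")

// Replace null arrays with a one-element array holding null, then explode
df.withColumn("RficActionAttachment", 
    when($"RficActionAttachment".isNull, array(lit(null)))
    .otherwise($"RficActionAttachment"))
  .withColumn("RficActionAttachment", explode($"RficActionAttachment"))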

This gives the requested result:

+---+----------------+--------------------+
| id|ListOfRficAction|RficActionAttachment|
+---+----------------+--------------------+
|  1|            Luke|            baseball|
|  1|            Luke|              soccer|
|  2|            Lucy|                null|
+---+----------------+--------------------+
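For the array-of-struct column in the question, one possible variation is to cast the placeholder null to the array's own element type, read from the schema, so that both branches of the when/otherwise expression have the same type. The sketch below is an assumption-laden illustration rather than a confirmed fix for Spark 1.6: ExplodeNulls and explodeKeepingNulls are hypothetical names, not part of Spark, and the helper assumes the array is a top-level column (a nested path such as the one in the error message would need to be selected into its own column first).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, when}
import org.apache.spark.sql.types.ArrayType

object ExplodeNulls {
  // Hypothetical helper, not part of Spark: explodes `column` while keeping
  // rows whose array is null. Assumes `column` is a top-level array column.
  def explodeKeepingNulls(df: DataFrame, column: String): DataFrame = {
    // Look up the element type of the array in the DataFrame schema.
    val elementType = df.schema(column).dataType match {
      case ArrayType(elemType, _) => elemType
      case other =>
        throw new IllegalArgumentException(
          s"Column '$column' has type $other, expected an array")
    }
    // One-element array holding a null of the element type, so the
    // replacement has the same type as the original column.
    val singleNull = array(lit(null).cast(elementType))
    df.withColumn(
      column,
      explode(when(col(column).isNull, singleNull).otherwise(col(column))))
  }
}
```

With the example DataFrame above, this would be called as ExplodeNulls.explodeKeepingNulls(df, "RficActionAttachment"). For a plain array of strings it reduces to the cast("string") attempt from the question.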