Question

I am trying to flatten the schema of an existing DataFrame that has nested fields. My DataFrame has the following structure:

root  
|-- Id: long (nullable = true)  
|-- Type: string (nullable = true)  
|-- Uri: string (nullable = true)    
|-- Type: array (nullable = true)  
|    |-- element: string (containsNull = true)  
|-- Gender: array (nullable = true)  
|    |-- element: string (containsNull = true)

Type and Gender can each contain an array of elements, a single element, or null. I tried the following code:

var resDf = df.withColumn("FlatType", explode(df("Type")))

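But the resulting DataFrame loses every row whose Type column is null. For example, if I have 10 rows, with Type null in 7 of them and non-null in 3, then after explode the resulting DataFrame has only 3 rows.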

How can I keep the rows with null values while still exploding the arrays?

I found a workaround of sorts, but I am still stuck in one place. For standard types we can do the following:

def customExplode(df: DataFrame, field: String, colType: String): org.apache.spark.sql.Column = {
  var exploded = None: Option[org.apache.spark.sql.Column]
  colType.toLowerCase() match {
    case "string" =>
      val avoidNull = udf((column: Seq[String]) =>
        if (column == null) Seq[String](null)
        else column)
      exploded = Some(explode(avoidNull(df(field))))
    case "boolean" =>
      val avoidNull = udf((xs: Seq[Boolean]) =>
        if (xs == null) Seq[Boolean]()
        else xs)
      exploded = Some(explode(avoidNull(df(field))))
    case _ => exploded = Some(explode(df(field)))
  }
  exploded.get
}

And then use it like this:

val explodedField = customExplode(resultDf, fieldName, fieldTypeMap(field))
resultDf = resultDf.withColumn(newName, explodedField)
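
The per-type UDFs above could probably be replaced by a single helper that reads the element type from the schema. The following is only a sketch under stated assumptions: `ExplodeHelpers` and `explodeKeepingNulls` are hypothetical names, the helper handles top-level array columns only, and it has not been checked across Spark versions. It swaps a null array for a one-element array holding a typed null before calling `explode`. Empty (non-null) arrays would still produce no rows.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{array, coalesce, explode, lit}
import org.apache.spark.sql.types.ArrayType

object ExplodeHelpers {
  // Hypothetical helper: explode a top-level array column while keeping
  // the rows whose array is null.
  def explodeKeepingNulls(df: DataFrame, field: String): Column =
    df.schema(field).dataType match {
      case ArrayType(elementType, _) =>
        // Replace a null array with a one-element array containing a null of
        // the matching element type, so explode emits one row with a null value.
        val nullElement = array(lit(null).cast(elementType))
        explode(coalesce(df(field), nullElement))
      case other =>
        throw new IllegalArgumentException(s"$field is not an array column: $other")
    }
}
```

It would be called in the same way as `customExplode`, for example `resultDf.withColumn(newName, ExplodeHelpers.explodeKeepingNulls(resultDf, fieldName))`, without needing the type map.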

However, I run into a problem with struct types, for a structure like the following:

 |-- Address: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- AddressType: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true) 
 |    |    |-- DEA: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Number: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- ExpirationDate: array (nullable = true)
 |    |    |    |    |    |-- element: timestamp (containsNull = true)
 |    |    |    |    |-- Status: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)

How can we handle this kind of schema when DEA is null?

Thanks in advance.

P.S. I tried using a lateral view, but the result is the same.

Answer 1

Maybe you could try using when:

val resDf = df.withColumn("FlatType", when(df("Type").isNotNull, explode(df("Type"))))

As the documentation for the when function states, null is inserted for values that do not match the condition.
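
If your Spark version rejects `explode` nested inside `when` (generator functions generally have to sit at the top level of a projection), an alternative worth trying is `explode_outer`, available in `org.apache.spark.sql.functions` since Spark 2.2. It emits a row with null when the array is null or empty. The sketch below is illustrative only: the sample data, the `local[*]` session, and the simplified case classes `Dea`, `Address`, and `Rec` are assumptions and not the asker's real schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode_outer}

// Simplified, hypothetical stand-ins for the nested schema in the question.
case class Dea(Number: Seq[String], Status: Seq[String])
case class Address(AddressType: Seq[String], DEA: Option[Seq[Dea]])
case class Rec(Id: Long, Type: Option[Seq[String]], Address: Option[Seq[Address]])

object ExplodeOuterSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-outer-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Rec(1L, Some(Seq("a", "b")), Some(Seq(Address(Seq("home"), None)))),
      Rec(2L, None, None)
    ).toDF()

    // Flat array: the row with Id 2 is kept, with FlatType = null.
    val flatType = df.withColumn("FlatType", explode_outer(col("Type")))
    flatType.show()

    // Nested arrays: explode one level at a time. A null Address or a null DEA
    // yields a null in the new column instead of dropping the row.
    val flatDea = df
      .withColumn("FlatAddress", explode_outer(col("Address")))
      .withColumn("FlatDEA", explode_outer(col("FlatAddress.DEA")))
    flatDea.show(truncate = false)

    spark.stop()
  }
}
```

Exploding one level per `withColumn` keeps each step easy to inspect. The inner fields of DEA (Number, ExpirationDate, Status) could be flattened the same way by selecting them from the exploded struct.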
