Question

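I'm using Spark SQL DataFrames to implement a generic data integration component. The basic idea: the user configures fields by naming them and mapping them with simple SQL fragments (anything that can appear in a select clause), and the component adds these columns and groups them into struct fields (using struct from the Column DSL).

Later processing takes some of those struct fields and groups them into an array. At this point I ran into a problem caused by a field that is nullable in one tuple but not nullable in another.

Because the fields are grouped in structs, I can extract the struct type, modify it, and apply it back to the whole tuple using the Column.cast method. I'm not sure whether this approach would work for top-level fields (incidentally, the SQL cast syntax does not allow specifying fields and nullability).

My question: is there a better way to achieve this? Something like a nullable() function that can be applied to an expression to mark it as nullable, similar to how cast works.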

Sample code:

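// Assumes an active SparkSession named `spark` (e.g. in spark-shell)
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.DataTypes
import spark.implicits._

val df = (1 to 8).map(x => (x, x + 1)).toDF("x", "y")
val df6 = df.select(
      // struct1: both fields come from non-null Int columns, so neither is nullable
      functions.struct( $"x" + 1 as "x1", $"y" + 1 as "y1" ) as "struct1",
      // struct2: y1 is a typed null literal, so it is nullable
      functions.struct( $"x" + 1 as "x1", functions.lit(null).cast( DataTypes.IntegerType ) as "y1" ) as "struct2"
    )
val df7 = df6.select( functions.array($"struct1", $"struct2") as "arr" )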

This fails with the following exception:

org.apache.spark.sql.AnalysisException: cannot resolve 'array(struct1,struct2)' due to data type mismatch: input to function array should all be the same type, but it's [struct, struct];

The fix looks like this:

//val df7 = df6.select( functions.array($"struct1", $"struct2") as "arr" )
val df7 = df6.select( functions.array($"struct1"  cast df6.schema("struct2").dataType, $"struct2" ) as "arr" )
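The fix above borrows the type of struct2, which only helps when one of the structs already has the desired nullability. A more general variant of the same "extract the type, modify it, cast it back" idea is sketched below. The helper name asNullable is hypothetical (it is not part of the public Spark API), and the sketch assumes only the case classes in org.apache.spark.sql.types.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not a Spark built-in): returns a copy of the given
// data type in which every struct field, array element and map value is
// marked as nullable. Atomic types are returned unchanged.
def asNullable(dt: DataType): DataType = dt match {
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = asNullable(f.dataType), nullable = true)))
  case ArrayType(elementType, _) =>
    ArrayType(asNullable(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(asNullable(keyType), asNullable(valueType), valueContainsNull = true)
  case other => other
}
```

With this in place, each struct could be cast to the relaxed version of its own type before building the array, for example $"struct1".cast(asNullable(df6.schema("struct1").dataType)), so that neither side has to know about the other. Casting only widens nullability here, which is the direction the original fix already relies on.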

Answer 1

You can make this a little cleaner by creating a udf that returns Option[Int]:

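import org.apache.spark.sql.functions.{array, lit, struct, udf}
import org.apache.spark.sql.types.IntegerType

// Returning Option[Int] makes the resulting column nullable
val optionInt = udf[Option[Int],Int](i => Option(i))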

Then use optionInt($"y" + 1) when creating y1 for struct1. Everything else stays the same (though edited here for brevity).

val df6 = df.select(
  struct($"x" + 1 as "x1", optionInt($"y" + 1) as "y1" ) as "struct1",
  struct($"x" + 1 as "x1", lit(null).cast(IntegerType) as "y1" ) as "struct2"
)

After that, df6.select(array($"struct1", $"struct2") as "arr" ) works fine.
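For the nullable()-style function the question asks about, one possible alternative that avoids a udf is sketched here. It is an assumption on my part rather than something from the answer above: the helper name nullable is hypothetical, and the sketch relies on Spark treating a CASE WHEN expression without an ELSE branch as nullable.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.when

// Hypothetical helper (not a Spark built-in): wraps the column in
// CASE WHEN c IS NOT NULL THEN c END. The values are unchanged, but with
// no ELSE branch the expression is reported as nullable in the schema.
def nullable(c: Column): Column = when(c.isNotNull, c)
```

It would be used in place of optionInt, for example struct($"x" + 1 as "x1", nullable($"y" + 1) as "y1") as "struct1". Checking df6.printSchema() is a quick way to confirm that both structs end up with the same field nullability on your Spark version.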
