I have a JSON data source. Here is one sample row:
{
  "PrimaryAcctNumber": "account1",
  "AdditionalData": [
    {
      "Addrs": [
        "an address for account1",
        "the longest address in the address list for account1",
        "another address for account1"
      ],
      "AccountNumber": "Account1",
      "IP": 2368971684
    },
    {
      "Addrs": [
        "an address for account2",
        "the longest address in the address list for account2",
        "another address for account2"
      ],
      "AccountNumber": "Account2",
      "IP": 9864766814
    }
  ]
}
So when it is loaded into a Spark DataFrame, the schema is:
root
 |-- PrimaryAcctNumber: string (nullable = true)
 |-- AdditionalData: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- AccountNumber: string (nullable = true)
 |    |    |-- Addrs: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- IP: long (nullable = true)
I want to use Spark to create a new column named LongestAddressOfPrimaryAccount, based on the AdditionalData (ArrayType[StructType]) column, with the following logic:

- If an element of AdditionalData has an AccountNumber property equal to the row's PrimaryAcctNumber, then the value of LongestAddressOfPrimaryAccount is the longest string in that element's Addrs array.
- If no element's AccountNumber property equals PrimaryAcctNumber, the value is "N/A".

So for the row given above, the expected output is:
{
  "PrimaryAcctNumber": "account1",
  "AdditionalData": [
    {
      "Addrs": [
        "an address for account1",
        "the longest address in the address list for account1",
        "another address for account1"
      ],
      "AccountNumber": "Account1",
      "IP": 2368971684
    },
    {
      "Addrs": [
        "an address for account2",
        "the longest address in the address list for account2",
        "another address for account2"
      ],
      "AccountNumber": "Account2",
      "IP": 9864766814
    }
  ],
  "LongestAddressOfPrimaryAccount": "the longest address in the address list for account1"
}
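To make the required logic concrete, here is a minimal plain-Python sketch of the per-row computation (no Spark involved). Note that the sample row matches PrimaryAcctNumber "account1" against AccountNumber "Account1", so a case-insensitive comparison is assumed here; that assumption is only inferred from the sample data.

```python
def longest_address_of_primary(primary_acct_number, additional_data):
    """Per-row logic: find the element whose AccountNumber matches
    PrimaryAcctNumber and return the longest string in its Addrs list.
    The case-insensitive comparison is an assumption, based on the sample
    row matching "account1" against "Account1"."""
    for element in additional_data or []:
        if element["AccountNumber"].lower() == primary_acct_number.lower():
            return max(element["Addrs"], key=len)
    return "N/A"

row = {
    "PrimaryAcctNumber": "account1",
    "AdditionalData": [
        {"Addrs": ["an address for account1",
                   "the longest address in the address list for account1",
                   "another address for account1"],
         "AccountNumber": "Account1", "IP": 2368971684},
        {"Addrs": ["an address for account2",
                   "the longest address in the address list for account2",
                   "another address for account2"],
         "AccountNumber": "Account2", "IP": 9864766814},
    ],
}

print(longest_address_of_primary(row["PrimaryAcctNumber"], row["AdditionalData"]))
# -> the longest address in the address list for account1
```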
I could use a UDF or the map function, but that is not Spark best practice.
Is this achievable using only Spark's built-in functions? Something like:
sourceDdf.withColumn("LongestAddressOfPrimaryAccount", coalesce(
longest(
get_field(iterate_array_for_match($"AdditionalData", "AccountNumber", $"PrimaryAcctNumber"), "Addrs")
)
, lit("N/A")))
Answer (score: 2)
If your Spark version is 2.2 or lower, you will have to write a udf function for this requirement, because doing it with built-in functions alone would be more complex and slower (in the sense that you would need to combine many built-in functions) than using a udf function. And I am not aware of any built-in function that meets your requirement directly.

The Databricks team is working on Nested Data Using Higher Order Functions in SQL, which should land in an upcoming release. Until then, unless you want to complicate your job, you will have to write a udf function.
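Following the answer's recommendation, here is a hedged sketch of what the udf approach could look like. Only the pure-Python row function is exercised below; the PySpark registration shown in the comments is illustrative wiring (it assumes a SparkSession and the question's DataFrame, here called `sourceDf`), and the case-insensitive match is an assumption inferred from the sample row.

```python
def longest_of_primary(primary, additional):
    """Row function to wrap in a UDF: return the longest Addrs string of
    the element whose AccountNumber equals the row's PrimaryAcctNumber
    (compared case-insensitively, as the sample row suggests), or "N/A"."""
    for elem in additional or []:
        if elem["AccountNumber"].lower() == (primary or "").lower():
            return max(elem["Addrs"], key=len)
    return "N/A"

# Sketch of the PySpark wiring (not executed here; assumes a SparkSession
# and the DataFrame `sourceDf` loaded from the JSON source):
#
#   from pyspark.sql import functions as F
#   from pyspark.sql.types import StringType
#
#   longest_udf = F.udf(longest_of_primary, StringType())
#   result = sourceDf.withColumn(
#       "LongestAddressOfPrimaryAccount",
#       longest_udf(F.col("PrimaryAcctNumber"), F.col("AdditionalData")),
#   )
```

Since the UDF operates on one row's PrimaryAcctNumber and AdditionalData at a time, the whole matching-and-longest logic stays in ordinary Python, which is exactly the simplicity the answer is arguing for.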