How to process complex data in an ArrayType using Spark functions

Time: 2018-03-08 04:15:41

Tags: scala apache-spark apache-spark-sql apache-spark-dataset apache-spark-function

I have a JSON data source. Below is a sample row:

{
  "PrimaryAcctNumber": "account1",
  "AdditionalData": [
    {
      "Addrs": [
        "an address for account1",
        "the longest address in the address list for account1",
        "another address for account1"
      ],
      "AccountNumber": "Account1",
      "IP": 2368971684
    },
    {
      "Addrs": [
        "an address for account2",
        "the longest address in the address list for account2",
        "another address for account2"
      ],
      "AccountNumber": "Account2",
      "IP": 9864766814
    }
  ]
}

So when this is loaded into a Spark DataFrame, the schema is:

root
 |-- PrimaryAcctNumber: string (nullable = true)
 |-- AdditionalData: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- AccountNumber: string (nullable = true)
 |    |    |-- Addrs: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- IP: long (nullable = true)
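
For context, a minimal loading sketch (the file name accounts.json is hypothetical). Because the sample record above is pretty-printed across several lines, the multiLine option is needed; one-record-per-line JSON loads without it:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LongestAddress").getOrCreate()

// multiLine handles records that span several lines, like the sample above;
// for line-delimited JSON, plain spark.read.json(path) suffices.
val sourceDdf = spark.read.option("multiLine", "true").json("accounts.json")
sourceDdf.printSchema()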

Using Spark, I want to create a new column named LongestAddressOfPrimaryAccount, based on the column AdditionalData (ArrayType[StructType]), with the following logic:

  • Iterate over AdditionalData
    • If an element's AccountNumber attribute equals the row's PrimaryAcctNumber, the value of LongestAddressOfPrimaryAccount is the longest string in that element's Addrs array
    • If the AccountNumber attribute does not equal PrimaryAcctNumber, the value is "N/A"

So for the given row above, the expected output is:

{
  "PrimaryAcctNumber": "account1",
  "AdditionalData": [
    {
      "Addrs": [
        "an address for account1",
        "the longest address in the address list for account1",
        "another address for account1"
      ],
      "AccountNumber": "Account1",
      "IP": 2368971684
    },
    {
      "Addrs": [
        "an address for account2",
        "the longest address in the address list for account2",
        "another address for account2"
      ],
      "AccountNumber": "Account2",
      "IP": 9864766814
    }
  ],
  "LongestAddressOfPrimaryAccount": "the longest address in the address list for account1"
}

This could be done with a UDF or a map function, but that is not Spark best practice.

Is it feasible using only Spark built-in functions? Something like:

// longest, get_field and iterate_array_for_match are imaginary functions,
// just to illustrate the kind of built-in API I am hoping exists:
sourceDdf.withColumn("LongestAddressOfPrimaryAccount", coalesce(
  longest(
    get_field(iterate_array_for_match($"AdditionalData", "AccountNumber", $"PrimaryAcctNumber"), "Addrs")
  ),
  lit("N/A")))

1 Answer:

Answer 0 (score: 2):

If your Spark version is 2.2 or lower, you will have to write a udf function for your requirement, because a built-in-only solution would be more complex and slower (in the sense that you would need to combine several built-in functions) than a udf function. And I am not aware of any built-in function that directly satisfies your requirement.

The Databricks team is working on Nested Data Using Higher Order Functions in SQL, which should arrive in an upcoming release.
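
For reference, those higher-order functions did ship later: Spark 2.4 added SQL functions such as filter, transform, flatten and aggregate. A sketch under that assumption (not the answer's code), matching account numbers case-insensitively because the sample mixes "account1" and "Account1":

import org.apache.spark.sql.functions.{coalesce, expr, lit}

// filter keeps the entries whose AccountNumber matches PrimaryAcctNumber,
// transform + flatten collect their Addrs, and aggregate keeps the longest;
// aggregate returns the null seed when nothing matched, so coalesce yields "N/A".
val withLongest = sourceDdf.withColumn(
  "LongestAddressOfPrimaryAccount",
  coalesce(
    expr("""
      aggregate(
        flatten(transform(
          filter(AdditionalData, d -> lower(d.AccountNumber) = lower(PrimaryAcctNumber)),
          d -> d.Addrs)),
        cast(null as string),
        (acc, addr) -> if(acc is null or length(addr) > length(acc), addr, acc))
    """),
    lit("N/A")))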

Until then, if you don't want to complicate your job, you will have to write a udf function.
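
Along those lines, here is a minimal udf sketch (my own illustration, not code from the answer). Field names follow the sample JSON, and the match is case-insensitive to reproduce the expected output above:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Pick the longest address among the entries whose AccountNumber matches
// PrimaryAcctNumber (ignoring case); fall back to "N/A" when nothing matches.
val longestPrimaryAddr = udf { (primary: String, additional: Seq[Row]) =>
  Option(additional).getOrElse(Seq.empty)
    .filter(r => r.getAs[String]("AccountNumber").equalsIgnoreCase(primary))
    .flatMap(r => Option(r.getAs[Seq[String]]("Addrs")).getOrElse(Seq.empty))
    .reduceOption((a, b) => if (a.length >= b.length) a else b)
    .getOrElse("N/A")
}

val result = sourceDdf.withColumn(
  "LongestAddressOfPrimaryAccount",
  longestPrimaryAddr(col("PrimaryAcctNumber"), col("AdditionalData")))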