JSON parsing in a Spark UDF produces unexpected output

Time: 2017-09-23 14:47:00

Tags: json parsing apache-spark

I have a DataFrame.

All columns in this DataFrame have the string data type. Some of the columns hold JSON strings.

 +--------+---------+--------------------------+
 |event_id|event_key|              rights      |
 +--------+---------+--------------------------+
 |     410|(default)|{"conditions":[{"devic...|
 +--------+---------+--------------------------+

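I want to parse only the JSON string, extract one value from it, and add that value as a new column. I am using the Jackson parser (through json4s) to do this.

Below is the value of the "rights" column:
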
 {
"conditions": [
    {
        "devices": [
            {
                "connection": [
                    "BROADBAND",
                    "MOBILE"
                ],
                "platform": "IOS",
                "type": "MOBILE",
                "provider": "TELETV"
            },
            {
                "connection": [
                    "BROADBAND",
                    "MOBILE"
                ],
                "platform": "ANDROID",
                "type": "MOBILE",
                "provider": "TELETV"
            },
            {
                "connection": [
                    "BROADBAND",
                    "MOBILE"
                ],
                "platform": "IOS",
                "type": "TABLET",
                "provider": "TELETV"
            },
            {
                "connection": [
                    "BROADBAND",
                    "MOBILE"
                ],
                "platform": "ANDROID",
                "type": "TABLET",
                "provider": "TELETV"
            }
        ],
        "endDateTime": "2017-01-09T22:59:59.000Z",
        "inclusiveGeoTerritories": [
            "DE",
            "IT",
            "ZZ"
        ],
        "mediaType": "Linear",
        "offers": [
            {
                "endDateTime": "2017-01-09T22:59:59.000Z",
                "isRestartable": true,
                "isRecordable": true,
                "isCUTVable": false,
                "recordingMode": "UNIQUE",
                "retentionCUTV": "P7DT2H",
                "retentionNPVR": "P2Y6M5DT12H35M30S",
                "offerId": "MOTOGP-RACE",
                "offerType": "IPPV",
                "startDateTime": "2017-01-09T17:00:00.000Z"
            }
        ],
        "platformName": "USA",
        "startDateTime": "2017-01-09T17:00:00.000Z",
        "territory": "USA"
    }
 ]
}

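Now I want to create a new column in the existing DataFrame. The new column is named "provider", and its value comes from this path in the JSON: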

 conditions -> devices -> provider

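I want to do this for every row in the DataFrame. So I created a UDF and passed it the column that holds the JSON string. Inside the UDF I want to parse the JSON string and return the value as a string.

My Spark code: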

 import org.apache.spark.sql.functions._
 import org.json4s._
 import org.json4s.jackson.JsonMethods
 import org.json4s.jackson.JsonMethods._

 // ... some code to derive the base DataFrame `df` ...

 // UDF body: parse the JSON string and look up the "provider" key
 def fetchProvider(jsonStr: String): String = {
   val json = JsonMethods.parse(jsonStr)
   val providerData = json \\ "conditions" \\ "devices" \\ "provider"
   compact(render(providerData))
 }

 // Wrap the function as a UDF and add its result as the "provider" column
 val fetchProvider_udf = udf(fetchProvider _)
 val result = df.withColumn("provider", fetchProvider_udf(col("rights")))
 result.select("event_id", "event_key", "rights", "provider").show(10)

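How should I handle the case where a key along the path is missing? Will it throw an exception? Say "conditions" and "devices" are present, but the "provider" key is not in the JSON string. How do I handle that?

Can someone help me?

Expected output: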

 +--------+---------+-------------------------+--------+
 |event_id|event_key|                   rights|provider|
 +--------+---------+-------------------------+--------+
 |     410|(unknown)|{"conditions":[{"devic...|  TELETV|
 +--------+---------+-------------------------+--------+

But I get the following output:

 +--------+---------+-------------------------+---------------------------------------------------------------------------------+
 |event_id|event_key|                   rights|                                                                         provider|
 +--------+---------+-------------------------+---------------------------------------------------------------------------------+
 |     410|(unknown)|{"conditions":[{"devic...|{"provider":"TELETV","provider":"TELETV","provider":"TELETV","provider":"TELETV"}|
 +--------+---------+-------------------------+---------------------------------------------------------------------------------+

1 Answer:

Answer 0: (score: 0)

If you want to extract the value of the first provider, use the following in the UDF:

(json \\ "conditions" \\ "devices")(0) \\ "provider"

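The current code collects all providers (as a JSON object with one "provider" field per device) and then renders that object to a string as the UDF result.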
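The following is a minimal sketch of pulling out only the first provider as a plain string. It assumes json4s 3.x behavior, where `\\` returns the matched value itself when exactly one field matches and a `JObject` of all matches otherwise. The object name `ProviderSketch`, the helper `firstProvider`, and the shortened sample string are made up for illustration and are not part of the original question.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object ProviderSketch {
  // Hypothetical, shortened sample with two devices (not the full "rights" value).
  val sample: String =
    """{"conditions":[{"devices":[
      |{"platform":"IOS","provider":"TELETV"},
      |{"platform":"ANDROID","provider":"TELETV"}]}]}""".stripMargin

  // Returns the provider of the first device as a plain string, if present.
  // Assumes a single "conditions" entry, so that `\\ "devices"` yields a JArray.
  def firstProvider(jsonStr: String): Option[String] = {
    val devices = parse(jsonStr) \\ "conditions" \\ "devices"
    devices match {
      case JArray(first :: _) =>
        // On a single device object, `\\ "provider"` has at most one match.
        first \\ "provider" match {
          case JString(p) => Some(p) // unwrap, so no surrounding JSON quotes
          case _          => None    // key missing or not a string
        }
      case _ => None // no devices array, or an empty one
    }
  }
}
```

Matching on `JString` instead of calling `compact(render(...))` avoids the quotes that rendering a JSON string would add, and a missing "provider" key falls through to `None` rather than raising an error.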

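You should also make sure that your UDF does not throw any exception (because that would cause the whole job to fail). The simplest approach is to return null and then:

  • If you want to investigate the failures: filter by df.provider.isNull()
  • If you only want to keep valid entries: filter by df.provider.isNotNull()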
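One way to apply the return-null advice is sketched below, under the same json4s 3.x assumption about `\\`. The names `SafeProviderSketch` and `fetchProviderOrNull` are hypothetical. The function treats a null input, unparseable JSON, and a missing "provider" key the same way: it returns null instead of throwing.

```scala
import scala.util.Try
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object SafeProviderSketch {
  // Returns the first device's provider, or null when the input is null,
  // is not valid JSON, or has no string-valued "provider" key.
  def fetchProviderOrNull(jsonStr: String): String =
    Option(jsonStr)                                   // null input becomes None
      .flatMap(s => Try(parse(s) \\ "conditions" \\ "devices").toOption) // parse errors become None
      .collect { case JArray(first :: _) => first \\ "provider" }        // first device only
      .collect { case JString(p) => p }               // keep plain string values
      .orNull                                         // Spark stores this as a SQL NULL
}
```

Wrapped with `udf(SafeProviderSketch.fetchProviderOrNull _)`, the resulting column can be inspected in Scala with `result.filter(col("provider").isNull)` or restricted to valid rows with `result.filter(col("provider").isNotNull)`.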