I have a DataFrame.
All columns in this DataFrame are of string type. Some of the columns contain JSON strings.
+--------+---------+--------------------------+
|event_id|event_key|                    rights|
+--------+---------+--------------------------+
|     410|(default)| {"conditions":[{"devic...|
+--------+---------+--------------------------+
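I want to parse each JSON string individually, extract one value from it, and add that value as a new column. I am using the Jackson parser (through json4s) to do this.
Here is the value of "rights":
{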
  "conditions": [
    {
      "devices": [
        {
          "connection": [
            "BROADBAND",
            "MOBILE"
          ],
          "platform": "IOS",
          "type": "MOBILE",
          "provider": "TELETV"
        },
        {
          "connection": [
            "BROADBAND",
            "MOBILE"
          ],
          "platform": "ANDROID",
          "type": "MOBILE",
          "provider": "TELETV"
        },
        {
          "connection": [
            "BROADBAND",
            "MOBILE"
          ],
          "platform": "IOS",
          "type": "TABLET",
          "provider": "TELETV"
        },
        {
          "connection": [
            "BROADBAND",
            "MOBILE"
          ],
          "platform": "ANDROID",
          "type": "TABLET",
          "provider": "TELETV"
        }
      ],
      "endDateTime": "2017-01-09T22:59:59.000Z",
      "inclusiveGeoTerritories": [
        "DE",
        "IT",
        "ZZ"
      ],
      "mediaType": "Linear",
      "offers": [
        {
          "endDateTime": "2017-01-09T22:59:59.000Z",
          "isRestartable": true,
          "isRecordable": true,
          "isCUTVable": false,
          "recordingMode": "UNIQUE",
          "retentionCUTV": "P7DT2H",
          "retentionNPVR": "P2Y6M5DT12H35M30S",
          "offerId": "MOTOGP-RACE",
          "offerType": "IPPV",
          "startDateTime": "2017-01-09T17:00:00.000Z"
        }
      ],
      "platformName": "USA",
      "startDateTime": "2017-01-09T17:00:00.000Z",
      "territory": "USA"
    }
  ]
}
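Now I want to create a new column in the existing DataFrame. The new column should be named "provider", and its value is found at this path in the JSON:
conditions -> devices -> provider
I want to do this for every row in the DataFrame. So I created a UDF and pass it the column that holds the JSON string; inside the UDF I want to parse the JSON string and return the value as a string.
My Spark code: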
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions._
import org.json4s._
import org.json4s.jackson.JsonMethods
import org.json4s.jackson.JsonMethods._
// ... some code to derive the base DataFrame `df` ...
val fetchProvider_udf = udf(fetchProvider _)
val result = df.withColumn("provider", fetchProvider_udf(col("rights")))
result.select("event_id", "event_key", "rights", "provider").show(10)

def fetchProvider(jsonStr: String): String = {
  val json = JsonMethods.parse(jsonStr)
  // \\ searches recursively, so this matches every "provider" key under "devices"
  val providerData = json \\ "conditions" \\ "devices" \\ "provider"
  compact(render(providerData))
}
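How do I handle the case where a key along the path is not available? Will it throw an exception? Say "conditions" and "devices" are present, but the "provider" key is not in the JSON string. How should I handle that?
Can anyone help me?
Expected output: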
+--------+---------+--------------------------+--------+
|event_id|event_key|                    rights|provider|
+--------+---------+--------------------------+--------+
|     410|(unknown)| {"conditions":[{"devic...|  TELETV|
+--------+---------+--------------------------+--------+
But I get the following output:
+--------+---------+--------------------------+---------------------------------------------------------------------------------+
|event_id|event_key|                    rights|                                                                         provider|
+--------+---------+--------------------------+---------------------------------------------------------------------------------+
|     410|(unknown)| {"conditions":[{"devic...|{"provider":"TELETV","provider":"TELETV","provider":"TELETV","provider":"TELETV"}|
+--------+---------+--------------------------+---------------------------------------------------------------------------------+
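Answer 0 (score: 0)
If you want to extract the value of the first provider, use the following in your UDF:
(json \\ "conditions" \\ "devices")(0) \\ "provider"
Your current code fetches all providers (as a single object with one "provider" entry per device) and then converts that whole object to a string as the UDF result.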
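To make the suggestion above concrete, here is a minimal sketch of the UDF body, assuming the same json4s imports as in the question. The object name `ProviderExample` and the function name `fetchFirstProvider` are made up for illustration. It pattern-matches on `JString` so the result is the bare value (TELETV) rather than the JSON-quoted form that `compact(render(...))` would produce. It assumes there is exactly one "devices" array in the document and that it is non-empty; an empty array would still throw here.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods

object ProviderExample {
  // Hypothetical helper: returns the provider of the first device,
  // or null when the key is absent.
  def fetchFirstProvider(jsonStr: String): String = {
    val json = JsonMethods.parse(jsonStr)
    // (0) selects the first element of the "devices" array
    (json \\ "conditions" \\ "devices")(0) \\ "provider" match {
      case JString(p) => p    // plain string, no surrounding quotes
      case _          => null // key missing or not a string
    }
  }
}
```

When "provider" is missing from the first device, the recursive `\\` lookup returns an empty object instead of throwing, so the function falls through to the null branch.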
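You should also make sure your UDF does not throw any exceptions (because that would cause the whole job to fail). The easiest way is to return null in that case and then check for it:
result("provider").isNull
result("provider").isNotNull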
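One way to get that behaviour, sketched under the assumption that any failure (malformed JSON, a null input, an empty "devices" array, a missing key) should simply yield null, is to wrap the lookup in `scala.util.Try` and return an `Option[String]`. Spark maps `None` to a null value in the resulting column. The names `SafeProviderExample`, `safeProvider`, `safeProviderUdf`, and `withProvider` are hypothetical.

```scala
import scala.util.Try
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}
import org.json4s._
import org.json4s.jackson.JsonMethods

object SafeProviderExample {
  // Hypothetical helper: None on any parse or lookup failure,
  // Some(provider) when the first device has a string "provider".
  def safeProvider(jsonStr: String): Option[String] =
    Try((JsonMethods.parse(jsonStr) \\ "conditions" \\ "devices")(0) \\ "provider")
      .toOption
      .collect { case JString(p) => p }

  // A UDF returning Option[String] yields a nullable string column.
  val safeProviderUdf = udf(safeProvider _)

  // Adds the "provider" column; assumes df has a string column "rights".
  def withProvider(df: DataFrame): DataFrame =
    df.withColumn("provider", safeProviderUdf(col("rights")))
}
```

Rows where the value could not be extracted can then be selected with `result.filter(col("provider").isNull)`, and the valid ones with `result.filter(col("provider").isNotNull)`.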