I have a Spark DataFrame:
Id | events
----------------
A | [{"itemIds":["1","21","3"],"eventtype":"sp"},{"eventtype":"hp"},{"itemIds":["5"],"eventtype":"ip"}]
B | [{"itemIds":["8","10"],"eventtype":"bp"},{"eventtype":"atc"}]
Here both the Id and events columns are of String type.
How can I convert the DataFrame above into the one below (filling in null wherever "itemIds" is absent)?
Id | itemIds | eventtype
---------------------------------
A | 1 | sp
A | 21 | sp
A | 3 | sp
A | null | hp
A | 5 | ip
B | 8 | bp
B | 10 | bp
B | null | atc
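Here the Id, itemIds, and eventtype columns are all of String type.
Answer 0 (score: 0)
Here is a solution that uses json4s to parse the JSON and then expands the items into one row each.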
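import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods
import spark.implicits._ // provides toDF and the encoders needed by flatMap (spark is the session from spark-shell)

// One element of the events array; itemIds is an Option because some events omit the field
case class Items(itemIds: Option[List[String]], eventtype: String)
// One output row
case class Item(id: String, itemId: String, etype: String)

val df = Seq(
  ("A", """[{"itemIds":["1","21","3"],"eventtype":"sp"},{"eventtype":"hp"},{"itemIds":["5"],"eventtype":"ip"}]"""),
  ("B", """[{"itemIds":["8","10"],"eventtype":"bp"},{"eventtype":"atc"}]""")
).toDF("id", "json")

val result = df.flatMap { r =>
  // json4s needs an implicit Formats in scope for extract
  implicit val formats: Formats = DefaultFormats
  JsonMethods.parse(r.getString(1)).extract[List[Items]].flatMap { i =>
    i.itemIds.filter(_.nonEmpty) match {
      case Some(ids) => ids.map(itemId => Item(r.getString(0), itemId, i.eventtype))
      case None      => List(Item(r.getString(0), null, i.eventtype)) // missing or empty itemIds: one row with null
    }
  }
}.toDF("Id", "itemIds", "eventtype") // rename to the column names asked for in the question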
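
As an alternative, here is a minimal sketch that avoids json4s and relies only on Spark's built-in `from_json` and `explode_outer` functions (both available since Spark 2.2, as far as I know). The object name `ExplodeEvents`, the helper `flattenEvents`, and the local-mode session are assumptions made for illustration; in spark-shell you would reuse the existing `spark` session instead.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, explode_outer, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object ExplodeEvents {
  // Schema of the JSON string: an array of {itemIds: [string], eventtype: string}
  val eventSchema: ArrayType = ArrayType(StructType(Seq(
    StructField("itemIds", ArrayType(StringType), nullable = true),
    StructField("eventtype", StringType, nullable = true)
  )))

  // Hypothetical helper: expects String columns "Id" and "events"
  def flattenEvents(df: DataFrame): DataFrame =
    df
      // Parse the JSON string and emit one row per event
      .withColumn("event", explode(from_json(col("events"), eventSchema)))
      // explode_outer keeps events whose itemIds is missing, yielding a null itemIds
      .select(
        col("Id"),
        explode_outer(col("event.itemIds")).as("itemIds"),
        col("event.eventtype").as("eventtype")
      )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("explode-events").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("A", """[{"itemIds":["1","21","3"],"eventtype":"sp"},{"eventtype":"hp"},{"itemIds":["5"],"eventtype":"ip"}]"""),
      ("B", """[{"itemIds":["8","10"],"eventtype":"bp"},{"eventtype":"atc"}]""")
    ).toDF("Id", "events")

    flattenEvents(df).show(false)
    spark.stop()
  }
}
```

The first `explode` drops nothing here because every row has at least one event; the `explode_outer` on `itemIds` is what should preserve the "hp" and "atc" events as rows with a null item. All three output columns remain of String type.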