Exploding a Spark DataFrame column that holds an array

Date: 2018-06-18 08:34:23

Tags: apache-spark dataframe

I have a Spark DataFrame:

Id   |   events
----------------
A    |  [{"itemIds":["1","21","3"],"eventtype":"sp"},{"eventtype":"hp"},{"itemIds":["5"],"eventtype":"ip"}]
B    |  [{"itemIds":["8","10"],"eventtype":"bp"},{"eventtype":"atc"}]

Here both the Id and events columns are of type String.

How can I convert the DataFrame above into the one below, filling in null wherever "itemIds" is absent?

Id    |  itemIds  |   eventtype
---------------------------------
A     |     1     |     sp
A     |     21    |     sp
A     |     3     |     sp
A     |    null   |     hp
A     |     5     |     ip
B     |     8     |     bp
B     |     10    |     bp
B     |     null  |     atc

Here the Id, itemIds, and eventtype columns are all of type String.

1 Answer:

Answer 0 (score: 0)

Here is a solution that parses the JSON with json4s and then expands the items.

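import org.json4s._
import org.json4s.jackson.JsonMethods.parse
// Needed for toDF and the Item encoder; `spark` is the active SparkSession (predefined in spark-shell)
import spark.implicits._

// One parsed event; "itemIds" may be missing from the JSON
case class Items(itemIds: Array[String], eventtype: String)
// One output row
case class Item(id: String, itemId: String, etype: String)

val df = Seq(("A","""[{"itemIds":["1","21","3"],"eventtype":"sp"},{"eventtype":"hp"},{"itemIds":["5"],"eventtype":"ip"}]"""),("B","""[{"itemIds":["8","10"],"eventtype":"bp"},{"eventtype":"atc"}]""")).toDF("id","json")

val result = df.flatMap { r =>
  // Declared inside the closure so the formats are created on the executors
  implicit val formats: Formats = DefaultFormats
  val id = r.getString(0)
  parse(r.getString(1)).extract[Array[Items]].toSeq.flatMap { i =>
    // A missing "itemIds" may come back as null or as an empty array; handle both
    val ids = Option(i.itemIds).getOrElse(Array.empty[String])
    if (ids.isEmpty) Seq(Item(id, null, i.eventtype)) // keep the event, with a null itemId
    else ids.toSeq.map(itemId => Item(id, itemId, i.eventtype))
  }
}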
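
As a possible alternative to the json4s approach above, the same reshaping can be sketched with Spark's built-in functions only. This assumes Spark 2.2 or later, where `from_json` accepts an array-of-struct schema and `explode_outer` is available. The object name `ExplodeEvents`, the helper `explodeEvents`, and the local `SparkSession` settings are illustrative choices, not part of the original answer; the column names `Id` and `events` follow the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, explode_outer, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object ExplodeEvents {
  // Schema of the JSON string in "events": an array of {itemIds?, eventtype}
  val eventSchema: ArrayType = ArrayType(StructType(Seq(
    StructField("itemIds", ArrayType(StringType), nullable = true),
    StructField("eventtype", StringType, nullable = true)
  )))

  // Hypothetical helper: expects string columns "Id" and "events"
  def explodeEvents(df: DataFrame): DataFrame =
    df
      // Parse the JSON string and emit one row per event
      .withColumn("event", explode(from_json(col("events"), eventSchema)))
      .select(
        col("Id"),
        // explode_outer keeps events whose itemIds is missing, producing a null
        explode_outer(col("event.itemIds")).as("itemIds"),
        col("event.eventtype").as("eventtype")
      )

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only
    val spark = SparkSession.builder().master("local[*]").appName("explode-events").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("A", """[{"itemIds":["1","21","3"],"eventtype":"sp"},{"eventtype":"hp"},{"itemIds":["5"],"eventtype":"ip"}]"""),
      ("B", """[{"itemIds":["8","10"],"eventtype":"bp"},{"eventtype":"atc"}]""")
    ).toDF("Id", "events")

    explodeEvents(df).show()
    spark.stop()
  }
}
```

The first `explode` is the plain variant because every row in the sample has at least one event; if `events` itself can be null or an empty array and those Ids should still appear, `explode_outer` could be used there as well.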