Parsing nested data with Scala

Date: 2017-11-13 11:41:34

Tags: scala apache-spark

My dataframe looks like this:

ColA  ColB     ColC  
1     [2,3,4] [5,6,7]

I need to transform it into the following:

ColA ColB ColC  
1    2    5  
1    3    6  
1    4    7  

Can someone help with the Scala code for this?

3 answers:

Answer 0 (score: 2)

You can pair the two array columns with a `zip` UDF and then `explode` the result, as follows:

import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._  // for toDF and the $"..." column syntax

val df = Seq(
  (1, Seq(2, 3, 4), Seq(5, 6, 7))
).toDF("ColA", "ColB", "ColC")

// UDF that pairs up the two arrays element-wise
def zip = udf(
  (x: Seq[Int], y: Seq[Int]) => x zip y
)

val df2 = df.select($"ColA", zip($"ColB", $"ColC").as("BzipC")).
  withColumn("BzipC", explode($"BzipC"))

val df3 = df2.select($"ColA", $"BzipC._1".as("ColB"), $"BzipC._2".as("ColC"))

df3.show
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
|   1|   2|   5|
|   1|   3|   6|
|   1|   4|   7|
+----+----+----+
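As a side note on the UDF approach: on Spark 2.4 and later, the built-in `arrays_zip` function can do the pairing without a hand-written UDF. A minimal sketch, assuming the same `df` as above (the struct field names `ColB`/`ColC` come from `arrays_zip` reusing the input column names):

```scala
import org.apache.spark.sql.functions.{arrays_zip, col, explode}

// arrays_zip (Spark 2.4+) pairs the arrays element-wise into an array of structs
val zipped = df
  .withColumn("z", explode(arrays_zip(col("ColB"), col("ColC"))))
  .select(col("ColA"), col("z.ColB").as("ColB"), col("z.ColC").as("ColC"))

zipped.show()
```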

Answer 1 (score: 0)

The idea I propose here is a bit more involved: use `map` to combine the two arrays `ColB` and `ColC`, then use the `explode` function to explode the combined array, and finally extract the exploded pairs into separate columns.

import org.apache.spark.sql.functions._
import scala.collection.mutable  // needed for mutable.WrappedArray below

val tempDF = df.map(row => {
  val colB = row(1).asInstanceOf[mutable.WrappedArray[Int]]
  val colC = row(2).asInstanceOf[mutable.WrappedArray[Int]]
  var array = Array.empty[(Int, Int)]
  for(loop <- 0 to colB.size-1){
    array = array :+ (colB(loop), colC(loop))
  }
  (row(0).asInstanceOf[Int], array)
})
  .toDF("ColA", "ColB")
  .withColumn("ColD", explode($"ColB"))

tempDF.withColumn("ColB", $"ColD._1").withColumn("ColC", $"ColD._2").drop("ColD").show(false)

This gives you the result:

+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
|1   |2   |5   |
|1   |3   |6   |
|1   |4   |7   |
+----+----+----+
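The imperative loop in the answer above just pairs elements position by position; in plain Scala the same pairing can be written with the standard library's `zip` method (a sketch independent of Spark, using literal sequences for illustration):

```scala
val colB = Seq(2, 3, 4)
val colC = Seq(5, 6, 7)

// Standard-library zip produces the same (Int, Int) pairs as the loop
val pairs = colB.zip(colC)
// pairs == Seq((2, 5), (3, 6), (4, 7))
```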

Answer 2 (score: 0)

您还可以结合使用HiveQL中的posexplodelateral view

sqlContext.sql(""" 
select 1 as colA, array(2,3,4) as colB, array(5,6,7) as colC 
""").registerTempTable("test")

sqlContext.sql(""" 
select 
    colA , b as colB, c as colC 
from 
    test 
lateral view 
    posexplode(colB) columnB as seqB, b 
lateral view 
    posexplode(colC) columnC as seqC, c 
where 
    seqB = seqC 
""" ).show

+----+----+----+
|colA|colB|colC|
+----+----+----+
|   1|   2|   5|
|   1|   3|   6|
|   1|   4|   7|
+----+----+----+

Credits: How to Extract Data From Files With JMeter ;)