Combining Apache Spark columns from an array with a struct inside an array of structs

Date: 2018-11-23 13:20:12

Tags: scala apache-spark apache-spark-sql

This is the schema of the incoming data stream. I am processing the data with Spark 2.3.2 Streaming.

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("status", StringType),
  StructField("data", StructType(Seq(
    StructField("resultType", StringType),
    StructField("result", ArrayType(StructType(Seq(
      StructField("metric", StructType(Seq(
        StructField("application", StringType),
        StructField("component", StringType),
        StructField("instance", StringType)))),
      // two-element array of strings: the desired value1 and value2
      StructField("value", ArrayType(StringType))
    ))))
  )))
))

This is how I apply the schema to each RDD of the DStream.

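// Assumes spark.implicits._ and org.apache.spark.sql.functions.from_json are imported.
val df = rdd.toDS()
  .selectExpr("cast (value as string) as myData")      // raw payload as a string
  .select(from_json($"myData", schema).as("myData"))   // parse the JSON with the schema
  .select($"myData.data.*")                            // promote the fields of `data`
  .select("result")                                    // keep only the array of structs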

The code above produces the following output:

{"result":[{"metric":{"application":"A","component":"S","instance":"tp01.net:9072"},"value":["1.542972576979E9","237006995456"]},
       {"metric":{"application":"A","component":"S","instance":"tp02.net:9072"},"value":["1.542972576979E9","237006995456"]},
       {"metric":{"application":"A","component":"S","instance":"tp03.net:9072"},"value":["1.542972576979E9","237006995456"]},
       {"metric":{"application":"B","component":"S","instance":"bp03.net:9072"},"value":["1.542972576979E9","270860144640"]},
       {"metric":{"application":"B","component":"S","instance":"bp04.net:9072"},"value":["1.542972576979E9","270860144640"]},
       {"metric":{"application":"B","component":"S","instance":"ps01.net:9072"},"value":["1.542972576979E9","135177400320"]}
 ]}

To extract features, however, I need to transform the above into the following DataFrame:

application     component       instance            value1              value2
A               S               tp01.net:9072       1.542972576979E9    237006995456
A               S               tp02.net:9072       1.542972576979E9    237006995456
A               S               tp03.net:9072       1.542972576979E9    237006995456
B               S               bp03.net:9072       1.542972576979E9    270860144640
B               S               bp04.net:9072       1.542972576979E9    270860144640
B               S               ps01.net:9072       1.542972576979E9    135177400320

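As you can see, each row is already an exploded row. Any ideas on how to select both the array values and the struct fields into a single DataFrame?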

Thanks
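
One possible way to reach the target layout is to `explode` the `result` array so that each struct becomes its own row, then pull the `metric` fields and the two `value` entries up as top-level columns. The following is a minimal batch-mode sketch, not a confirmed solution. The names `FlattenResult` and `flatten`, the local `SparkSession`, and the sample values `"success"` and `"vector"` are assumptions made for illustration and do not come from the question. It also assumes that `value` always holds exactly two elements.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Hypothetical helper object, introduced only for this sketch.
object FlattenResult {

  // Same schema as in the question.
  val schema: StructType = StructType(Seq(
    StructField("status", StringType),
    StructField("data", StructType(Seq(
      StructField("resultType", StringType),
      StructField("result", ArrayType(StructType(Seq(
        StructField("metric", StructType(Seq(
          StructField("application", StringType),
          StructField("component", StringType),
          StructField("instance", StringType)))),
        StructField("value", ArrayType(StringType))
      ))))
    )))
  ))

  // Expects a DataFrame with a `result` column (array of structs).
  // Returns one row per array element, with the struct fields and the
  // first two `value` entries as top-level columns.
  def flatten(df: DataFrame): DataFrame =
    df.select(explode(col("result")).as("r"))
      .select(
        col("r.metric.application").as("application"),
        col("r.metric.component").as("component"),
        col("r.metric.instance").as("instance"),
        col("r.value").getItem(0).as("value1"),  // null if the array is shorter
        col("r.value").getItem(1).as("value2"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")           // assumption: local run for the demo
      .appName("flatten-result")
      .getOrCreate()
    import spark.implicits._

    // One sample message shaped like the output in the question;
    // the status and resultType values are made up.
    val sample = Seq(
      """{"status":"success","data":{"resultType":"vector","result":[
        |{"metric":{"application":"A","component":"S","instance":"tp01.net:9072"},"value":["1.542972576979E9","237006995456"]},
        |{"metric":{"application":"B","component":"S","instance":"ps01.net:9072"},"value":["1.542972576979E9","135177400320"]}
        |]}}""".stripMargin.replace("\n", "")
    ).toDS()

    val df = sample
      .select(from_json(col("value"), schema).as("myData"))
      .select(col("myData.data.result").as("result"))

    flatten(df).show(false)
    spark.stop()
  }
}
```

In the streaming job, the same `flatten` step could presumably be applied to the `df` built from each RDD, since `explode` and `getItem` are ordinary column expressions. The resulting columns are still strings, so a `cast` would be needed before using `value1` and `value2` as numeric features.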

0 answers:

No answers yet