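Spark Scala: storing nested JSON as a structured table

Time: 2018-04-03 07:39:17

Tags: json scala apache-spark dataframe

I have a large number of nested JSON documents with more than 200 keys that I want to convert and store in a structured table.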

  |-- ip_address: string (nullable = true)
  |-- xs_latitude: double (nullable = true)
  |-- Applications: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- b_als_o_isehp: string (nullable = true)
  |    |    |-- b_als_p_isehp: string (nullable = true)
  |    |    |-- b_als_s_isehp: string (nullable = true)
  |    |    |-- l_als_o_eventid: string (nullable = true)
                 ....

I read the JSON and try to get the Applications array data for each ip_address:

 {"ip_address": 1512199720,"Applications": [{"s_pd": -1,"s_path": "NA", "p_pd": "temp0"}, {"s_pd": -1,"s_path": "root/hdfs", "p_pd": "temp1"},{"s_pd": -1,"s_path": "root/hdfs", "p_pd": "temp2"}],}

val data = spark.read.json("file:///root/users/data/s_json.json")
var appDf = data.withColumn("data", explode($"Applications")).select($"Applications.s_pd", $"Applications.s_path", $"Applications.p_pd", $"ip_address")
appDf.printSchema
// gives
root
  |-- s_pd: array (nullable = true)
  |    |-- element: string (containsNull = true)
  |-- s_path: array (nullable = true)
  |    |-- element: string (containsNull = true)
  |-- p_pd: array (nullable = true)
  |    |-- element: string (containsNull = true)
  |-- ip_address: string (nullable = true)

Each record in the resulting DataFrame contains arrays with the repeated values instead of one value per row. How can I get the records in tabular format?

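1 answer:

Answer 0 (score: 1)

Mistake

Your mistake is that you are selecting the nested fields from the original array column (Applications) instead of from the exploded struct column (data). Selecting a field through an array of structs returns an array of that field's values for each row, which is why every selected column has type array.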
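
The difference can be seen in a small sketch. It assumes Spark 2.2 or later (for reading JSON from a Dataset[String]), a local SparkSession, and a shortened in-memory copy of the sample record; the object name ArrayFieldDemo and the helpers viaArray and viaExploded are illustrative and not part of the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ArrayFieldDemo {
  // Field read through the original array column: one array per input row.
  def viaArray(df: DataFrame): DataFrame =
    df.select(col("Applications.s_path"))

  // Same field read through the exploded struct column: one scalar per element.
  def viaExploded(df: DataFrame): DataFrame =
    df.withColumn("data", explode(col("Applications")))
      .select(col("data.s_path"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local run, for illustration only
      .appName("array-field-demo")
      .getOrCreate()
    import spark.implicits._

    // Shortened copy of the sample record (two applications, no trailing comma).
    val sample = Seq(
      """{"ip_address": 1512199720, "Applications": [{"s_pd": -1, "s_path": "NA", "p_pd": "temp0"}, {"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp1"}]}"""
    ).toDS()
    val df = spark.read.json(sample)

    viaArray(df).printSchema()    // s_path is reported as an array
    viaExploded(df).printSchema() // s_path is reported as a plain field

    spark.stop()
  }
}
```

Note that withColumn("data", explode(...)) only adds the data column; the Applications column is still present and still an array, so any field read through it stays an array.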

Solution

You have to select from the exploded column:

var appDf = data.withColumn("data", explode($"Applications")).select($"ip_address", $"data.s_pd", $"data.s_path", $"data.p_pd")

This should give you:

+----------+----+---------+-----+
|ip_address|s_pd|s_path   |p_pd |
+----------+----+---------+-----+
|1512199720|-1  |NA       |temp0|
|1512199720|-1  |root/hdfs|temp1|
|1512199720|-1  |root/hdfs|temp2|
+----------+----+---------+-----+
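
Since the question mentions more than 200 keys, listing every field by hand may be impractical. The following is a minimal sketch, not the only way to do it, that expands every field of the exploded struct with data.* instead. It assumes Spark 2.2 or later, a local SparkSession, and an inline copy of the sample record with the trailing comma dropped; the object name FlattenApplications and the helper flatten are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object FlattenApplications {
  // Hypothetical helper: one output row per element of Applications,
  // with every field of the struct promoted to a top-level column.
  def flatten(df: DataFrame): DataFrame =
    df.withColumn("data", explode(col("Applications")))
      .select(col("ip_address"), col("data.*"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local run, for illustration only
      .appName("flatten-applications")
      .getOrCreate()
    import spark.implicits._

    // Inline copy of the sample record from the question.
    val sample = Seq(
      """{"ip_address": 1512199720, "Applications": [{"s_pd": -1, "s_path": "NA", "p_pd": "temp0"}, {"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp1"}, {"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp2"}]}"""
    ).toDS()

    val flat = flatten(spark.read.json(sample))
    flat.printSchema()
    flat.show(false)

    spark.stop()
  }
}
```

With data.* the column order follows the inferred struct schema rather than the order written in a hand-made select, so add an explicit select afterwards if the target table needs a fixed column order.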

I hope the answer is helpful.