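I have a large JSON dataset with more than 200 keys that I need to convert and store in a structured table. Its schema looks like this: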
|-- ip_address: string (nullable = true)
|-- xs_latitude: double (nullable = true)
|-- Applications: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b_als_o_isehp: string (nullable = true)
| | |-- b_als_p_isehp: string (nullable = true)
| | |-- b_als_s_isehp: string (nullable = true)
| | |-- l_als_o_eventid: string (nullable = true)
....
I want to read the JSON and, for each ip_address, get one row per element of its Applications array.
Sample data:
{"ip_address": 1512199720,"Applications": [{"s_pd": -1,"s_path": "NA", "p_pd": "temp0"}, {"s_pd": -1,"s_path": "root/hdfs", "p_pd": "temp1"},{"s_pd": -1,"s_path": "root/hdfs", "p_pd": "temp2"}]}
val data = spark.read.json("file:///root/users/data/s_json.json")
var appDf = data.withColumn("data",explode($"Applications")).select($"Applications.s_pd", $"Applications.s_path", $"Applications.p_pd", $"ip_address")
appDf.printSchema
/// gives
root
|-- s_pd: array (nullable = true)
| |-- element: string (containsNull = true)
|-- s_path: array (nullable = true)
| |-- element: string (containsNull = true)
|-- p_pd: array (nullable = true)
| |-- element: string (containsNull = true)
|-- ip_address: string (nullable = true)
Answer 0 (score: 1)
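Your mistake is that you are selecting the nested fields from the original Applications array column instead of from the exploded column, so every selected field comes back as an array.
You have to select from the exploded column (data):
val appDf = data.withColumn("data", explode($"Applications"))
  .select($"ip_address", $"data.s_pd", $"data.s_path", $"data.p_pd")
Calling appDf.show(false) should then give: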
+----------+----+---------+-----+
|ip_address|s_pd|s_path   |p_pd |
+----------+----+---------+-----+
|1512199720|-1  |NA       |temp0|
|1512199720|-1  |root/hdfs|temp1|
|1512199720|-1  |root/hdfs|temp2|
+----------+----+---------+-----+
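Since the question mentions more than 200 keys, listing every nested field by hand may get tedious. Below is a minimal sketch, not the only way to do it, that expands all fields of the exploded struct with a star selector. The object name ExplodeApplications and the helper flattenApplications are hypothetical names made up for this example, and the local SparkSession is only there so the snippet can run on its own (in spark-shell, spark already exists). It assumes Spark 2.2 or later, where spark.read.json accepts a Dataset[String].

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeApplications {

  // Hypothetical helper: returns one row per element of the Applications array,
  // with ip_address repeated and every field of the struct as its own column.
  def flattenApplications(data: DataFrame): DataFrame =
    data
      .withColumn("data", explode(col("Applications"))) // one row per array element
      .select(col("ip_address"), col("data.*"))         // expand all struct fields

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-applications-sketch")
      .getOrCreate()
    import spark.implicits._

    // One JSON record per line, same shape as the sample in the question.
    val sample = Seq(
      """{"ip_address": 1512199720, "Applications": [{"s_pd": -1, "s_path": "NA", "p_pd": "temp0"}, {"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp1"}, {"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp2"}]}"""
    ).toDS()

    val appDf = flattenApplications(spark.read.json(sample))
    appDf.printSchema()
    appDf.show(false)

    spark.stop()
  }
}
```

Two things to keep in mind: with the star selector the column order follows the inferred struct schema rather than the order you would write by hand, and explode drops rows whose Applications array is null or empty. If those rows must be kept, explode_outer (also available from Spark 2.2) is worth a look.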
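You should:
import fileinput, re

filename = 'test_ip.txt'
# Rewrite the file in place; the original is kept as test_ip.txt.bak
with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
    for line in file:
        # With inplace=True, print() writes back into the file
        print(re.sub(r"-h \S+ -u", "-h YOUR_NEW_IP_HERE -u", line), end='')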
I hope the answer is helpful.