我有一个具有这种结构的json文件
root
|-- labels: struct (nullable = true)
| |-- compute.googleapis.com/resource_name: string (nullable = true)
| |-- container.googleapis.com/namespace_name: string (nullable = true)
| |-- container.googleapis.com/pod_name: string (nullable = true)
| |-- container.googleapis.com/stream: string (nullable = true)
我想将四个.....googleapis.com/...
提取到四列中。
我尝试过:
import org.apache.spark.sql.functions._
df = df.withColumn("resource_name", df("labels.compute.googleapis.com/resource_name"))
.withColumn("namespace_name", df("labels.compute.googleapis.com/namespace_name"))
.withColumn("pod_name", df("labels.compute.googleapis.com/pod_name"))
.withColumn("stream", df("labels.compute.googleapis.com/stream"))
我也尝试过,使labels
成为一个数组,该数组解决了第一个错误,该错误指出子级别不是array
或map
df2 = df.withColumn("labels", explode(array(col("labels"))))
.select(col("labels.compute.googleapis.com/resource_name").as("resource_name"), col("labels.compute.googleapis.com/namespace_name").as("namespace_name"), col("labels.compute.googleapis.com/pod_name").as("pod_name"), col("labels.compute.googleapis.com/stream").as("stream"))
我仍然收到此错误
org.apache.spark.sql.AnalysisException: No such struct field compute in compute.googleapis.com/resource_name .....
我知道Spark
认为每个点都是一个嵌套的级别,但是我该如何格式化compute.googleapis.com/resource_name
识别为级别名称而不是多级别的spark
。
我也尝试按照此处所述解决问题
How to get Apache spark to ignore dots in a query?
但这还不能解决我的问题。我有labels.compute.googleapis.com/resource_name,将反引号添加到compute.googleapis.com/resource_name仍然会出现相同的错误。
答案 0 :(得分:0)
重命名列(或子级别),然后执行withColumn
val schema = """struct<resource_name:string, namespace_name:string, pod_name:string, stream:string>"""
val df1 = df.withColumn("labels", $"labels".cast(schema))
答案 1 :(得分:-1)
您可以使用反引号`来隔离包含特殊字符(例如'。')的名称。您需要在标签之后使用反引号,因为它是父标记。
val extracted = df.withColumn("resource_name", df("labels.`compute.googleapis.com/resource_name`"))
.withColumn("namespace_name", df("labels.`container.googleapis.com/namespace_name`"))
.withColumn("pod_name", df("labels.`container.googleapis.com/pod_name`"))
.withColumn("stream", df("labels.`container.googleapis.com/stream`"))
extracted.show(10, false)
输出:
+--------------------+-------------+--------------+--------+------+
|labels |resource_name|namespace_name|pod_name|stream|
+--------------------+-------------+--------------+--------+------+
|[RN_1,NM_1,PM_1,S_1]|RN_1 |NM_1 |PM_1 |S_1 |
+--------------------+-------------+--------------+--------+------+
更新1 完整的工作示例。
import org.apache.spark.sql.functions._
val j_1 =
"""
|{ "labels" : {
| "compute.googleapis.com/resource_name" : "RN_1",
| "container.googleapis.com/namespace_name" : "NM_1",
| "container.googleapis.com/pod_name" : "PM_1",
| "container.googleapis.com/stream" : "S_1"
| }
|}
""".stripMargin
val df = spark.read.json(Seq(j_1).toDS)
df.printSchema()
val extracted = df.withColumn("resource_name", df("labels.`compute.googleapis.com/resource_name`"))
.withColumn("namespace_name", df("labels.`container.googleapis.com/namespace_name`"))
.withColumn("pod_name", df("labels.`container.googleapis.com/pod_name`"))
.withColumn("stream", df("labels.`container.googleapis.com/stream`"))
extracted.show(10, false)