Extracting JSON data in Spark / Scala

Time: 2018-10-18 13:49:21

Tags: json scala apache-spark

I have a JSON file with this structure:

root
 |-- labels: struct (nullable = true)
 |    |-- compute.googleapis.com/resource_name: string (nullable = true)
 |    |-- container.googleapis.com/namespace_name: string (nullable = true)
 |    |-- container.googleapis.com/pod_name: string (nullable = true)
 |    |-- container.googleapis.com/stream: string (nullable = true)

I would like to extract the four .....googleapis.com/... fields into four columns.

I tried:

import org.apache.spark.sql.functions._
df = df.withColumn("resource_name", df("labels.compute.googleapis.com/resource_name"))
       .withColumn("namespace_name", df("labels.compute.googleapis.com/namespace_name"))
       .withColumn("pod_name", df("labels.compute.googleapis.com/pod_name"))
       .withColumn("stream", df("labels.compute.googleapis.com/stream"))

I also tried making labels an array, which fixed the first error, which said the child level was not an array or map:

df2 = df.withColumn("labels", explode(array(col("labels"))))
        .select(
          col("labels.compute.googleapis.com/resource_name").as("resource_name"),
          col("labels.compute.googleapis.com/namespace_name").as("namespace_name"),
          col("labels.compute.googleapis.com/pod_name").as("pod_name"),
          col("labels.compute.googleapis.com/stream").as("stream"))

but I still get this error:

org.apache.spark.sql.AnalysisException: No such struct field compute in compute.googleapis.com/resource_name .....

I know Spark treats each dot as a nested level, but how can I get Spark to recognize compute.googleapis.com/resource_name as a single level name rather than multiple nested levels?

I also tried to solve the problem as described here:

How to get Apache spark to ignore dots in a query?

But that did not solve my problem either. I have labels.compute.googleapis.com/resource_name, and adding backticks around compute.googleapis.com/resource_name still gives the same error.

2 Answers:

Answer 0 (score: 0)

Rename the column (or rather the sub-levels), then do the withColumn:

val schema = """struct<resource_name:string, namespace_name:string, pod_name:string, stream:string>"""
val df1 = df.withColumn("labels", $"labels".cast(schema))
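
The answer stops at the cast, so here is a sketch of how the full flow might look. It assumes a SparkSession named spark and rebuilds the question's data inline; since the cast renames the struct's fields positionally, the dotted names disappear and the sub-fields can be referenced without backticks:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("labels").getOrCreate()
import spark.implicits._  // needed for Seq(...).toDS

val j = """{"labels":{"compute.googleapis.com/resource_name":"RN_1","container.googleapis.com/namespace_name":"NM_1","container.googleapis.com/pod_name":"PM_1","container.googleapis.com/stream":"S_1"}}"""
val df = spark.read.json(Seq(j).toDS)

// Casting a struct to a struct with the same field types renames the
// fields by position, replacing the dotted names with plain ones.
val schema = "struct<resource_name:string, namespace_name:string, pod_name:string, stream:string>"
val df1 = df.withColumn("labels", $"labels".cast(schema))

// The renamed sub-fields can now be selected without backticks.
val flat = df1
  .withColumn("resource_name", $"labels.resource_name")
  .withColumn("namespace_name", $"labels.namespace_name")
  .withColumn("pod_name", $"labels.pod_name")
  .withColumn("stream", $"labels.stream")
flat.show(false)

Note the cast relies on field order, not field names, so the schema string must list the fields in the same order as the source struct.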

Answer 1 (score: -1)

You can use backticks ` to quote names that contain special characters such as '.'. The backticks need to go after labels, since that is the parent level.

val extracted = df.withColumn("resource_name", df("labels.`compute.googleapis.com/resource_name`"))
    .withColumn("namespace_name", df("labels.`container.googleapis.com/namespace_name`"))
    .withColumn("pod_name", df("labels.`container.googleapis.com/pod_name`"))
    .withColumn("stream", df("labels.`container.googleapis.com/stream`"))

  extracted.show(10, false)

Output:

+--------------------+-------------+--------------+--------+------+
|labels              |resource_name|namespace_name|pod_name|stream|
+--------------------+-------------+--------------+--------+------+
|[RN_1,NM_1,PM_1,S_1]|RN_1         |NM_1          |PM_1    |S_1   |
+--------------------+-------------+--------------+--------+------+

Update 1: a full working example.

import org.apache.spark.sql.functions._
import spark.implicits._  // needed for Seq(...).toDS below
val j_1 =
  """
    |{ "labels" : {
    |   "compute.googleapis.com/resource_name" : "RN_1",
    |   "container.googleapis.com/namespace_name" : "NM_1",
    |   "container.googleapis.com/pod_name" : "PM_1",
    |   "container.googleapis.com/stream" : "S_1"
    |             }
    |}
  """.stripMargin

  val df = spark.read.json(Seq(j_1).toDS)
  df.printSchema()

  val extracted = df.withColumn("resource_name", df("labels.`compute.googleapis.com/resource_name`"))
    .withColumn("namespace_name", df("labels.`container.googleapis.com/namespace_name`"))
    .withColumn("pod_name", df("labels.`container.googleapis.com/pod_name`"))
    .withColumn("stream", df("labels.`container.googleapis.com/stream`"))

  extracted.show(10, false)