Scala Apache Spark: non-standard characters in column names

Time: 2016-08-09 16:40:35

Tags: scala apache-spark apache-spark-sql

I am making the following call:

  propertiesDF.select(
        col("timestamp"), col("coordinates")(0) as "lon", 
        col("coordinates")(1) as "lat", 
        col("properties.tide (above mllw)") as "tideAboveMllw",
        col("properties.wind speed") as "windSpeed")

This gives me the following error:

  org.apache.spark.sql.AnalysisException: No such struct field tide (above mllw) in air temperature, atmospheric pressure, dew point, dominant wave period, mean wave direction, name, program name, significant wave height, tide (above mllw):, visibility, water temperature, wind direction, wind speed;

Now, such a struct field does exist. (The error message itself says so.)

Here is the schema:

 root
 |-- timestamp: long (nullable = true)
 |-- coordinates: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- properties: struct (nullable = true)
 |    |-- air temperature: double (nullable = true)
 |    |-- atmospheric pressure: double (nullable = true)
 |    |-- dew point: double (nullable = true)
          .
          .
          .
 |    |-- tide (above mllw):: string (nullable = true)
          .
          .
          .

The input is read as JSON, like this:

val df = sqlContext.read.json(dirName)

How do I handle the parentheses in the column name?

2 answers:

Answer 0 (score: 2):

You should avoid names like this in the first place, but you can split the access path:

val df = spark.range(1).select(struct(
  lit(123).as("tide (above mllw)"),
  lit(1).as("wind speed")
).as("properties"))

df.select(col("properties").getItem("tide (above mllw)"))

// or

df.select(col("properties")("tide (above mllw)"))

or wrap the problematic field in backticks:

df.select(col("properties.`tide (above mllw)`"))
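
The same backtick escaping also works in SQL-expression form, in case you prefer selectExpr (a small variant for illustration; it is not part of the original answer):

// Equivalent, using SQL-expression syntax; backticks escape the identifier here too.
df.selectExpr("properties.`tide (above mllw)`")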

Both solutions assume the data contains the following structure (based on the access path you use in your query):

df.printSchema
// root
//  |-- properties: struct (nullable = false)
//  |    |-- tide (above mllw): integer (nullable = false)
//  |    |-- wind speed: integer (nullable = false)
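
Applied to the select from the question, a minimal sketch of the fix might look as follows. Note one detail: the schema printed in the question shows the field as tide (above mllw): with a trailing colon, so that colon presumably belongs inside the backticks as well; adjust it if your actual field name differs.

  propertiesDF.select(
        col("timestamp"), col("coordinates")(0) as "lon",
        col("coordinates")(1) as "lat",
        // Backticks keep the parentheses (and the trailing colon shown in
        // the schema) from confusing the attribute parser:
        col("properties.`tide (above mllw):`") as "tideAboveMllw",
        col("properties.`wind speed`") as "windSpeed")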

Answer 1 (score: 0):

According to the documentation, you can try using single quotes, like this:

 propertiesDF.select(
        col("timestamp"), col("coordinates")(0) as "lon", 
        col("coordinates")(1) as "lat", 
        col("'properties.tide (above mllw)'") as "tideAboveMllw",
        col("properties.wind speed") as "windSpeed")
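
If it is unclear which quoting style your Spark version actually accepts, you can probe it against a throwaway frame built with the same awkward name (a sketch along the lines of the toy data in the first answer; the single-quote line is left commented out so you can test it yourself):

import org.apache.spark.sql.functions.{col, lit, struct}

// One-row frame with the awkward nested name, just for probing.
val probe = spark.range(1).select(struct(
  lit("2.1").as("tide (above mllw)")
).as("properties"))

// Backtick escaping, as in the first answer:
probe.select(col("properties.`tide (above mllw)`")).show()

// Single quotes, as suggested here; uncomment to try on your Spark version:
// probe.select(col("'properties.tide (above mllw)'")).show()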