Question

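I'm trying to read data from JSON that contains an array of lat/long values, something like [48.597315,-43.206085], and I want to parse them in Spark SQL as a single string. Is there a way to do that?

My JSON input looks like this.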

{"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}

I'm trying to push this to an RDBMS store, and when I try to cast position.coordinates to string it gives me

Can't get JDBC type for array<string>

The destination data type is nvarchar. Any help is appreciated!

Answer 1

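You can read the JSON file into a DataFrame, then 1) use concat_ws to join the lat/lon array into a single string column, and 2) use struct to reassemble the position struct-type column. [UPDATE] The same can also be done using Spark SQL. With the DataFrame API: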

// jsonfile:
// {"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}

import org.apache.spark.sql.functions._
import spark.implicits._  // enables the $"col" syntax; already in scope in spark-shell
val df = spark.read.json("/path/to/jsonfile")

// printSchema:
// root
//  |-- id: string (nullable = true)
//  |-- position: struct (nullable = true)
//  |    |-- coordinates: array (nullable = true)
//  |    |    |-- element: double (containsNull = true)
//  |    |-- type: string (nullable = true)

df.withColumn("coordinates", concat_ws(",", $"position.coordinates")).
  select($"id", struct($"coordinates", $"position.type").as("position")).
  show(false)
// +-----+----------------------------+
// |id   |position                    |
// +-----+----------------------------+
// |11700|[48.597315,-43.206085,Point]|
// +-----+----------------------------+

// printSchema:
// root
//  |-- id: string (nullable = true)
//  |-- position: struct (nullable = false)
//  |    |-- coordinates: string (nullable = false)
//  |    |-- type: string (nullable = true)
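The same flattening can also be written in Spark SQL. The following is only a sketch under a few assumptions: the view name position_table and the output column names position_type and position_coordinates are made up for illustration, the sample record is inlined instead of being read from a file, and the explicit cast to array<string> is a precaution so that concat_ws does not depend on implicit casting. It selects flat string columns rather than rebuilding the struct, because as far as I know the JDBC writer cannot map a struct column any more than an array, so flat columns are what an nvarchar target needs.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("coords-to-string-sql")
  .getOrCreate()
import spark.implicits._

// Inline copy of the sample record from the question, so no file path is needed.
val df = spark.read.json(Seq(
  """{"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}"""
).toDS())

// "position_table" is a hypothetical view name.
df.createOrReplaceTempView("position_table")

// Join the array elements with a comma; every output column is a plain string.
val flat = spark.sql("""
  SELECT id,
         position.type AS position_type,
         concat_ws(',', CAST(position.coordinates AS array<string>)) AS position_coordinates
  FROM position_table
""")

flat.printSchema()
flat.show(false)
// position_coordinates holds "48.597315,-43.206085" for the sample record
```

From here, flat can be handed to df.write.jdbc(...) or an equivalent writer, since it no longer contains array or struct columns.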

Answer 2

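You have to convert the given column to a string before loading it into the target data source. For example, the following code creates a new column named position.coordinates whose value is the array of doubles joined into one string, by casting the array to a string (in effect its toString form) and then stripping the brackets. Note that withColumn adds this as a new top-level column whose name contains a dot; it does not replace the nested field inside position.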

df.withColumn("position.coordinates", regexp_replace($"position.coordinates".cast("string"), "\\[|\\]", ""))

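Alternatively, you can use a UDF to create a custom transformation function on Row objects. That way you can keep the nested structure of the column. The following source (answer number 2) can give you an idea of how to adapt UDFs to your case: Spark UDF with nested structure as input parameter.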
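To make the UDF idea concrete, here is a rough sketch. It is not the code from the linked answer: instead of taking a whole Row, the UDF below takes only the coordinates array, which is a simpler variant of the same idea. The helper name coordsToString is hypothetical, and the sample record is inlined so the snippet does not depend on a file path.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, udf}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("coords-to-string-udf")
  .getOrCreate()
import spark.implicits._

// Inline copy of the sample record from the question.
val df = spark.read.json(Seq(
  """{"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}"""
).toDS())

// Hypothetical helper: joins the doubles with a comma and passes a null array through as null.
val coordsToString = udf { (coords: Seq[Double]) =>
  if (coords == null) null else coords.mkString(",")
}

// Rebuild the position struct so the nested layout is preserved,
// with coordinates now a string instead of array<double>.
val result = df.select(
  $"id",
  struct(
    coordsToString($"position.coordinates").as("coordinates"),
    $"position.type".as("type")
  ).as("position")
)

result.printSchema()
result.show(false)
```

The built-in concat_ws from Answer 1 is usually preferable where it is enough, since Spark can optimize built-in functions but treats a UDF as a black box; the UDF is mainly useful when the string format needs custom logic.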

Converting an array of Doubles to a String in Spark SQL
