Question

I am working with Spark DataFrames and trying to deduplicate rows across DataFrames that have the same schema.

The schema of the DataFrame before saving it to JSON is:

root
 |-- startTime: long (nullable = false)
 |-- name: string (nullable = true)

The schema of the DataFrame after loading it from the JSON file is:

root
 |-- name: string (nullable = true)
 |-- startTime: long (nullable = false)

I save the JSON with:

newDF.write.json(filePath)

and read it back with:

existingDF = sqlContext.read.json(filePath)

After a unionAll:

existingDF.unionAll(newDF).distinct()

or an except:

newDF.except(existingDF)

deduplication fails because the schema has changed.

Can I avoid this schema change? Is there a way to preserve (or enforce) the column order of the schema when saving to a JSON file and loading it back?

Answer 1

I implemented a workaround that converts the schema back to what I need:

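import org.apache.spark.sql.types.{LongType, StructField, StructType}
// jsonDF is assumed to be the DataFrame read back from JSON; make startTime a non-nullable long again.
val newSchema = StructType(jsonDF.schema.map {
  case StructField(name, _, _, metadata) if name == "startTime" =>
    StructField(name, LongType, nullable = false, metadata)
  case other => other
})
// select restores the original column order (startTime, name).
val existingDF = sqlContext.createDataFrame(jsonDF.rdd, newSchema).select("startTime", "name")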
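
A possible alternative, offered as a sketch rather than a verified fix: `unionAll` and `except` match columns by position, so restoring the column order may be enough. The order can be restored by passing the original schema to the reader, or by selecting the columns in the order of a reference DataFrame. The object name `SchemaAlignment` and its helper names are hypothetical, and the code assumes the Spark 1.x `SQLContext` API used in the question. Nullability flags may still differ after the read, which this sketch does not address.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

// Hypothetical helpers, not part of Spark itself.
object SchemaAlignment {

  // Read the JSON file with the schema of the DataFrame that was written,
  // so the column order is not left to JSON schema inference.
  def readWithSchemaOf(sqlContext: SQLContext,
                       reference: DataFrame,
                       filePath: String): DataFrame =
    sqlContext.read.schema(reference.schema).json(filePath)

  // Reorder the columns of `df` by name so they follow the order of `reference`.
  // Assumes both DataFrames contain the same column names.
  def alignColumns(df: DataFrame, reference: DataFrame): DataFrame =
    df.select(reference.columns.map(col): _*)

  // Rows of `newDF` that are not already in `existingDF`,
  // compared only after the column order has been aligned.
  def newRows(newDF: DataFrame, existingDF: DataFrame): DataFrame =
    newDF.except(alignColumns(existingDF, newDF))
}
```

For example, `SchemaAlignment.newRows(newDF, sqlContext.read.json(filePath))` would return the rows of `newDF` that are not yet in the file, whatever column order the JSON reader produced.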
