Question

I want to specify a schema when reading from JSON, but mapping a number to a Double fails. I also tried FloatType and IntegerType, with no luck!

When the schema is inferred, customerid is set to String, and I would like to cast it to Double.

So df1 comes out corrupted, while df2 displays the value.

Also, I need this to be generic, because I want to apply it to any JSON. The schema below is just an example of the problem I am facing.

import org.apache.spark.sql.types.{BinaryType, StringType, StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
val testSchema = StructType(Array(StructField("customerid",DoubleType)))
val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
df1.show(1)
df2.show(1)

Any help would be appreciated. I am sure I am missing something obvious, but for the life of me I cannot tell what it is!

Let me clarify: I am loading a file that was saved using sparkContext.newAPIHadoopRDD.

So I am converting an RDD[JsonObject] to a DataFrame and applying the schema during that conversion.

Answer 1

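JSON fields enclosed in double quotes are treated as String. How about casting the column to Double? This casting approach can be made generic if you can say which columns are expected to be cast to Double.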

// df2 is the DataFrame read with the inferred schema, where customerid is a String
df2.select(df2("customerid").cast(DoubleType)).show()
+----------+
|customerid|
+----------+
|  535137.0|
+----------+
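
To make the cast generic, one option is to read the JSON with the inferred schema and then cast each top-level column to whatever type a target schema declares for it. The following is only a minimal sketch under some assumptions: `castToSchema` and `CastToSchemaSketch` are hypothetical names (not part of Spark), only top-level fields are handled, column names are assumed not to contain dots, and a local SparkSession on Spark 2.2 or later is assumed so that `spark.read.json` accepts a `Dataset[String]`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object CastToSchemaSketch {
  // Hypothetical helper: cast every top-level column named in `target` to the
  // type declared there; columns not mentioned in `target` pass through unchanged.
  def castToSchema(df: DataFrame, target: StructType): DataFrame = {
    val wanted = target.fields.map(f => f.name -> f.dataType).toMap
    val projected = df.columns.map { name =>
      wanted.get(name) match {
        case Some(dataType) => col(name).cast(dataType).as(name)
        case None           => col(name)
      }
    }
    df.select(projected: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cast-sketch")
      .getOrCreate()
    import spark.implicits._

    // Read with the inferred schema first, so the quoted number arrives as a String.
    val raw = spark.read.json(Seq("""{"customerid":"535137"}""").toDS())
    val testSchema = StructType(Array(StructField("customerid", DoubleType)))

    // Then cast the columns to the types the target schema asks for.
    val typed = castToSchema(raw, testSchema)
    typed.printSchema()
    typed.show()

    spark.stop()
  }
}
```

The idea is to keep the schema as the single description of the desired types while letting the reader accept the quoted values as strings first. How a string that is not a valid number is handled depends on the Spark version and settings (with ANSI mode off the cast should yield null rather than fail), so it is worth checking on your own data.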

Specifying a schema on JSON via Spark
