Question

One of the JSON fields (AGE below) is meant to be a number, but it is represented as null and shows up as a string in the DataFrame's printSchema output.

Input JSON file

{"AGE":null,"NAME":"abc","BATCH":190}
{"AGE":null,"NAME":"abc","BATCH":190}

Spark code and output

val df = spark.read.json("/home/white/tmp/a.json")
df.printSchema()
df.show()

*********************
OUTPUT
*********************
root
 |-- BATCH: long (nullable = true)
 |-- AGE: string (nullable = true)
 |-- NAME: string (nullable = true)

+-----+----+----+
|BATCH| AGE|NAME|
+-----+----+----+
|  190|null| abc|
|  190|null| abc|
+-----+----+----+

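I want AGE to be a long. Currently I achieve this by creating a new StructType in which the AGE field is a Long and recreating the DataFrame with df.sqlContext.createDataFrame(df.rdd, newSchema). Can this be done directly through the spark.read.json API?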
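For reference, the workaround described above might look roughly like the following sketch. It assumes Spark 2.2 or later and uses an in-memory Dataset holding the two sample records in place of a.json so that it runs without the file; the app name and variable names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Local session for illustration only (in spark-shell, `spark` already exists).
val spark = SparkSession.builder().master("local[*]").appName("null-age-workaround").getOrCreate()
import spark.implicits._

// The two records from the question, held in memory instead of a.json.
val lines = Seq(
  """{"AGE":null,"NAME":"abc","BATCH":190}""",
  """{"AGE":null,"NAME":"abc","BATCH":190}"""
).toDS()

val df = spark.read.json(lines) // AGE is inferred as string here

// Copy the inferred schema, swapping only the type of the AGE field for LongType.
val newSchema = StructType(df.schema.map {
  case StructField("AGE", _, nullable, metadata) => StructField("AGE", LongType, nullable, metadata)
  case other => other
})

// Re-create the DataFrame over the same rows with the adjusted schema.
val fixed = spark.createDataFrame(df.rdd, newSchema)
fixed.printSchema()
```

This only holds up because every AGE value is null. A row that carried a real string in AGE would not match the declared LongType and would be expected to fail at runtime.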

Answer 1

I think the simplest approach is the following:

import org.apache.spark.sql.types.LongType
spark.read.json("/home/white/tmp/a.json").withColumn("AGE", 'AGE.cast(LongType))

This produces the following schema:

root
 |-- AGE: long (nullable = true)
 |-- BATCH: long (nullable = true)
 |-- NAME: string (nullable = true)

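Spark makes a best guess at the types, and it makes sense that it would see a null in the JSON and think "string", since String is on the nullable AnyRef side of the Scala object hierarchy while Long is on the non-nullable AnyVal side. You just need to cast the column to get Spark to treat the data the way you want.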
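To see the inference and the cast side by side, here is a minimal sketch. It assumes Spark 2.2 or later, reads the question's two records from an in-memory Dataset rather than from a.json, and uses col("AGE") instead of the 'AGE symbol syntax; the variable names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Local session for illustration only (in spark-shell, `spark` already exists).
val spark = SparkSession.builder().master("local[*]").appName("cast-age").getOrCreate()
import spark.implicits._

// The two records from the question, held in memory instead of a.json.
val lines = Seq(
  """{"AGE":null,"NAME":"abc","BATCH":190}""",
  """{"AGE":null,"NAME":"abc","BATCH":190}"""
).toDS()

// Schema inference: a column that only ever contains null ends up as string.
val inferred = spark.read.json(lines)
inferred.printSchema()

// Casting replaces the AGE column in place; the nulls stay null.
val casted = inferred.withColumn("AGE", col("AGE").cast(LongType))
casted.printSchema()
```

The cast is applied after the read, so the JSON is still parsed with the inferred schema; only the resulting column type changes.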

As an aside, why are you using Long rather than Int for ages? Those people must eat really healthily.

Answer 2

You can create a case class and supply it to the read.json call to be populated. This gives you a Dataset (rather than a DataFrame).
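One way this could look is sketched below. The Person case class is hypothetical (it is not from the original answer), Spark 2.2 or later is assumed, and the records are read from an in-memory Dataset instead of a.json. The schema is derived from the case class and passed to the reader, because calling .as[Person] on the inferred string column alone would likely be rejected as an unsafe cast.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical case class; Option[Long] lets AGE hold the JSON nulls.
case class Person(AGE: Option[Long], NAME: String, BATCH: Long)

// Local session for illustration only (in spark-shell, `spark` already exists).
val spark = SparkSession.builder().master("local[*]").appName("case-class-read").getOrCreate()
import spark.implicits._

// The two records from the question, held in memory instead of a.json.
val lines = Seq(
  """{"AGE":null,"NAME":"abc","BATCH":190}""",
  """{"AGE":null,"NAME":"abc","BATCH":190}"""
).toDS()

// Derive the read schema from the case class so AGE is parsed as a long.
val personSchema = Encoders.product[Person].schema

val people: Dataset[Person] = spark.read.schema(personSchema).json(lines).as[Person]
people.printSchema()
```

In a compiled application the case class would normally be declared at the top level of a file rather than inside a method, so that Spark can derive an encoder for it.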


Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-datasets

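Another option is to write your own InputReader instead of using the standard JSON reader. The last option, which you are already using, is to add an extra step that converts the types.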

Answer 3

If you already know the types, I suggest reading with a predefined schema.

import org.apache.spark.sql.types._
val schema = StructType(List(
    StructField("AGE", IntegerType, nullable = true),
    StructField("BATCH", LongType, nullable = true),  // BATCH is numeric (190) in the sample data
    StructField("NAME", StringType, nullable = true)
))

spark.read.schema(schema).json("/home/white/tmp/a.json")
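As a shorter variant of the same idea, the schema can also be written as a DDL string. The sketch below assumes Spark 2.3 or later (where, as far as I know, DataFrameReader.schema accepts a DDL string) and declares AGE as BIGINT because the question asks for a long; the in-memory Dataset stands in for a.json and the names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only (in spark-shell, `spark` already exists).
val spark = SparkSession.builder().master("local[*]").appName("ddl-schema-read").getOrCreate()
import spark.implicits._

// The two records from the question, held in memory instead of a.json.
val lines = Seq(
  """{"AGE":null,"NAME":"abc","BATCH":190}""",
  """{"AGE":null,"NAME":"abc","BATCH":190}"""
).toDS()

// With an explicit schema Spark skips inference, so AGE is read as a long directly.
val typed = spark.read.schema("AGE BIGINT, BATCH BIGINT, NAME STRING").json(lines)
typed.printSchema()
typed.show()
```

Supplying the schema up front also avoids the extra pass over the data that inference would otherwise need.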

How to load a JSON field with null values as a number in a DataFrame
