Question

My JSON:

{"apps": {"app": [{"id": "id1","user": "hdfs"}, {"id": "id2","user": "yarn"}]}}

Schema:

root 
|-- apps: struct (nullable = true) 
| |-- app: array (nullable = true) 
| | |-- element: struct (containsNull = true) 
| | | |-- id: String (nullable = true) 
| | | |-- name: String (nullable = true)

My code:

StructType schema = new StructType()
                .add("apps",(new StructType()
                .add("app",(new StructType()))
                .add("element",new StructType().add("id",new StringType())add("user",new StringType())
                        )));
Dataset<Row> df = sparkSession.read().schema(schema).json(<path_to_json>);

It gives me this error:

Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.StringType@1fca53a7 (of class org.apache.spark.sql.types.StringType)

df.show() should give me:

id  user
id1 hdfs
id2 yarn

Answer 1

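You don't need to provide a schema when reading the data; Spark can infer it automatically. To get the desired output, though, the data has to be reshaped a little.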

First, read the data:

Dataset<Row> df = sparkSession.read().json("<path_to_json>");

Use explode to put each array element on its own row, then use select to unpack the struct into separate columns.

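// Java needs col(...) instead of Scala's $"..."; import static org.apache.spark.sql.functions.*
Dataset<Row> result = df
  .withColumn("app", explode(col("apps.app")))  // one row per array element
  .select("app.*");                             // expand the struct into columns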

This should give you a DataFrame in the expected format.
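To make the reshaping concrete, here is a plain-Java sketch (no Spark involved) that mimics what explode followed by select("app.*") does to the sample record from the question. The class and method names are made up for illustration; it only models the row logic, not Spark's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExplodeSketch {

    // Mimics explode on apps.app followed by select("app.*"):
    // every element of the nested array becomes its own flat row.
    public static List<Map<String, String>> explodeApps(
            Map<String, List<Map<String, String>>> apps) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (Map<String, String> app : apps.get("app")) {
            rows.add(new LinkedHashMap<>(app)); // one output row per array element
        }
        return rows;
    }

    // Builds the "apps" struct of the sample record from the question.
    public static Map<String, List<Map<String, String>>> sampleApps() {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("id", "id1");
        first.put("user", "hdfs");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("id", "id2");
        second.put("user", "yarn");

        List<Map<String, String>> appList = new ArrayList<>();
        appList.add(first);
        appList.add(second);

        Map<String, List<Map<String, String>>> apps = new LinkedHashMap<>();
        apps.put("app", appList);
        return apps;
    }

    public static void main(String[] args) {
        // Print each flattened row as "id user".
        for (Map<String, String> row : explodeApps(sampleApps())) {
            System.out.println(row.get("id") + " " + row.get("user"));
        }
    }
}
```

The single input record holds an array with two elements, so the result has two rows, each carrying the id and user fields as top-level columns.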

Answer 2

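@saidu's answer is correct. Although Spark infers the schema automatically, it is still advisable to provide the schema explicitly. Inference works in this case only because both fields are strings. Consider an example where the first value of id is an integer: inference would then type the column as long.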

Answer 3

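I ran into a similar problem, and using the automatically inferred schema was not an option (it performs worse). Apparently the error occurs because you are using new StringType() to construct the native type. Instead, you should use the public members of the DataTypes singleton: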

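// import org.apache.spark.sql.types.DataTypes;
// import org.apache.spark.sql.types.StructType;
StructType schema = new StructType()
  .add("apps", new StructType()
    // DataTypes.createArrayType is the Java-facing factory for array types
    .add("app", DataTypes.createArrayType(new StructType()
      .add("id", DataTypes.StringType)
      .add("user", DataTypes.StringType))));  // "user" matches the sample JSON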

Dataset<Row> df = sparkSession
  .read()
  .schema(schema)
  .json("<path_to_json>");
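Combining an explicit schema with the explode step from Answer 1, a minimal end-to-end sketch might look like the following. It assumes Spark SQL is on the classpath and that the file holds one JSON object per line. The class name AppsJsonReader, the application name, the local master, and reading the path from the first command-line argument are illustrative choices, not part of the original posts. The inner field is called user so that it matches the sample JSON.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class AppsJsonReader {

    // Explicit schema for {"apps": {"app": [{"id": ..., "user": ...}, ...]}}.
    public static StructType buildSchema() {
        StructType app = new StructType()
            .add("id", DataTypes.StringType)
            .add("user", DataTypes.StringType);
        return new StructType()
            .add("apps", new StructType()
                .add("app", DataTypes.createArrayType(app)));
    }

    public static void main(String[] args) {
        // Assumption: the JSON path is passed as the first argument.
        String path = args[0];

        SparkSession spark = SparkSession.builder()
            .appName("apps-json-sketch") // hypothetical application name
            .master("local[*]")          // assumption: local test run
            .getOrCreate();

        // Read with the explicit schema, so no inference pass is needed.
        Dataset<Row> df = spark.read().schema(buildSchema()).json(path);

        // One row per array element, then expand the struct into columns.
        df.withColumn("app", explode(col("apps.app")))
          .select("app.*")
          .show();

        spark.stop();
    }
}
```

Keeping the schema in its own method makes it easy to inspect (for example with buildSchema().printTreeString()) before pointing the reader at real data.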

Parsing a JSON file with the specific format "struct of array of structs" into a Spark DataFrame
