How to infer the schema of a JSON file?

Time: 2018-06-08 12:28:03

Tags: java json apache-spark spark-streaming

I have the following string in Java:

{
    "header": {
        "gtfs_realtime_version": "1.0",
        "incrementality": 0,
        "timestamp": 1528460625,
        "user-data": "metra"
    },
    "entity": [{
            "id": "8424",
            "vehicle": {
                "trip": {
                    "trip_id": "UP-N_UN314_V1_D",
                    "route_id": "UP-N",
                    "start_time": "06:17:00",
                    "start_date": "20180608",
                    "schedule_relationship": 0
                },
                "vehicle": {
                    "id": "8424",
                    "label": "314"
                },
                "position": {
                    "latitude": 42.10085,
                    "longitude": -87.72896
                },
                "current_status": 2,
                "timestamp": 1528460601
            }
        }
    ]
}

It represents a JSON document. I want to infer its schema as a Spark DataFrame for a streaming application.

How can I split the fields of the string, similarly to a CSV document (where I can call .split(""))?

2 answers:

Answer 0: (score: 2)

Quoting the official documentation, section "Schema inference and partition of streaming DataFrames/Datasets":

  

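By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.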

So you can use the spark.sql.streaming.schemaInference configuration property to enable schema inference. I am not sure whether it works for JSON files.
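As a minimal sketch of that setting in PySpark (not part of the original answer): the function name read_stream_with_inference and the input_dir argument are hypothetical, and spark is assumed to be an existing SparkSession. Whether inference handles a multi-line JSON document like the one in the question is not confirmed here.

```python
def read_stream_with_inference(spark, input_dir):
    """Turn on schema inference for file-based streaming sources, then open a JSON stream.

    `spark` is assumed to be an existing SparkSession; `input_dir` is a
    hypothetical directory that already contains JSON files to infer from.
    """
    # The property quoted from the Structured Streaming guide.
    spark.conf.set("spark.sql.streaming.schemaInference", "true")
    # No .schema(...) call: Spark infers the schema from files present at start.
    return spark.readStream.json(input_dir)
```

The documentation recommends this only for ad-hoc use, because the inferred schema depends on whichever files are present when the query starts.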

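What I usually do is load a single file (in a batch query, before starting the streaming query) to infer the schema. That should work in your case. Just do the following.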

// I'm leaving converting Scala to Java as a home exercise
val jsonSchema = spark
  .read
  .option("multiLine", true) // <-- the trick
  .json("sample.json")
  .schema

scala> jsonSchema.printTreeString
root
 |-- entity: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- vehicle: struct (nullable = true)
 |    |    |    |-- current_status: long (nullable = true)
 |    |    |    |-- position: struct (nullable = true)
 |    |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |-- timestamp: long (nullable = true)
 |    |    |    |-- trip: struct (nullable = true)
 |    |    |    |    |-- route_id: string (nullable = true)
 |    |    |    |    |-- schedule_relationship: long (nullable = true)
 |    |    |    |    |-- start_date: string (nullable = true)
 |    |    |    |    |-- start_time: string (nullable = true)
 |    |    |    |    |-- trip_id: string (nullable = true)
 |    |    |    |-- vehicle: struct (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- label: string (nullable = true)
 |-- header: struct (nullable = true)
 |    |-- gtfs_realtime_version: string (nullable = true)
 |    |-- incrementality: long (nullable = true)
 |    |-- timestamp: long (nullable = true)
 |    |-- user-data: string (nullable = true)

The trick is to use the multiLine option, so that the whole file is treated as a single record from which the schema is inferred.
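The two-step pattern described above (infer once in a batch read, then pass the schema explicitly to the streaming reader) might look like this in PySpark. This is a sketch under assumptions rather than the answer's own code: the function names and the input_dir argument are hypothetical, and spark is assumed to be an existing SparkSession.

```python
def infer_schema_from_sample(spark, sample_path):
    """Batch-read one representative JSON file and return its inferred schema."""
    # multiLine makes Spark parse the whole file as one JSON record
    # instead of expecting one record per line.
    return spark.read.option("multiLine", True).json(sample_path).schema


def open_json_stream(spark, schema, input_dir):
    """Open a streaming JSON source with an explicit, previously inferred schema."""
    return (
        spark.readStream
        .schema(schema)             # fixed schema, so nothing is inferred at stream start
        .option("multiLine", True)  # incoming files have the same multi-line layout
        .json(input_dir)
    )
```

With a real session, infer_schema_from_sample(spark, "sample.json") would run once before the query starts, and its result would be passed to open_json_stream together with the directory being watched.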

Answer 1: (score: -1)

Use

df = spark.read.json(r's3://mypath/', primitivesAsString='true')
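To illustrate what the primitivesAsString option of PySpark's DataFrameReader.json changes, here is a toy pure-Python model of schema inference. It is not Spark's implementation: the infer_type function, the handling of empty arrays, and the use of only the first array element are simplifying assumptions made for this sketch.

```python
import json


def infer_type(value, primitives_as_string=False):
    """Describe a parsed JSON value with Spark-like type names (toy model only)."""
    if isinstance(value, dict):
        # Structs: field names sorted, as in the printTreeString output above.
        return {key: infer_type(value[key], primitives_as_string) for key in sorted(value)}
    if isinstance(value, list):
        # Arrays: toy assumption, look at the first element only.
        return [infer_type(value[0], primitives_as_string)] if value else ["string"]
    if primitives_as_string or isinstance(value, str):
        return "string"
    if isinstance(value, bool):  # must come before int, since bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "null"


# A trimmed-down piece of the document from the question.
sample = json.loads('{"header": {"incrementality": 0, "user-data": "metra"}, '
                    '"entity": [{"id": "8424", "position": {"latitude": 42.10085}}]}')
print(infer_type(sample))
print(infer_type(sample, primitives_as_string=True))
```

With the flag set, every leaf is reported as a string while the struct and array nesting is kept, which is why this option yields a schema but drops the numeric types shown in the first answer.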