Spark JSON schema with variable keys

Asked: 2018-07-31 05:51:25

Tags: java json apache-spark jsonlines

I have a JSON file that looks like this:

{
  "123": [
    {
      "id": "123",
      "info": {
        "op": {
          "m": 1,
          "q": 2
        },
        "li": [
          "a",
          "b"
        ],
        "ad": [
          {
            "m": 1,
            "q": 2,
            "t": "text"
          },
          {
            "m": 1,
            "q": 2,
            "t": "abc"
          }
        ]
      },
      "dt": 1532494800000,
      "et": 1532494800000
    },
    {
      "id": "123",
      "info": {
        "op": {
          "m": 2,
          "q": 1
        },
        "li": [
          "a",
          "b"
        ],
        "ad": [
          {
            "m": 2,
            "q": 1,
            "t": "atext"
          },
          {
            "m": 10,
            "q": 2,
            "t": "abc"
          }
        ]
      },
      "dt": 1532494800000,
      "et": 1532494800000
    }
  ]
}

Since each JSON object starts with a variable key, how do I write a schema for this? Will Spark create a new schema object for every JSON document in the file, and wouldn't that be a performance bottleneck?
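One idea I had is to model the variable top level as a map from the key to an array of records and parse each line with from_json, instead of letting Spark infer a schema per key. This is only a sketch (it assumes one JSON object per line, which is how my file is laid out, and the field types are guessed from the sample above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.MapType;
import org.apache.spark.sql.types.StructType;

StructType adType = new StructType()
    .add("m", DataTypes.LongType)
    .add("q", DataTypes.LongType)
    .add("t", DataTypes.StringType);

StructType infoType = new StructType()
    .add("op", new StructType()
        .add("m", DataTypes.LongType)
        .add("q", DataTypes.LongType))
    .add("li", DataTypes.createArrayType(DataTypes.StringType))
    .add("ad", DataTypes.createArrayType(adType));

StructType recordType = new StructType()
    .add("id", DataTypes.StringType)
    .add("info", infoType)
    .add("dt", DataTypes.LongType)
    .add("et", DataTypes.LongType);

// The variable key ("123", ...) becomes map data instead of a column name.
MapType topLevelType = DataTypes.createMapType(
    DataTypes.StringType,
    DataTypes.createArrayType(recordType));

// Read each line as plain text and parse it with the explicit schema,
// so Spark never has to infer anything.
Dataset<Row> parsed = sparkSession.read().text(path)
    .select(functions.from_json(functions.col("value"), topLevelType).alias("data"))
    .select(functions.explode(functions.col("data"))); // columns "key" and "value"

After the explode, the variable key should end up in the key column and the array of record structs in the value column, if I understand from_json with a MapType correctly.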

The JSON is stored in the file unformatted, like this:

{"123":[{"id":"123","info":{"op":{"m":1,"q":2},"li":["a","b"],"ad":[{"m":1,"q":2,"t":"text"},{"m":1,"q":2,"t":"abc"}]},"dt":1532494800000,"et":1532494800000},{"id":"123","info":{"op":{"m":2,"q":1},"li":["a","b"],"ad":[{"m":2,"q":1,"t":"atext"},{"m":10,"q":2,"t":"abc"}]},"dt":1532494800000,"et":1532494800000}]}

Each JSON object is on its own line. This is what I have so far:

public JavaRDD<MyObject> parseRecordFile(String path) {
    JavaRDD<Row> jsonRdd = getJsonRdd(path);
    JavaRDD<MyObject> map = jsonRdd.map(JsonReader::parseJsonStructure);
    return map;
}

public void jsonSchemaSpark() {
    // Don't know what to put here.
}

private JavaRDD<Row> getJsonRdd(String path) {
    // The schema is currently inferred by Spark from the JSON data.
    Dataset<Row> jsonDS = sparkSession.read().format("json").load(path);
    return jsonDS.toJavaRDD();
}

private static MyObject parseJsonStructure(Row row) {
    log.info("Row starting");
    log.info("One row {}", row);
    log.info("Row end");
    return new MyObject();
}

Will each Row correspond to one JSON object per line?
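If I just let Spark infer the schema, my understanding (unverified) is that every distinct top-level key in the file becomes its own column, null on the lines that don't contain it, so parseJsonStructure would have to hunt for the non-null column. A sketch of what I mean:

private static MyObject parseJsonStructure(Row row) {
    // Each distinct top-level key ("123", ...) is a separate column in the
    // inferred schema; only this line's key should be non-null.
    for (String field : row.schema().fieldNames()) {
        int i = row.fieldIndex(field);
        if (!row.isNullAt(i)) {
            // The value is the array of record structs under that key.
            java.util.List<Row> records = row.getList(i);
            log.info("key={} recordCount={}", field, records.size());
        }
    }
    return new MyObject(); // real field mapping still to be written
}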

0 Answers:

There are no answers.