I have a JSON file that looks like this:
{
  "123": [
    {
      "id": "123",
      "info": {
        "op": {
          "m": 1,
          "q": 2
        },
        "li": [
          "a",
          "b"
        ],
        "ad": [
          {
            "m": 1,
            "q": 2,
            "t": "text"
          },
          {
            "m": 1,
            "q": 2,
            "t": "abc"
          }
        ]
      },
      "dt": 1532494800000,
      "et": 1532494800000
    },
    {
      "id": "123",
      "info": {
        "op": {
          "m": 2,
          "q": 1
        },
        "li": [
          "a",
          "b"
        ],
        "ad": [
          {
            "m": 2,
            "q": 1,
            "t": "atext"
          },
          {
            "m": 10,
            "q": 2,
            "t": "abc"
          }
        ]
      },
      "dt": 1532494800000,
      "et": 1532494800000
    }
  ]
}
Since the JSON object starts with a variable key, how do I write a schema for it? Will Spark create a new schema object for every JSON record in the file? Wouldn't that be a performance bottleneck?
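One idea I had: since only the top-level key varies, it could be modeled as a MapType so the schema itself stays fixed. This is only a rough sketch under that assumption; the field types are guessed from the sample above, and it assumes a Spark version where from_json accepts a MapType (2.2+, I believe). I'm not sure this is the right approach:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.MapType;
import org.apache.spark.sql.types.StructType;

// Field names and types guessed from the sample JSON above.
StructType adEntry = new StructType()
        .add("m", DataTypes.LongType)
        .add("q", DataTypes.LongType)
        .add("t", DataTypes.StringType);

StructType info = new StructType()
        .add("op", new StructType()
                .add("m", DataTypes.LongType)
                .add("q", DataTypes.LongType))
        .add("li", DataTypes.createArrayType(DataTypes.StringType))
        .add("ad", DataTypes.createArrayType(adEntry));

StructType record = new StructType()
        .add("id", DataTypes.StringType)
        .add("info", info)
        .add("dt", DataTypes.LongType)
        .add("et", DataTypes.LongType);

// The variable top-level key becomes a map key instead of a column name.
MapType root = DataTypes.createMapType(
        DataTypes.StringType,
        DataTypes.createArrayType(record));

// Read each line as plain text and parse it against the fixed schema,
// so Spark never has to infer anything per record.
Dataset<Row> parsed = sparkSession.read().text(path)
        .select(from_json(col("value"), root).alias("data"));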
In the file, the JSON is stored unformatted, on a single line:
{"123":[{"id":"123","info":{"op":{"m":1,"q":2},"li":["a","b"],"ad":[{"m":1,"q":2,"t":"text"},{"m":1,"q":2,"t":"abc"}]},"dt":1532494800000,"et":1532494800000},{"id":"123","info":{"op":{"m":2,"q":1},"li":["a","b"],"ad":[{"m":2,"q":1,"t":"atext"},{"m":10,"q":2,"t":"abc"}]},"dt":1532494800000,"et":1532494800000}]}
Each line contains one JSON object. Here is what I have so far:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public JavaRDD<MyObject> parseRecordFile(String path) {
    JavaRDD<Row> jsonRdd = getJsonRdd(path);
    JavaRDD<MyObject> map = jsonRdd.map(JsonReader::parseJsonStructure);
    return map;
}

public void jsonSchemaSpark() {
    // Don't know what to put here.
}

private JavaRDD<Row> getJsonRdd(String path) {
    // Schema is inferred here; I would rather supply it explicitly.
    Dataset<Row> jsonDS = sparkSession.read().format("json").load(path);
    return jsonDS.toJavaRDD();
}

private static MyObject parseJsonStructure(Row row) {
    log.info("Row starting");
    log.info("One row {}", row);
    log.info("Row end");
    return new MyObject(); // TODO: map the Row fields onto MyObject
}
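Eventually parseJsonStructure needs to map a Row onto MyObject. My rough idea, assuming each Row ends up holding one record with columns id, info, dt and et (the MyObject constructor here is just a placeholder):

import java.util.List;
import org.apache.spark.sql.Row;

private static MyObject parseJsonStructure(Row row) {
    // Assumes the Row has already been flattened down to one record.
    String id = row.getAs("id");
    long dt = row.getLong(row.fieldIndex("dt"));
    long et = row.getLong(row.fieldIndex("et"));
    Row info = row.getStruct(row.fieldIndex("info"));
    List<String> li = info.getList(info.fieldIndex("li"));
    return new MyObject(); // TODO: populate from id, dt, et, info, li
}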
Will each Row in the RDD correspond to one JSON object (one line) from the file?