I am trying to parse JSON files into CSV files.
The structure is a bit complex, and I have written a Spark program in Scala to do the job. Since the documents do not contain one JSON object per line, I decided to use the wholeTextFiles method, as suggested in some answers and posts I found.
val jsonRDD = spark.sparkContext.wholeTextFiles(fileInPath).map(x => x._2)
Then I read the JSON content into a DataFrame:
val dwdJson = spark.read.json(jsonRDD)
Then I want to navigate through the JSON and flatten the data. This is the schema of dwdJson:
root
|-- meta: struct (nullable = true)
| |-- dimensions: struct (nullable = true)
| | |-- lat: long (nullable = true)
| | |-- lon: long (nullable = true)
| |-- directory: string (nullable = true)
| |-- filename: string (nullable = true)
|-- records: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- grids: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- gPt: array (nullable = true)
| | | | | |-- element: double (containsNull = true)
| | |-- time: string (nullable = true)
This is my best approach so far:
val dwdJson_e1 = dwdJson.select($"meta.filename", explode($"records").as("records_flat"))
val dwdJson_e2 = dwdJson_e1.select($"filename", $"records_flat.time",explode($"records_flat.grids").as("gPt"))
val dwdJson_e3 = dwdJson_e2.select($"filename", $"time", $"gPt.gPt")
val dwdJson_flat = dwdJson_e3.select($"filename"
,$"time"
,$"gPt".getItem(0).as("lat1")
,$"gPt".getItem(1).as("long1")
,$"gPt".getItem(2).as("lat2")
,$"gPt".getItem(3).as("long2")
,$"gPt".getItem(4).as("value"))
I am new to Scala, and I would like to know whether I can avoid creating the intermediate DataFrames (dwdJson_e1, dwdJson_e2, dwdJson_e3), which seems inefficient, since the program runs very slowly (compared with a Java parser running on my laptop).
On the other hand, I could not find a way to unwind these nested arrays.
Spark version: 2.0.0, Scala: 2.11.8, Java: 1.8
This is a sample JSON file I want to convert:
{
  "meta" : {
    "directory" : "weather/cosmo/de/grib/12/aswdir_s",
    "filename" : "COSMODE_single_level_elements_ASWDIR_S_2018022312_000.grib2.bz2",
    "dimensions" : {
      "lon" : 589,
      "time" : 3,
      "lat" : 441
    }
  },
  "records" : [ {
    "grids" : [ {
      "gPt" : [ 45.175, 13.55, 45.2, 13.575, 3.366295E-7 ]
    }, {
      "gPt" : [ 45.175, 13.575, 45.2, 13.6, 3.366295E-7 ]
    }, {
      "gPt" : [ 45.175, 13.6, 45.2, 13.625, 3.366295E-7 ]
    } ],
    "time" : "2018-02-23T12:15:00Z"
  }, {
    "grids" : [ {
      "gPt" : [ 45.175, 13.55, 45.2, 13.575, 4.545918E-7 ]
    }, {
      "gPt" : [ 45.175, 13.575, 45.2, 13.6, 4.545918E-7 ]
    }, {
      "gPt" : [ 45.175, 13.6, 45.2, 13.625, 4.545918E-7 ]
    } ],
    "time" : "2018-02-23T12:30:00Z"
  } ]
}
This is the expected output for the JSON above:
filename, time, lat1, long1, lat2, long2, value
ASWDIR_S_...,2018-02-23T12:15:00Z,45.175,13.55,45.2,13.575,3.366295E-7
ASWDIR_S_...,2018-02-23T12:15:00Z,45.175,13.575,45.2,13.6,3.366295E-7
ASWDIR_S_...,2018-02-23T12:15:00Z,45.175,13.6,45.2,13.625,3.366295E-7
ASWDIR_S_...,2018-02-23T12:30:00Z,45.175,13.55,45.2,13.575,4.545918E-7
ASWDIR_S_...,2018-02-23T12:30:00Z,45.175,13.575,45.2,13.6,4.545918E-7
ASWDIR_S_...,2018-02-23T12:30:00Z,45.175,13.6,45.2,13.625,4.545918E-7
Any help would be appreciated. Kind regards,
Answer 0 (score: 1)
I think your approach is completely correct.
Regarding avoiding the creation of the intermediate DataFrames, you can actually chain the statements without binding each step to an intermediate DataFrame, for example:
// Chain the selects directly instead of binding each step to an intermediate DataFrame
val df = dwdJson.
  select($"meta.filename", explode($"records").as("record")).
  select($"filename", $"record.time", explode($"record.grids").as("grids")).
  select($"filename", $"time", $"grids.gPt").
  select($"filename", $"time",
    $"gPt"(0).as("lat1"),
    $"gPt"(1).as("long1"),
    $"gPt"(2).as("lat2"),
    $"gPt"(3).as("long2"),
    $"gPt"(4).as("value"))
I also have some thoughts on the performance issue.
Spark internally uses the Jackson library to parse JSON, and it has to infer the schema itself by sampling the input records (the default sampling ratio is 1.0, i.e. all records). So if you have a large input, big files (the wholeTextFiles operation), and a complex schema, it will hurt the performance of the Spark program.
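If schema inference is the bottleneck, one option is to pass an explicit schema when reading, so Spark skips the sampling pass. A rough sketch (the fields below simply mirror the schema printed in the question):

import org.apache.spark.sql.types._

// Explicit schema matching the structure printed above, so spark.read
// does not have to sample the input to infer it
val dwdSchema = StructType(Seq(
  StructField("meta", StructType(Seq(
    StructField("dimensions", StructType(Seq(
      StructField("lat", LongType),
      StructField("lon", LongType)
    ))),
    StructField("directory", StringType),
    StructField("filename", StringType)
  ))),
  StructField("records", ArrayType(StructType(Seq(
    StructField("grids", ArrayType(StructType(Seq(
      StructField("gPt", ArrayType(DoubleType))
    )))),
    StructField("time", StringType)
  ))))
))

val dwdJson = spark.read.schema(dwdSchema).json(jsonRDD)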
Answer 1 (score: 1)
You can try the code below. It worked for me on complex JSON documents:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, StructType}

// Recursively flattens a DataFrame: explodes every ArrayType column and
// promotes every StructType field to a top-level column (dots become underscores).
// Note: explode_outer (null-preserving explode) requires Spark 2.2+.
def flattenDataframe(df: DataFrame): DataFrame = {
  val fields = df.schema.fields
  val fieldNames = fields.map(_.name)

  for (i <- fields.indices) {
    val field = fields(i)
    val fieldName = field.name
    field.dataType match {
      case _: ArrayType =>
        // Explode the array column and recurse on the result
        val fieldNamesExcludingArray = fieldNames.filter(_ != fieldName)
        val fieldNamesAndExplode = fieldNamesExcludingArray ++ Array(s"explode_outer($fieldName) as $fieldName")
        val explodedDf = df.selectExpr(fieldNamesAndExplode: _*)
        return flattenDataframe(explodedDf)
      case structType: StructType =>
        // Replace the struct column with its child columns and recurse
        val childFieldNames = structType.fieldNames.map(childName => fieldName + "." + childName)
        val newFieldNames = fieldNames.filter(_ != fieldName) ++ childFieldNames
        val renamedCols = newFieldNames.map(x => col(x).as(x.replace(".", "_")))
        val flattenedDf = df.select(renamedCols: _*)
        return flattenDataframe(flattenedDf)
      case _ => // leave primitive columns as they are
    }
  }
  df
}
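As a quick usage sketch (assuming the dwdJson DataFrame from the question): note that because the function explodes every array, the five gPt values come out as one value per row rather than as the lat1/long1/lat2/long2/value columns, so the positional split from the question would still be a separate step.

// Flatten the nested structure; struct fields become underscore-separated
// columns (e.g. meta_filename) and every array is exploded row-wise
val flatDf = flattenDataframe(dwdJson)
flatDf.printSchema()
flatDf.show(false)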