I am working with a JSON file in a Spark DataFrame. I am trying to parse a file containing the JSON strings below:
{"id":"00010005","time_value":864359000,"speed":1079,"acceleration":19,"la":36.1433530,"lo":-11.51577690}
{"id":"00010005","time_value":864360000,"speed":1176,"acceleration":10,"la":36.1432660,"lo":-11.51578220}
{"id":"00010005","time_value":864361000,"speed":1175,"acceleration":,"la":36.1431730,"lo":-11.51578840}
{"id":"00010005","time_value":864362000,"speed":1174,"acceleration":,"la":36.1430780,"lo":-11.51579410}
{"id":"00010005","time_value":864363000,"speed":1285,"acceleration":11,"la":36.1429890,"lo":-11.51580110}
The acceleration field sometimes contains no value. Spark marks the JSON records that are missing the acceleration value as _corrupt_record:
val df = sqlContext.read.json(data)
scala> df.show(20)
+--------------------+------------+--------+---------+-----------+-----+----------+
| _corrupt_record|acceleration| id| la| lo|speed|time_value|
+--------------------+------------+--------+---------+-----------+-----+----------+
| null| -1|00010005|36.143418|-11.5157712| 887| 864358000|
| null| 19|00010005|36.143353|-11.5157769| 1079| 864359000|
| null| 10|00010005|36.143266|-11.5157822| 1176| 864360000|
|{"id":"00010005",...| null| null| null| null| null| null|
|{"id":"00010005",...| null| null| null| null| null| null|
I don't want to drop these records. What is the right way to read such JSON records?
I tried the code below, which substitutes a value of 0 for the missing "acceleration". But this is not a generic solution for handling cases where any field's value might be missing.
// Keep only the corrupt rows, i.e. those where _corrupt_record is not null.
val df1 = df.select("_corrupt_record").na.drop()
// Patch the raw JSON text: insert a 0 after the empty "acceleration": field.
val stripRdd = df1.rdd.map(x => x.getString(0)).map(x => x.replace(""""acceleration":""", """"acceleration":0"""))
// Re-parse the repaired JSON lines.
val newDf = sqlContext.read.json(stripRdd)
// Take the rows that parsed cleanly the first time and append the repaired ones.
val trimDf = df.drop("_corrupt_record").na.drop
val finalDf = trimDf.unionAll(newDf)
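A slightly more generic variant of this text patch, sketched below, repairs any empty value rather than just acceleration (assumptions: the input path is hypothetical, sc is the SparkContext, and an empty value after a colon is the only kind of corruption):
// Insert null wherever a value is missing, i.e. a colon is followed
// directly by a comma or a closing brace, then re-parse the lines.
// Note: a string value containing ":," would also be rewritten.
val rawRdd = sc.textFile("/path/data.json") // hypothetical path
val repairedRdd = rawRdd.map(_.replaceAll(""":\s*([,}])""", ":null$1"))
val repairedDf = sqlContext.read.json(repairedRdd)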
Answer 0 (score: 0):
This is easy to do if you have a schema ready for the records. Say the schema is called SpeedRecord, with the fields: acceleration, id, la, lo, speed, time_value.
import org.apache.spark.sql.Encoders

// acceleration is Option[Int] so a missing value becomes null instead of failing;
// id is a String to preserve leading zeros such as "00010005".
case class SpeedRecord(acceleration: Option[Int], id: String, la: Double, lo: Double, speed: Int, time_value: Long)
// Encoders.product derives the schema of a Scala case class (Encoders.bean is for Java beans).
val schema = Encoders.product[SpeedRecord].schema
val speedRecord = spark.read.schema(schema).json("/path/data.json")
speedRecord.show()
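If the null accelerations should become 0 afterwards, na.fill can do that (a sketch; 0 is just one possible default):
// Replace null values in the acceleration column with 0.
val filled = speedRecord.na.fill(0L, Seq("acceleration"))
filled.show()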