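Spark: unable to read multiline JSON (corrupt_record: string (nullable = true))

Date: 2018-08-19 20:34:37

Tags: json scala apache-spark

I am looking for advice on the problem in the title. I read in the Databricks documentation (https://docs.databricks.com/spark/latest/data-sources/read-json.html) that multiline JSON can be read into a DataFrame with the following expression: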

    println("2.2 Dataframe Multiline")
    // MULTILINE MODE!!
    val df2 = spark.read.option("multiline", "true").option("charset", "UTF-8").json("EXPORT1.json")
    df2.printSchema()

This does not work for me. If I manually remove all the line breaks from the JSON, I get the following result:

root
 |-- results: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- address_components: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- long_name: string (nullable = true)
 |    |    |    |    |-- short_name: string (nullable = true)
 |    |    |    |    |-- types: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |-- formatted_address: string (nullable = true)
 |    |    |-- geometry: struct (nullable = true)
 |    |    |    |-- bounds: struct (nullable = true)
 |    |    |    |    |-- northeast: struct (nullable = true)
 |    |    |    |    |    |-- lat: double (nullable = true)
 |    |    |    |    |    |-- lng: double (nullable = true)
 |    |    |    |    |-- southwest: struct (nullable = true)
 |    |    |    |    |    |-- lat: double (nullable = true)
 |    |    |    |    |    |-- lng: double (nullable = true)
 |    |    |    |-- location: struct (nullable = true)
 |    |    |    |    |-- lat: double (nullable = true)
 |    |    |    |    |-- lng: double (nullable = true)
 |    |    |    |-- location_type: string (nullable = true)
 |    |    |    |-- viewport: struct (nullable = true)
 |    |    |    |    |-- northeast: struct (nullable = true)
 |    |    |    |    |    |-- lat: double (nullable = true)
 |    |    |    |    |    |-- lng: double (nullable = true)
 |    |    |    |    |-- southwest: struct (nullable = true)
 |    |    |    |    |    |-- lat: double (nullable = true)
 |    |    |    |    |    |-- lng: double (nullable = true)
 |    |    |-- place_id: string (nullable = true)
 |    |    |-- types: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |-- status: string (nullable = true)

This is a sample of the JSON I downloaded from Google:

{
   "results" : [
      {
         "address_components" : [
            {
               "long_name" : "30152",
               "short_name" : "30152",
               "types" : [ "postal_code" ]
            },
            {
               "long_name" : "Murcia",
               "short_name" : "Murcia",
               "types" : [ "locality", "political" ]
            },
            {
               "long_name" : "Murcia",
               "short_name" : "MU",
               "types" : [ "administrative_area_level_2", "political" ]
            },
            {
               "long_name" : "Region of Murcia",
               "short_name" : "Region of Murcia",
               "types" : [ "administrative_area_level_1", "political" ]
            },
            {
               "long_name" : "Spain",
               "short_name" : "ES",
               "types" : [ "country", "political" ]
            }
         ],
         "formatted_address" : "30152 Murcia, Spain",
         "geometry" : {
            "bounds" : {
               "northeast" : {
                  "lat" : 37.9659196,
                  "lng" : -1.1346723
               },
               "southwest" : {
                  "lat" : 37.9442828,
                  "lng" : -1.1687921
               }
            },
            "location" : {
               "lat" : 37.9569734,
               "lng" : -1.1496969
            },
            "location_type" : "APPROXIMATE",
            "viewport" : {
               "northeast" : {
                  "lat" : 37.9659196,
                  "lng" : -1.1346723
               },
               "southwest" : {
                  "lat" : 37.9442828,
                  "lng" : -1.1687921
               }
            }
         },
         "place_id" : "ChIJZbDcb0Z_Yw0RUK0TPnKvAhw",
         "types" : [ "postal_code" ]
      }
   ],
   "status" : "OK"
}

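Since I want to send many requests to Google, I cannot remove the line breaks by hand.

Can anyone help me? Thanks in advance.

1 Answer:

Answer 0 (score: 0)

To work around the problem, I store the JSON with all line breaks (in fact, all whitespace) removed:

The following class takes the address, the components, the API key and a number used as the file name, sends the Geolocation request, and writes the response to a JSON file.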

import java.io.{BufferedWriter, File, FileWriter}
import scala.util.{Failure, Success}
import dispatch._, Defaults._

// Sends one Geocoding request and writes the response to <JSONName>_LatLon.json
class Geolocation(var Address: String, var Component: String, var APIKey: String, var JSONName: Int) {
  val GeoLocURL_REQ = "https://maps.googleapis.com/maps/api/geocode/json?address=" + Address + "&components=" + Component + "&key=" + APIKey
  val filename = JSONName.toString + "_LatLon.json"
  val file = new File(filename)
  val bw = new BufferedWriter(new FileWriter(file))
  val svc = url(GeoLocURL_REQ)
  val response: Future[String] = Http(svc OK as.String)

  response onComplete {
    case Success(content) =>
      println("worked!" + content)
      // Strips ALL whitespace, including spaces inside string values;
      // replacing only "\\n" also works
      bw.write(content.replaceAll("\\s", ""))
      //bw.write(content)
      bw.close()
    case Failure(t) =>
      println("failed:! " + t.getMessage)
      bw.close()  // release the file handle on failure too
  }
}

var APIKey = "TYPE YOUR OWN API HERE"
var PostalCode = 30152
var Localidad = "Murcia"
val Component = "postal_code=" + PostalCode + "%7Ccountry=ES"  // "|" = %7C
var Address = Localidad + "+" + PostalCode

val geolocation = new Geolocation(Address, Component, APIKey, PostalCode)
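
One caveat with `replaceAll("\\s", "")`: it also removes spaces inside string values, so "Region of Murcia" is stored as "RegionofMurcia". If that matters, a whitespace stripper that skips string literals could be used instead. The following is a minimal sketch, not part of the original solution; the names `JsonCompact` and `compact` are made up for illustration, and it assumes the input is already valid JSON.

```scala
object JsonCompact {
  // Removes whitespace that sits outside JSON string literals.
  // Everything between double quotes is copied through unchanged.
  def compact(json: String): String = {
    val out = new StringBuilder(json.length)
    var inString = false // true while we are inside a "..." literal
    var escaped = false  // true right after a backslash inside a literal
    for (c <- json) {
      if (inString) {
        out.append(c)
        if (escaped) escaped = false          // escaped char, e.g. \" stays in the literal
        else if (c == '\\') escaped = true
        else if (c == '"') inString = false   // closing quote
      } else if (c == '"') {
        inString = true                       // opening quote
        out.append(c)
      } else if (!c.isWhitespace) {
        out.append(c)                         // structural character or number
      }
    }
    out.toString
  }
}
```

In the class above this would replace the `replaceAll` call, i.e. `bw.write(JsonCompact.compact(content))`.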

Hope this helps someone!
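
If rewriting the files is not an option, another route that may be worth trying is to hand Spark each file as a single string, so that line breaks inside the document no longer matter. This is only a sketch under assumptions: it was not run against the data in the question, it needs Spark 2.x on the classpath, and `ReadWholeFileJson` and `readWholeFileJson` are hypothetical names. The `json(RDD[String])` overload used here has been deprecated since Spark 2.2 in favour of the `Dataset[String]` overload.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadWholeFileJson {
  // Parses every file under `path` as ONE JSON document each.
  def readWholeFileJson(spark: SparkSession, path: String): DataFrame = {
    // wholeTextFiles yields (fileName, fileContent) pairs; keep the content only
    val docs = spark.sparkContext.wholeTextFiles(path).values
    spark.read.json(docs)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multiline-json")
      .master("local[*]") // assumption: local run for testing
      .getOrCreate()

    val df = readWholeFileJson(spark, "EXPORT1.json")
    // With the default parse mode, a schema holding only _corrupt_record
    // means the document still could not be parsed
    df.printSchema()

    spark.stop()
  }
}
```

Because `wholeTextFiles` loads each file fully into memory, this suits many small responses like the one in the question better than a few very large files.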