Spark-Scala: mapping a JSON file to a Dataset of case classes without using all the JSON attributes

Date: 2018-11-14 13:43:22

Tags: json scala apache-spark

I am trying to create case classes so I can map each line of my JSON file and build an RDD from it. I only need some of the fields in the JSON file for the case classes, but I get the following error:

cannot resolve '`result`' due to data type mismatch: cannot cast ArrayType(StructType(StructField(hop,LongType,true), StructField(result,ArrayType(StructType(StructField(from,StringType,true), StructField(rtt,DoubleType,true), StructField(ttl,LongType,true)),true),true)),true) to ArrayType(StructType(StructField(hop,DecimalType(38,0),true), StructField(result,ArrayType(StructType(StructField(rtt,DoubleType,true)),true),true)),true);

A JSON line looks like this:

{"lts": 165, "size": 40, "from": "89.105.202.4", "dst_name": "192.5.5.241", "fw": 4790, "proto": "UDP", "af": 4, "msm_name": "Traceroute", "stored_timestamp": 1514768539, "prb_id": 4247, "result": [{"result": [{"rtt": 1.955, "ttl": 255, "from": "89.105.200.50", "size": 28}, {"rtt": 1.7, "ttl": 255, "from": "10.10.0.5", "size": 28}, {"rtt": 1.709, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 1}, {"result": [{"rtt": 7.543, "ttl": 254, "from": "185.147.12.31", "size": 28}, {"rtt": 3.103, "ttl": 254, "from": "185.147.12.31", "size": 28}, {"rtt": 3.172, "ttl": 254, "from": "185.147.12.0", "size": 28}], "hop": 2}, {"result": [{"rtt": 4.347, "ttl": 253, "from": "185.147.12.19", "size": 28}, {"rtt": 2.876, "ttl": 253, "from": "185.147.12.19", "size": 28}, {"rtt": 3.143, "ttl": 253, "from": "185.147.12.19", "size": 28}], "hop": 3}, {"result": [{"rtt": 3.655, "ttl": 61, "from": "160.242.100.88", "size": 28}, {"rtt": 3.678, "ttl": 61, "from": "160.242.100.88", "size": 28}, {"rtt": 15.568, "ttl": 61, "from": "160.242.100.88", "size": 28}], "hop": 4}, {"result": [{"rtt": 4.263, "ttl": 60, "from": "196.216.48.144", "size": 28}, {"rtt": 6.082, "ttl": 60, "from": "196.216.48.144", "size": 28}, {"rtt": 11.834, "ttl": 60, "from": "196.216.48.144", "size": 28}], "hop": 5}, {"result": [{"rtt": 7.802, "ttl": 249, "from": "193.239.116.112", "size": 28}, {"rtt": 7.691, "ttl": 249, "from": "193.239.116.112", "size": 28}, {"rtt": 7.711, "ttl": 249, "from": "193.239.116.112", "size": 28}], "hop": 6}, {"result": [{"rtt": 8.228, "ttl": 57, "from": "192.5.5.241", "size": 28}, {"rtt": 8.026, "ttl": 57, "from": "192.5.5.241", "size": 28}, {"rtt": 8.254, "ttl": 57, "from": "192.5.5.241", "size": 28}], "hop": 7}], "timestamp": 1514768409, "src_addr": "89.105.202.4", "paris_id": 9, "endtime": 1514768403, "type": "traceroute", "dst_addr": "192.5.5.241", "msm_id": 5004}

My code is as follows:

package tests

// imports
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object shell {

  case class Hop(
    hop:    BigInt,
    result: Seq[Signal])

  case class Signal(
    rtt: Double
  )

  case class Row(
    af:     String,
    from:   String,
    size:   String,
    result: Seq[Hop]
  )

  def main(args: Array[String]): Unit = {

    // create configuration
    val conf = new SparkConf().setAppName("my first rdd app").setMaster("local")

    // create spark context
    val sc = new SparkContext(conf)

    // find absolute path of json file
    val pathToTraceroutesExamples = getClass.getResource("/test/rttAnalysis_sample_0.json")

    // create spark session
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()
    import spark.implicits._

    // read json file
    val logData = spark.read.json(pathToTraceroutesExamples.getPath)

    // create a dataset of Row
    val datasetLogdata = logData.select("af", "from", "size", "result").as[Row]

    // count dataset elements
    val count = datasetLogdata.rdd.count()
    println(count)
  }
}
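The cast in the error message shows two differences between the schema Spark infers from the file and the schema the encoder derives from the case classes: hop is inferred as LongType while a BigInt field maps to DecimalType(38,0), and the nested result structs in the file carry from, rtt and ttl where Signal declares only rtt. Which of the two actually blocks the cast is not obvious from the message alone; here is a minimal diagnostic sketch to print both schemas, assuming the logData DataFrame and the Row case class from the code above:

import org.apache.spark.sql.Encoders

// Schema inferred by Spark from the JSON file
logData.printSchema()

// Schema the encoder expects for .as[Row] (Row here is the case class, not org.apache.spark.sql.Row)
Encoders.product[Row].schema.printTreeString()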

Question: how can I create an RDD containing a list of Row case class instances that keeps only the data I care about (since in my case the JSON objects contain a lot of unused data)?
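One possible direction (a minimal sketch, not a verified answer, assuming the logData DataFrame, the case classes above, and import spark.implicits._ already in scope): select can drop unused top-level columns, but the error suggests the encoder cannot drop fields of the structs nested inside the result array, so instead each Spark row could be mapped by hand, reading only the fields the case classes need and ignoring everything else:

import org.apache.spark.sql.{Row => SqlRow}  // alias to avoid clashing with the Row case class

// Build the case classes manually from each row; unused JSON attributes are simply never read.
val datasetLogdata = logData.map { r: SqlRow =>
  val hops = Option(r.getAs[Seq[SqlRow]]("result")).getOrElse(Seq.empty).map { h =>
    Hop(
      hop    = BigInt(h.getAs[Long]("hop")),
      result = Option(h.getAs[Seq[SqlRow]]("result")).getOrElse(Seq.empty)
                 .map(s => Signal(rtt = s.getAs[Double]("rtt")))
    )
  }
  Row(
    af     = r.getAs[Long]("af").toString,
    from   = r.getAs[String]("from"),
    size   = r.getAs[Long]("size").toString,
    result = hops
  )
}

// RDD of Row case class instances containing only the selected data
println(datasetLogdata.rdd.count())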

0 Answers:

No answers yet