Spark Scala - checking fields of nested case classes

Time: 2018-11-17 08:38:20

Tags: json scala apache-spark

I have the following three case classes:

case class Result(
   result: Seq[Signal],
   hop:    Int)

case class Signal(
   rtt:  Double,
   from: String)

case class Traceroute(
  dst_name:  String,
  from:      String,
  prb_id:    BigInt,
  msm_id:    BigInt,
  timestamp: BigInt,
  result:    Seq[Result])

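A Traceroute has a field result, which is a sequence of Result. Each Result in turn holds a sequence of Signal.

I am trying to check whether the rtt of any Signal inside a Result is negative. My JSON records look like this: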

{"prb_id": 4247, "result": [{"result": [{"rtt": 1.955, "ttl": 255, "from": "89.105.200.57", "size": 28}, {"rtt": 1.7, "ttl": 255, "from": "10.10.0.5", "size": 28}, {"rtt": 1.709, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 1}]}

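For clarity, I have omitted some attributes from the JSON record. The outer result attribute corresponds to the result field of the Traceroute case class.

I used a filter to check whether the rtt of each Signal is negative, but I do not get the result I expect.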

val checkrtts = checkError.filter(x => x.result.foreach(p => p.result.foreach(f => checkSignal(f))))

The checkSignal function is as follows:

def checkSignal(signal: Signal): Signal = {
  if (signal.rtt > 0) {
    return signal
  } else {
    return null
  }

}

Here are two example Traceroute instances:

{"timestamp": 1514768409, "result": [{"result": [{"rtt": 1.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 1}]}
{"timestamp": 1514768402, "result": [{"result": [{"rtt": -2.5, "ttl": 255, "from": "89.105.200.57", "size": 28},{"rtt": 19.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 2}]}

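For the first Traceroute, nothing should change. For the second Traceroute, the result.result field has two elements (of type Signal). The first Signal has a negative rtt, so it should be removed from result.result. The second Signal must not be removed.

The output should therefore be: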

{"timestamp": 1514768409, "result": [{"result": [{"rtt": 1.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 1}]}
{"timestamp": 1514768402, "result": [{"result": [{"rtt": 19.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 2}]}

Any help is appreciated. I am new to Spark and Scala. I have tried many approaches, but the results do not match what I expect.

1 Answer:

Answer 0: (score: 0)

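You seem to have misunderstood what the filter function does. It takes a predicate and drops every whole Traceroute object for which that predicate returns false. It cannot change the contents of the objects it keeps. What you need is a map function that transforms each original Traceroute object into the desired one. The steps below show how to do this for a Dataset[Traceroute].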
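To make the difference concrete, here is a minimal sketch on plain Scala collections, with no Spark involved. The object name FilterVsMap, the trimmed-down case classes, and the use of forall and copy are illustrative assumptions rather than part of the question. Dataset.filter and Dataset.map treat whole rows in the same way.

```scala
object FilterVsMap {
  // Trimmed-down copies of the question's case classes (illustrative only).
  final case class Signal(rtt: Double, from: String)
  final case class Result(result: Seq[Signal], hop: Int)

  val hops: Seq[Result] = Seq(
    Result(Seq(Signal(1.955, "89.105.200.57")), hop = 1),
    Result(Seq(Signal(-2.5, "89.105.200.57"), Signal(19.955, "89.105.200.57")), hop = 2)
  )

  // filter: the predicate decides whether an entire Result is kept or dropped.
  // The second Result contains one negative rtt, so it disappears completely.
  val filtered: Seq[Result] = hops.filter(r => r.result.forall(_.rtt > 0))

  // map: every Result is kept, but its inner sequence is rewritten
  // so that only signals with a positive rtt remain.
  val mapped: Seq[Result] = hops.map(r => r.copy(result = r.result.filter(_.rtt > 0)))
}
```

With filter, the second hop is lost altogether. With map, both hops survive and only the offending Signal is removed, which is the behaviour the question asks for.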

First, you need to modify the case classes slightly, as shown below.

case class Result(var result: Seq[Signal],
                   hop:    Int)

case class Signal(rtt:  Double,
                   from: String)

case class Traceroute( dst_name:  String,
                       from:      String,
                       prb_id:    BigInt,
                       msm_id:    BigInt,
                       timestamp: BigInt,
                       result:    Seq[Result])

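As you can see, I have added var to the result field of the Result class. This lets us reassign the result field inside the custom function that we later pass to the map operation.

Then define the following two functions: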

// true only when the signal's rtt is strictly positive
def checkSignal(signal: Signal): Boolean = signal.rtt > 0

def removeNegative(traceroute: Traceroute): Traceroute = {
  for (hop <- traceroute.result) {
    // keep only the signals accepted by checkSignal (rtt > 0)
    // and reassign the filtered list to the mutable result field
    hop.result = hop.result.filter(checkSignal)
  }

  traceroute
}
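If you would rather not make the field a var, the same transformation can probably be written without mutation by using the copy method that every case class provides. The following is only a sketch under that assumption: the wrapper object ImmutableVariant and the helper name hasPositiveRtt are hypothetical, and the case classes are redefined inside the object purely to keep the example self-contained.

```scala
object ImmutableVariant {
  // Same shapes as the original case classes, but without `var`.
  final case class Signal(rtt: Double, from: String)
  final case class Result(result: Seq[Signal], hop: Int)
  final case class Traceroute(
    dst_name:  String,
    from:      String,
    prb_id:    BigInt,
    msm_id:    BigInt,
    timestamp: BigInt,
    result:    Seq[Result])

  // Same predicate as checkSignal: keep strictly positive rtt values.
  def hasPositiveRtt(signal: Signal): Boolean = signal.rtt > 0

  // Builds a new Traceroute instead of modifying the one passed in.
  def removeNegative(traceroute: Traceroute): Traceroute =
    traceroute.copy(result = traceroute.result.map { hop =>
      hop.copy(result = hop.result.filter(hasPositiveRtt))
    })
}
```

Because the input is left untouched, this version is easier to reason about and to test in isolation. The body of removeNegative should be usable in the same map call as the mutable version.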

Now map the original dataset with removeNegative to get a new dataset in which the inner lists are filtered correctly.

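import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import spark.implicits._ // encoders for .as[Traceroute]; assumes a SparkSession named spark

val dataPath = "hdfs/path/to/traceroute.json"
val tracerouteSchema = ScalaReflection.schemaFor[Traceroute].dataType.asInstanceOf[StructType]
val dataset = spark.read.schema(tracerouteSchema).json(dataPath).as[Traceroute]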

println("Showing 10 rows of original dataset")
dataset.show(10, truncate = false)

val maprtts = dataset.map(x => removeNegative(x))


println("Showing 10 rows of transformed dataset")
maprtts.show(10, truncate = false)

The output is as follows:

Showing 10 rows of original dataset
+--------+----+------+------+----------+-------------------------------------------------------+
|dst_name|from|prb_id|msm_id|timestamp |result                                                 |
+--------+----+------+------+----------+-------------------------------------------------------+
|null    |null|null  |null  |1514768409|[[[[1.955, 89.105.200.57]], 1]]                        |
|null    |null|null  |null  |1514768402|[[[[-2.5, 89.105.200.57], [19.955, 89.105.200.57]], 2]]|
+--------+----+------+------+----------+-------------------------------------------------------+

Showing 10 rows of transformed dataset
+--------+----+------+------+----------+--------------------------------+
|dst_name|from|prb_id|msm_id|timestamp |result                          |
+--------+----+------+------+----------+--------------------------------+
|null    |null|null  |null  |1514768409|[[[[1.955, 89.105.200.57]], 1]] |
|null    |null|null  |null  |1514768402|[[[[19.955, 89.105.200.57]], 2]]|
+--------+----+------+------+----------+--------------------------------+