I have the following three case classes:
case class Result(
  result: Seq[Signal],
  hop: Int)
case class Signal(
  rtt: Double,
  from: String)
case class Traceroute(
  dst_name: String,
  from: String,
  prb_id: BigInt,
  msm_id: BigInt,
  timestamp: BigInt,
  result: Seq[Result])
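A Traceroute has a field result, which is a sequence of Result. Each Result in turn holds a sequence of Signal.
I am trying to check whether the rtt of the Signal values inside each Result is negative.
My JSON records look like this: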
{"prb_id": 4247, "result": [{"result": [{"rtt": 1.955, "ttl": 255, "from": "89.105.200.57", "size": 28}, {"rtt": 1.7, "ttl": 255, "from": "10.10.0.5", "size": 28}, {"rtt": 1.709, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 1}]}
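For clarity, I have omitted some attributes from the JSON record. The result attribute corresponds to the result field of the Traceroute case class.
I used a filter to check whether the rtt of a Signal is negative, but I do not get the result I expect.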
val checkrtts = checkError.filter(x => x.result.foreach(p => p.result.foreach(f => checkSignal(f))))
The checkSignal function is as follows:
def checkSignal(signal: Signal): Signal = {
  if (signal.rtt > 0) {
    return signal
  } else {
    return null
  }
}
Here is an example with two Traceroute instances:
{"timestamp": 1514768409, "result": [{"result": [{"rtt": 1.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 1}]}
{"timestamp": 1514768402, "result": [{"result": [{"rtt": -2.5, "ttl": 255, "from": "89.105.200.57", "size": 28},{"rtt": 19.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 2}]}
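The first Traceroute should be left unchanged. In the second Traceroute, the result.result field has two elements (of type Signal). The first Signal has a negative rtt, so it should be removed from result.result. The second Signal should be kept.
The output should therefore be: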
{"timestamp": 1514768409, "result": [{"result": [{"rtt": 1.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 1}]}
{"timestamp": 1514768402, "result": [{"result": [{"rtt": 19.955, "ttl": 255, "from": "89.105.200.57", "size": 28}], "hop": 2}]}
Any help is appreciated. I am new to Spark and Scala. I have tried many approaches, but the results never match what I expect.
Answer 0 (score: 0)
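You seem to have misunderstood what filter does. It drops every whole Traceroute object for which the predicate returns false; it cannot change the contents of the objects it keeps. Note also that the predicate in your snippet does not return a Boolean at all, because foreach returns Unit. What you need is a map function that transforms each original Traceroute object into the desired one. Below is an example of how to do this for a Dataset[Traceroute].
First, you need to modify the case classes slightly, as follows.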
case class Result(var result: Seq[Signal],
                  hop: Int)
case class Signal(rtt: Double,
                  from: String)
case class Traceroute(dst_name: String,
                      from: String,
                      prb_id: BigInt,
                      msm_id: BigInt,
                      timestamp: BigInt,
                      result: Seq[Result])
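As you can see, I have added var to the result field of the Result class. This lets us reassign the result field later, inside the custom function that we pass to the map operation.
Then define the following two functions: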
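To make the effect of var concrete, here is a minimal plain-Scala sketch (no Spark needed). The object name VarFieldDemo and the sample values are only illustrative. It shows that a case class field declared with var can be reassigned in place, which would not compile for a plain (val) field.

```scala
// Miniature copies of the answer's classes, for illustration only.
case class Signal(rtt: Double, from: String)
case class Result(var result: Seq[Signal], hop: Int)

object VarFieldDemo {
  // Reassigns the `result` field in place, keeping signals with rtt > 0.
  // The assignment compiles only because `result` is declared with `var`.
  def stripInPlace(r: Result): Result = {
    r.result = r.result.filter(_.rtt > 0)
    r
  }

  def main(args: Array[String]): Unit = {
    val r = Result(
      Seq(Signal(-2.5, "89.105.200.57"), Signal(19.955, "89.105.200.57")),
      hop = 2)
    stripInPlace(r)
    println(r.result.size) // one Signal is left
  }
}
```

Keep in mind that this mutates the object it receives, so any other reference to the same Result sees the change as well.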
// Returns true for signals that should be kept (rtt strictly greater than 0).
def checkSignal(signal: Signal): Boolean = signal.rtt > 0

def removeNegative(traceroute: Traceroute): Traceroute = {
  val outerList = traceroute.result
  for (temp <- outerList) {
    val innerList = temp.result
    // Keep only the signals that pass checkSignal (rtt > 0).
    val newInnerList = innerList.filter(checkSignal(_))
    // Reassign the filtered list to result; this works because result is a var.
    temp.result = newInnerList
  }
  traceroute
}
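If you would rather leave the case classes immutable, the same transformation can be sketched with copy instead of a var field. This is an alternative to the functions above, not a replacement the answer requires; the name removeNegativeImmutable is made up for this sketch, and the case classes are the original ones from the question (without var).

```scala
case class Signal(rtt: Double, from: String)
case class Result(result: Seq[Signal], hop: Int)
case class Traceroute(dst_name: String,
                      from: String,
                      prb_id: BigInt,
                      msm_id: BigInt,
                      timestamp: BigInt,
                      result: Seq[Result])

object ImmutableVariant {
  // Hypothetical helper: builds a new Traceroute whose inner Signal lists
  // contain only entries with rtt > 0. The input object is left untouched.
  def removeNegativeImmutable(t: Traceroute): Traceroute =
    t.copy(result = t.result.map { r =>
      r.copy(result = r.result.filter(_.rtt > 0))
    })

  def main(args: Array[String]): Unit = {
    val t = Traceroute(null, null, null, null, BigInt(1514768402),
      Seq(Result(
        Seq(Signal(-2.5, "89.105.200.57"), Signal(19.955, "89.105.200.57")),
        hop = 2)))
    println(removeNegativeImmutable(t).result)
  }
}
```

With a Dataset[Traceroute] it would be applied the same way as removeNegative, for example dataset.map(x => ImmutableVariant.removeNegativeImmutable(x)).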
Now we map the original dataset with removeNegative, which yields a new dataset whose inner lists have been filtered.
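import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
// Assumes a SparkSession named spark is in scope (as in spark-shell);
// the implicits provide the encoders used by .as[Traceroute] and map.
import spark.implicits._

val dataPath = "hdfs/path/to/traceroute.json"
// Derive the schema from the case class; attributes absent from the JSON are read as null.
val tracerouteSchema = ScalaReflection.schemaFor[Traceroute].dataType.asInstanceOf[StructType]
val dataset = spark.read.schema(tracerouteSchema).json(dataPath).as[Traceroute]
println("Showing 10 rows of original dataset")
dataset.show(10, truncate = false)
val maprtts = dataset.map(x => removeNegative(x))
println("Showing 10 rows of transformed dataset")
maprtts.show(10, truncate = false)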
The output is as follows:
Showing 10 rows of original dataset
+--------+----+------+------+----------+-------------------------------------------------------+
|dst_name|from|prb_id|msm_id|timestamp |result |
+--------+----+------+------+----------+-------------------------------------------------------+
|null |null|null |null |1514768409|[[[[1.955, 89.105.200.57]], 1]] |
|null |null|null |null |1514768402|[[[[-2.5, 89.105.200.57], [19.955, 89.105.200.57]], 2]]|
+--------+----+------+------+----------+-------------------------------------------------------+
Showing 10 rows of transformed dataset
+--------+----+------+------+----------+--------------------------------+
|dst_name|from|prb_id|msm_id|timestamp |result |
+--------+----+------+------+----------+--------------------------------+
|null |null|null |null |1514768409|[[[[1.955, 89.105.200.57]], 1]] |
|null |null|null |null |1514768402|[[[[19.955, 89.105.200.57]], 2]]|
+--------+----+------+------+----------+--------------------------------+
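To make the filter versus map distinction from the start of this answer concrete, here is a small sketch on ordinary Scala collections, which behave the same way as a Dataset in this respect. The function names dropRowsWithNegative and stripNegative are invented for the illustration, and the data is taken from the two example records in the question.

```scala
case class Signal(rtt: Double, from: String)
case class Result(result: Seq[Signal], hop: Int)

object FilterVsMap {
  // filter decides per row: a row is either kept whole or dropped whole.
  def dropRowsWithNegative(rows: Seq[Result]): Seq[Result] =
    rows.filter(r => r.result.forall(_.rtt > 0))

  // map keeps every row and rewrites its contents.
  def stripNegative(rows: Seq[Result]): Seq[Result] =
    rows.map(r => r.copy(result = r.result.filter(_.rtt > 0)))

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Result(Seq(Signal(1.955, "89.105.200.57")), hop = 1),
      Result(Seq(Signal(-2.5, "89.105.200.57"),
                 Signal(19.955, "89.105.200.57")), hop = 2))

    // Only the hop-1 row survives; the hop-2 row is lost entirely.
    println(dropRowsWithNegative(rows).map(_.hop))
    // Both rows survive; the hop-2 row keeps only its positive Signal.
    println(stripNegative(rows).map(r => (r.hop, r.result.size)))
  }
}
```

The second function matches the behaviour asked for in the question, which is why the answer uses map rather than filter.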