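I want to use flatMap to implement filter() + map(), as in the code below. Three nested if statements decide whether a Tuple2 is emitted; otherwise an empty Array[Tuple2] is returned.
Is there a more elegant way to implement this?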
rddx.flatMap { case (arr: Array[String]) =>
  val url_parts = arr(1).split("/")
  if (url_parts.length > 7) {
    val pid = url_parts(4)
    val lid = url_parts(7).split("_")
    if (lid.length == 2) {
      val sid = lid(0)
      val eid = lid(1)
      // eid(0) is a Char, so compare against the Char literal 'h'
      if (eid.length > 0 && eid(0) == 'h') {
        Array((pid, 1))
      }
      else new Array[(String, Int)](0)
    }
    else Array((pid, 1))
  }
  else new Array[(String, Int)](0)
}
Answer 0 (score: 5)
You can use a for-comprehension. Of course, it desugars into a chain of flatMap, map, and filter calls, but Spark groups them into a single stage anyway, so there is no performance penalty.
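for {
  arr <- rddx
  url_parts = arr(1).split("/")
  if url_parts.length > 7
  pid = url_parts(4)
  lid = url_parts(7).split("_")
  if lid.length == 2
  sid = lid(0)
  eid = lid(1)
  // eid(0) is a Char, so compare against the Char literal 'h'
  if eid.length > 0 && eid(0) == 'h'
} yield
  // yield the tuple itself; wrapping it in an Array would produce an RDD of arrays
  (pid, 1)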
Here is the output of toDebugString, showing that there is only one stage:
scala> res.toDebugString
res2: String =
(8) MapPartitionsRDD[7] at map at <console>:24 []
| MapPartitionsRDD[6] at filter at <console>:24 []
| MapPartitionsRDD[5] at map at <console>:24 []
| MapPartitionsRDD[4] at filter at <console>:24 []
| MapPartitionsRDD[3] at map at <console>:24 []
| MapPartitionsRDD[2] at filter at <console>:24 []
| MapPartitionsRDD[1] at map at <console>:24 []
| ParallelCollectionRDD[0] at parallelize at <console>:21 []
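The desugaring described in this answer can be sketched on a plain Scala List, which exposes map, filter, and flatMap with the same shape as an RDD. This is only an illustration under stated assumptions: the sample rows and the names ForDesugarSketch, rows, viaFor, and viaChain are hypothetical, a List stands in for rddx so the code runs without a Spark context, and the hand-written chain shows only the rough shape of what the compiler generates, not its exact output.

```scala
object ForDesugarSketch {
  // Hypothetical sample rows; column 1 holds the URL, as in the question.
  val rows: List[Array[String]] = List(
    Array("r1", "a/b/c/d/pid1/e/f/sid_heid"), // passes all three conditions
    Array("r2", "a/b/c/d/pid2/e/f/sid_xeid"), // eid does not start with 'h'
    Array("r3", "a/b/c")                      // too few path segments
  )

  // for-comprehension form: yields one tuple per surviving row.
  val viaFor: List[(String, Int)] =
    for {
      arr <- rows
      urlParts = arr(1).split("/")
      if urlParts.length > 7
      pid = urlParts(4)
      lid = urlParts(7).split("_")
      if lid.length == 2
      eid = lid(1)
      if eid.nonEmpty && eid(0) == 'h'
    } yield (pid, 1)

  // Hand-written chain with the same alternating map/filter shape.
  val viaChain: List[(String, Int)] = rows
    .map(arr => arr(1).split("/"))
    .filter(urlParts => urlParts.length > 7)
    .map(urlParts => (urlParts(4), urlParts(7).split("_")))
    .filter { case (_, lid) => lid.length == 2 }
    .map { case (pid, lid) => (pid, lid(1)) }
    .filter { case (_, eid) => eid.nonEmpty && eid(0) == 'h' }
    .map { case (pid, _) => (pid, 1) }

  def main(args: Array[String]): Unit = {
    println(viaFor)
    println(viaChain)
  }
}
```

Both values contain only the pid1 entry. The alternating map and filter steps mirror the MapPartitionsRDD lines in the toDebugString output above.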
Answer 1 (score: 2)
"The right tool for the job." In this case, all of the parsing can be done with a regular expression:
val pidCapture = "[\\w]+/[\\w]+/[\\w]+/([\\w]+)/[\\w]+/[\\w]+/[^_]+_h[\\w]+.*".r
rdd.map(arr => arr(1)).collect { case pidCapture(pid) => (pid,1) }
An example in the REPL, starting from the URLs as plain strings:
val urls = List("one/two/three/pid1/four/five/six/sid_heid", "one/two/three/pid2/four/five/six/sid_noth", "one/two/three/pid3/four/five", "one/two/three/pid4/four/five/six/sid_heid/more")
val rdd = sc.parallelize(urls)
val regex = "[\\w]+/[\\w]+/[\\w]+/([\\w]+)/[\\w]+/[\\w]+/[^_]+_h[\\w]+.*".r
val pids = rdd.collect{ case regex(pid) => (pid,1)}
val result = pids.collect()
result: Array[(String, Int)] = Array((pid1,1), (pid4,1))
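To relate this back to the flatMap in the question: collect with a partial function acts as a filter and a map in one step, and the same effect can be had from flatMap by returning an Option instead of an empty or one-element Array. The sketch below assumes a plain List in place of the RDD so that it runs without a Spark context; the names CollectSketch, viaCollect, and viaFlatMap are hypothetical, and the regex is the one from this answer.

```scala
object CollectSketch {
  // Subset of the sample URLs used in the REPL example.
  val urls: List[String] = List(
    "one/two/three/pid1/four/five/six/sid_heid",
    "one/two/three/pid2/four/five/six/sid_noth",
    "one/two/three/pid3/four/five"
  )

  // A Regex used as an extractor only matches when the whole string matches.
  val regex = "[\\w]+/[\\w]+/[\\w]+/([\\w]+)/[\\w]+/[\\w]+/[^_]+_h[\\w]+.*".r

  // collect keeps only the inputs for which the partial function is defined.
  val viaCollect: List[(String, Int)] =
    urls.collect { case regex(pid) => (pid, 1) }

  // flatMap over Option: Some keeps the element, None drops it.
  val viaFlatMap: List[(String, Int)] =
    urls.flatMap {
      case regex(pid) => Some((pid, 1))
      case _          => None
    }

  def main(args: Array[String]): Unit = {
    println(viaCollect)
    println(viaFlatMap)
  }
}
```

Both expressions keep only the pid1 entry. Returning Some or None avoids allocating arrays by hand while keeping the single-pass flatMap structure.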