Is there a more elegant way to implement filter + map in Spark?

Time: 2015-04-28 13:01:40

Tags: scala apache-spark

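I want to use flatMap to implement filter() + map(), as in the code below. It takes three nested if statements to output a Tuple2; in every other case it outputs an empty Array[Tuple2].

Is there a more elegant way to implement this?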

 rddx.flatMap { case (arr: Array[String]) =>
          val url_parts = arr(1).split("/")
          if (url_parts.length > 7) {
            val pid = url_parts(4)
            val lid = url_parts(7).split("_")
            if (lid.length == 2) {
              val sid = lid(0)
              val eid = lid(1)
              if (eid.length > 0 && eid(0) == 'h') { // compare with a Char; eid(0) == "h" is always false
                Array((pid, 1))
              }
              else new Array[(String, Int)](0)
            }
            else Array((pid, 1))
          }
          else new Array[(String, Int)](0)
         }

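2 answers:

Answer 0 (score: 5)

You can use a for-comprehension. Of course, this desugars into a chain of flatMap, map and filter calls, but Spark groups them into a single stage anyway, so there is no performance penalty.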

for {
  arr <- rddx
  url_parts = arr(1).split("/")
  if url_parts.length > 7
  pid = url_parts(4)
  lid = url_parts(7).split("_")
  if lid.length == 2
  sid = lid(0)
  eid = lid(1)
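  if eid.length > 0 && eid(0) == 'h' // compare with a Char, not the String "h"
} yield (pid, 1) // yield the tuple itself so the result is an RDD[(String, Int)]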

Here is the output of toDebugString, showing that there is only one stage:

scala> res.toDebugString
res2: String = 
(8) MapPartitionsRDD[7] at map at <console>:24 []
 |  MapPartitionsRDD[6] at filter at <console>:24 []
 |  MapPartitionsRDD[5] at map at <console>:24 []
 |  MapPartitionsRDD[4] at filter at <console>:24 []
 |  MapPartitionsRDD[3] at map at <console>:24 []
 |  MapPartitionsRDD[2] at filter at <console>:24 []
 |  MapPartitionsRDD[1] at map at <console>:24 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
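
The alternating map/filter lineage above can be reproduced by hand. The following is a minimal sketch on a plain Scala List rather than an RDD, so that it runs without a Spark context. The sample rows and the names ForDesugarSketch, rows, viaFor and viaChain are assumptions made up for illustration, and the hand-written chain is only one way to get the same result, not necessarily the exact code the compiler generates.

```scala
object ForDesugarSketch {
  // Hypothetical sample rows; column 1 holds the URL, as in the question.
  val rows: List[Array[String]] = List(
    Array("r1", "a/b/c/d/pid1/e/f/sid_heid"),
    Array("r2", "a/b/c/d/pid2/e/f/sid_xeid"),
    Array("r3", "a/b/c")
  )

  // The for-comprehension form, yielding the tuple directly.
  val viaFor: List[(String, Int)] =
    for {
      arr <- rows
      urlParts = arr(1).split("/")
      if urlParts.length > 7
      pid = urlParts(4)
      lid = urlParts(7).split("_")
      if lid.length == 2
      eid = lid(1)
      if eid.nonEmpty && eid(0) == 'h'
    } yield (pid, 1)

  // A hand-written chain of map and filter calls with the same result.
  val viaChain: List[(String, Int)] =
    rows
      .map(arr => arr(1).split("/"))
      .filter(_.length > 7)
      .map(parts => (parts(4), parts(7).split("_")))
      .filter { case (_, lid) => lid.length == 2 }
      .filter { case (_, lid) => lid(1).nonEmpty && lid(1)(0) == 'h' }
      .map { case (pid, _) => (pid, 1) }
}
```

Both values contain only the pid of the first row, since the second row's eid does not start with 'h' and the third URL has too few segments.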

Answer 1 (score: 2)

"The right tool for the job." In this case, all of the parsing can be done with a regular expression:

val pidCapture = "[\\w]+/[\\w]+/[\\w]+/([\\w]+)/[\\w]+/[\\w]+/[^_]+_h[\\w]+.*".r
rdd.map(arr => arr(1)).collect { case pidCapture(pid) => (pid,1) }

An example in the REPL, starting from the URLs as plain strings:

val urls = List("one/two/three/pid1/four/five/six/sid_heid", "one/two/three/pid2/four/five/six/sid_noth", "one/two/three/pid3/four/five", "one/two/three/pid4/four/five/six/sid_heid/more")
val rdd = sc.parallelize(urls)
val regex = "[\\w]+/[\\w]+/[\\w]+/([\\w]+)/[\\w]+/[\\w]+/[^_]+_h[\\w]+.*".r
val pids = rdd.collect{ case regex(pid) => (pid,1)}
val result = pids.collect()
result: Array[(String, Int)] = Array((pid1,1), (pid4,1))
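
For comparison, note that the question's code also emits (pid, 1) when the eighth URL segment does not split into exactly two pieces, a case that both answers above filter out. If that branch is intentional, one option is to move the parsing into a small function returning an Option and pass it to flatMap. This is only a sketch on a plain List: ParseSketch, parse and the sample URLs are hypothetical names and data, not part of the original thread.

```scala
object ParseSketch {
  // Hypothetical helper: Some((pid, 1)) when the URL qualifies, None otherwise.
  def parse(url: String): Option[(String, Int)] = {
    val urlParts = url.split("/")
    if (urlParts.length <= 7) None
    else {
      val pid = urlParts(4)
      urlParts(7).split("_") match {
        // Exactly two pieces: keep the row only if the second starts with 'h'.
        case Array(_, eid) => if (eid.startsWith("h")) Some((pid, 1)) else None
        // Any other shape: the question's code still emits (pid, 1).
        case _ => Some((pid, 1))
      }
    }
  }

  // Made-up sample URLs covering each branch.
  val urls: List[String] = List(
    "a/b/c/d/pid1/e/f/sid_heid",
    "a/b/c/d/pid2/e/f/sid_xeid",
    "a/b/c/d/pid3/e/f/nounderscore",
    "a/b/c"
  )

  // flatMap drops the None results and unwraps the Some results.
  val result: List[(String, Int)] = urls.flatMap(parse)
}
```

On an RDD the equivalent call would presumably be rddx.flatMap(arr => ParseSketch.parse(arr(1))), assuming the URL sits in column 1 as in the question.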