How to check for a specific value in all maps of an RDD[Map[Int,String]] at a stretch using Scala?

Time: 2014-11-28 05:04:50

Tags: scala collections mapreduce

I want to check for a specific value in all the maps of an RDD[Map[Int,String]] at a stretch using Scala. My CSV file is:

Map(0 -> sunny, 1 -> hot, 2 -> high, 3 -> false, 4 -> no)
Map(0 -> sunny, 1 -> hot, 2 -> high, 3 -> true, 4 -> no)
Map(0 -> overcast, 1 -> hot, 2 -> high, 3 -> false, 4 -> yes)
Map(0 -> rainy, 1 -> mild, 2 -> high, 3 -> false, 4 -> yes)
Map(0 -> rainy, 1 -> cool, 2 -> normal, 3 -> false, 4 -> yes)

Here, I want to check the last value of every map (no, no, yes, yes, yes) against a specific value (yes/no) in a single pass.

1 answer:

Answer 0: (score: 2)

scala> val a = List(Map(0 -> "sunny", 1 -> "hot", 2 -> "high", 3 -> "false", 4 -> "no"),
     |   Map(0 -> "sunny", 1 -> "hot", 2 -> "high", 3 -> "true", 4 -> "no"),
     |   Map(0 -> "overcast", 1 -> "hot", 2 -> "high", 3 -> "false", 4 -> "yes"),
     |   Map(0 -> "rainy", 1 -> "mild", 2 -> "high", 3 -> "false", 4 -> "yes"),
     |   Map(0 -> "rainy", 1 -> "cool", 2 -> "normal", 3 -> "false", 4 -> "yes"))
a: List[scala.collection.immutable.Map[Int,String]] = List(Map(0 -> sunny, 1 -> hot, 2 -> high, 3 -> false, 4 -> no), Map(0 -> sunny, 1 -> hot, 2 -> high, 3 -> true, 4 -> no), Map(0 -> overcast, 1 -> hot, 2 -> high, 3 -> false, 4 -> yes), Map(0 -> rainy, 1 -> mild, 2 -> high, 3 -> false, 4 -> yes), Map(0 -> rainy, 1 -> cool, 2 -> normal, 3 -> false, 4 -> yes))

scala> sc.parallelize(a)
res0: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,String]] = ParallelCollectionRDD[0] at parallelize at <console>:15

scala> val l = sc.parallelize(a)
l: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,String]] = ParallelCollectionRDD[1] at parallelize at <console>:14

scala> def check(s: String): Boolean = s == "yes"
check: (s: String)Boolean

scala> val res = l.map{ x => check(x(4)) }
res: org.apache.spark.rdd.RDD[Boolean] = MappedRDD[4] at map at <console>:18


14/11/28 00:18:47 INFO DAGScheduler: Stage 5 (take at <console>:21) finished in 0.020 s
14/11/28 00:18:47 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 
14/11/28 00:18:47 INFO DAGScheduler: Job 5 finished: take at <console>:21, took 0.026501 s
false
false
true
true
true
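
One caveat with `x(4)` above: applying a Map to a key it does not contain throws a NoSuchElementException, so a single short row would fail the whole job. The following is a minimal sketch, assuming rows may be incomplete, that uses `Map.get` (which returns an `Option`) instead. The names `SafeCheck` and `hasValue` are hypothetical, and plain Scala collections stand in for the RDD so the example runs without Spark; the function passed to `map` would have the same shape on an RDD.

```scala
object SafeCheck {
  // Hypothetical helper: true only when key `idx` exists and its value equals `expected`.
  def hasValue(row: Map[Int, String], idx: Int, expected: String): Boolean =
    row.get(idx).exists(_ == expected)

  def main(args: Array[String]): Unit = {
    val rows = List(
      Map(0 -> "sunny", 4 -> "no"),
      Map(0 -> "rainy", 4 -> "yes"),
      Map(0 -> "overcast") // key 4 is missing in this row
    )
    // A missing key counts as "not yes" instead of throwing.
    println(rows.map(hasValue(_, 4, "yes"))) // prints List(false, true, false)
  }
}
```

Whether a missing column should count as false or be reported as an error depends on the data; this sketch simply treats it as a non-match.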

Update: the following returns true only if all values are true, and false otherwise.

scala> res.reduce( _ && _ )
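
Note that `reduce` throws on an empty collection, so if the input might be empty, `fold` with a start value of `true` may be a safer way to express "all values are true". The sketch below shows this, plus a count of matching rows, on a plain `Seq[Boolean]` so it runs without Spark; `AllYes`, `allTrue` and `countTrue` are hypothetical names. On an RDD the same `fold(true)(_ && _)` call should apply, while the count would be written as `filter(identity).count()`, since `RDD.count()` takes no predicate.

```scala
object AllYes {
  // True when every flag is true; an empty input yields true instead of throwing.
  def allTrue(flags: Seq[Boolean]): Boolean = flags.fold(true)(_ && _)

  // Number of flags that are true.
  def countTrue(flags: Seq[Boolean]): Int = flags.count(identity)

  def main(args: Array[String]): Unit = {
    val flags = Seq(false, false, true, true, true) // the values printed by the transcript
    println(allTrue(flags))     // prints false
    println(countTrue(flags))   // prints 3
    println(allTrue(Seq.empty)) // prints true
  }
}
```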