Key-value mapping in Spark

Time: 2015-06-09 08:07:55

Tags: scala apache-spark

I have two files.

The first file contains key-value pairs (for example, apple is the key and fruit is the value):

apple fruit
banana fruit
mango fruit
potato vegetable
tomato vegetable

The second file contains:

apple
banana
mango
potato
tomato

I need to iterate over the second file and find the matching value in file 1. The final output should be as follows (fruit is the key; apple, banana, ... are the values):

fruit    apple,banana,mango
vegetable    potato,tomato

Please suggest the best and most efficient way to do this with Spark and Scala.

3 Answers:

Answer 0 (score: 2)

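import org.apache.spark.{SparkConf, SparkContext}
// Spark versions before 1.3 also need: import org.apache.spark.SparkContext._

val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)

// file 1: (name, category) pairs
val one = List(("apple","fruit"), ("banana","fruit"), ("tomato","vegetable"),
               ("mango", "fruit"), ("potato","vegetable"))
val oneRdd = sc.makeRDD(one, 1)

// file 2: the names to look up (maybe a Broadcast for this)
val two = List("apple", "banana", "tomato", "mango", "potato")

// keep the pairs whose name appears in `two`, re-key by category, concatenate the names
val res = oneRdd.filter { case (name, _) => two.contains(name) }
  .map { case (name, category) => (category, List(name)) }
  .reduceByKey(_ ++ _)
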
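EDIT: And a version that works entirely with RDDs, so file1 and file2 can both be very large (although if file2 is large it probably contains duplicates, so you would probably want a .distinct on each before the reduceByKey):

val oneRdd = sc.makeRDD(one, 1)

val twoRdd = sc.makeRDD(two, 1).map(a => (a, a)) // to make a PairRDD

// join on the name, then re-key by category and concatenate the names
val res = oneRdd.join(twoRdd).map { case (k, (v1, v2)) => (v1, List(k)) }.reduceByKey(_ ++ _)

Both versions produce the same output.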
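
The remark about duplicates in file2 can be made concrete. The following is only a sketch, assuming Spark 1.3 or later (so that `reduceByKey` and `join` are available without the extra `SparkContext._` import); the object name `GroupByCategory`, the function `group`, and the sample data are illustrative and not part of the answer. It deduplicates the lookup names before the join, so a name repeated in file2 does not appear twice in the grouped result.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object GroupByCategory {
  // pairs: (name, category) rows from file 1; names: lookup names from file 2.
  // Hypothetical helper: dedupe the names, join on name, then group names by category.
  def group(pairs: RDD[(String, String)], names: RDD[String]): RDD[(String, List[String])] = {
    val wanted = names.distinct().map(n => (n, ())) // key by name so it can be joined
    pairs.join(wanted)
      .map { case (name, (category, _)) => (category, List(name)) }
      .reduceByKey(_ ++ _)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("group-by-category").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Assumed sample data; "apple" is repeated on purpose.
      val pairs = sc.parallelize(Seq(("apple", "fruit"), ("banana", "fruit"),
        ("tomato", "vegetable"), ("mango", "fruit"), ("potato", "vegetable")))
      val names = sc.parallelize(Seq("apple", "apple", "banana", "tomato", "mango", "potato"))
      group(pairs, names).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With real files, `pairs` and `names` would come from `sc.textFile(...)` instead of `sc.parallelize(...)`. The order of names inside each list is not guaranteed after a shuffle.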

Answer 1 (score: 1)

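// file 1: "name category" lines, parsed into (name, category) pairs
val inputRDD1 = sc.textFile("file1.txt").map(r => {
    val arr = r.split(" ")
    (arr(0), arr(1))
})

// file 2: one name per line
val inputRDD2 = sc.textFile("file2.txt")

// ship file 1 to every executor as a lookup map (file 1 must fit in memory)
val broadcastRDD = sc.broadcast(inputRDD1.collect.toMap)

// Map.get returns an Option; flatMap drops names that are missing from file 1
val interRDD = inputRDD2.flatMap(r => broadcastRDD.value.get(r).map(category => (category, r)))

val outputRDD = interRDD.groupByKey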

Output:

res16: Array[(String, Iterable[String])] = Array((fruit,CompactBuffer(apple, banana, mango)), (vegetable,CompactBuffer(potato, tomato)))
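
One detail worth keeping in mind with a lookup map like the broadcast one here: Scala's `Map.get` returns an `Option`, so a name from file 2 that is missing from file 1 comes back as `None` rather than a plain `String`. A minimal sketch in plain Scala collections (no Spark involved; the object name `LookupSketch`, the sample data, and the extra name "kiwi" are illustrative assumptions) of how `flatMap` over that `Option` keeps only the matches and yields plain `String` keys:

```scala
object LookupSketch {
  // Hypothetical sample data standing in for file 1 (as a map) and file 2 (as a list).
  val lookup: Map[String, String] = Map(
    "apple" -> "fruit", "banana" -> "fruit", "mango" -> "fruit",
    "potato" -> "vegetable", "tomato" -> "vegetable")
  val names: List[String] = List("apple", "banana", "mango", "potato", "tomato", "kiwi")

  // lookup.get(n) is Option[String]; flatMap silently drops the None for "kiwi".
  val pairs: List[(String, String)] =
    names.flatMap(n => lookup.get(n).map(category => (category, n)))

  // Group the names under their category, keeping input order within each group.
  val grouped: Map[String, List[String]] =
    pairs.groupBy(_._1).map { case (category, ps) => (category, ps.map(_._2)) }

  def main(args: Array[String]): Unit =
    grouped.foreach { case (category, ns) => println(s"$category ${ns.mkString(",")}") }
}
```

Whether unmatched names should be dropped or reported is a design choice; collecting them under a separate key such as "unknown" would be an equally reasonable variant.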

Answer 2 (score: 0)

>>> d = [('apple','fruit'), ('banana','fruit'), ('tomato','veg'), ('mango','fruit'), ('potato','veg')]
>>> r = sc.parallelize(d)
>>> r1 = r.map(lambda x: (x[1], x[0])).groupByKey()  # swap to (category, name), then group
>>> for i in r1.collect():
...     print("%s  %s" % (i[0], list(i[1])))

veg  ['tomato', 'potato']
fruit  ['apple', 'banana', 'mango']