I have two files.
The first file contains (consider apple as the key and fruit as the value):
apple
banana
mango
potato
tomato
The second file contains:
fruit apple,banana,mango
vegetable potato,tomato
I need to iterate over the second file and find the matching values in file 1. The final output I need is (with fruit as the key and apple, banana, ... as the values).
Please suggest the best and most optimized way to do this using Spark and Scala.
Answer 0 (score: 2)
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc= new SparkContext(conf)
val one = List(("apple","fruit"), ("banana","fruit"), ("tomato","vegetable"),
("mango", "fruit"), ("potato","vegetable"))
val oneRdd = sc.makeRDD(one, 1)
//maybe a Broadcast for this
val two = List("apple", "banana", "tomato", "mango", "potato")
val res = oneRdd.filter(two contains _._1).map(t=>(t._2,List(t._1))).reduceByKey{_++_}
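The inline comment above hints at broadcasting the lookup list so each executor gets a single read-only copy instead of capturing it in every task closure. A minimal sketch of that variant (the names twoB and resB are mine, not from the answer):

val twoB = sc.broadcast(two.toSet)          // ship file2's items to the executors once
val resB = oneRdd
  .filter(t => twoB.value.contains(t._1))   // keep only items present in file2
  .map(t => (t._2, List(t._1)))
  .reduceByKey(_ ++ _)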
Edit: and here is a version that works entirely with RDDs, so file1 and file2 can both be very large (although if file2 is large it may contain duplicates, so you may need a .distinct before the reduceByKey). Both versions produce the same output:
val oneRdd = sc.makeRDD(one, 1)
val twoRdd = sc.makeRDD(two, 1).map(a => (a, a))  // make a PairRDD so it can be joined
val res = oneRdd.join(twoRdd)
  .map { case (k, (v1, v2)) => (v1, List(k)) }    // (category, List(item))
  .reduceByKey(_ ++ _)
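If file2 can contain duplicate lines, deduplicating it before the join keeps an item from being added to its category list twice. A sketch of where the .distinct mentioned in the edit would go (twoRddDistinct and resDistinct are assumed names):

val twoRddDistinct = sc.makeRDD(two, 1).distinct.map(a => (a, a))  // drop duplicate items first
val resDistinct = oneRdd.join(twoRddDistinct)
  .map { case (k, (v1, _)) => (v1, List(k)) }
  .reduceByKey(_ ++ _)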
Answer 1 (score: 1)
val inputRDD1 = sc.textFile("file1.txt").map(r=> {
val arr = r.split(" ")
(arr(0), arr(1))
})
val inputRDD2 = sc.textFile("file2.txt")
val broadcastRDD = sc.broadcast(inputRDD1.collect.toList.toMap)
val interRDD = inputRDD2.map(r => (broadcastRDD.value.get(r), r))
val outputRDD = interRDD.groupByKey
Output:
res16: Array[(String, Iterable[String])] = Array((fruit,CompactBuffer(apple, banana, mango)), (vegetable,CompactBuffer(potato, tomato)))
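Since the question asks for an optimized approach, it is worth noting that groupByKey shuffles every raw value across the cluster; building the per-category lists with aggregateByKey pre-combines on each partition before the shuffle. A hypothetical variant of the last step above (outputAgg is an assumed name):

val outputAgg = interRDD.aggregateByKey(List.empty[String])(
  (acc, item) => item :: acc,   // add an item to the partition-local list
  (a, b) => a ::: b             // merge lists from different partitions
)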
Answer 2 (score: 0)
>>> d=[('apple','fruit'),('banana','fruit'),('tomato','veg'),('mango','fruit'),(
'potato','veg')]
>>> r = sc.parallelize(d)
>>> r1=r.map(lambda x: (x[1],x[0])).groupByKey()
>>> for i in r1.collect():
... print "%s %s" %(i[0],list(i[1]))
veg ['tomato', 'potato']
fruit ['apple', 'banana', 'mango']