How can I join two RDDs by key?

Asked: 2015-11-22 15:20:34

Tags: python hadoop apache-spark pyspark

animals_population_file = sc.textFile("input/myFile1.txt")
animals_place_file = sc.textFile("input/myFile2.txt")

animals_population_file:

Dogs, 5
Cats, 6

animals_place_file:

Dogs, Italy
Cats, Italy
Dogs, Spain

Now I want to join animals_population_file and animals_place_file using the animal type as the key. The result should be:

Dogs, [Italy, Spain, 5]
Cats, [Italy, 6]

I tried joined = animals_population_file.join(animals_place_file), but I don't know how to define the key. Also, when I run joined.collect(), it gives me an error:

    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o247.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 29, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1807, in <lambda>
    map_values_fn = lambda (k, v): (k, f(v))
ValueError: too many values to unpack

1 Answer:

Answer 0 (score: 1):

Running textFile does not give you a PairRDD (based on the RDD contents shown in the comments). To perform a join you need PairRDDs, so convert your inputs into PairRDDs first:

val rdd1 = sc.textFile("input/myFile1.txt")
val rdd2 = sc.textFile("input/myFile2.txt")

val data1 = rdd1.map(line => line.split(",").map(elem => elem.trim))
val data2 = rdd2.map(line => line.split(",").map(elem => elem.trim))

val pairRdd1 = data1.map(r => (r(0), r))  // index 0 is the animal type, which is the key in file 1
val pairRdd2 = data2.map(r => (r(0), r))  // index 0 is the animal type, which is the key in file 2 as well

val joined = pairRdd1.join(pairRdd2)

val local = joined.collect()
local.foreach { case (k, v) =>
  print(k + " : ")
  println(v._1.mkString("|") + "|" + v._2.mkString("|"))
}
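
Since the question is tagged pyspark, here is a minimal sketch of the same approach in Python, assuming the comma-separated file layout shown in the question and an existing SparkContext named sc:

# Parse each line into a (key, value) pair so that join() has a key to match on.
animals_population_file = sc.textFile("input/myFile1.txt")
animals_place_file = sc.textFile("input/myFile2.txt")

# Split on the comma, trim whitespace, and use field 0 (the animal type) as the key.
population_pairs = animals_population_file.map(
    lambda line: [e.strip() for e in line.split(",")]).map(lambda r: (r[0], r[1]))
place_pairs = animals_place_file.map(
    lambda line: [e.strip() for e in line.split(",")]).map(lambda r: (r[0], r[1]))

joined = population_pairs.join(place_pairs)
for k, v in joined.collect():
    print("{0} : {1}".format(k, v))
# Dogs : ('5', 'Italy')
# Dogs : ('5', 'Spain')
# Cats : ('6', 'Italy')

Note that join emits one record per matching pair of values, so the animal's count is repeated for each place. To collapse the result into the exact [Italy, Spain, 5] shape shown in the question, you would additionally group the joined values by key, for example with groupByKey or cogroup.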