Python - Write to multiple outputs by key Spark - one Spark job

Asked: 2015-07-28 16:21:28

Tags: python apache-spark pyspark

How can one write to multiple outputs for each key in an RDD using Python and Spark in one job? I know I can try to use .filter for all the possible keys, but this is a lot of work which will create many jobs.
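For example, the filter-per-key approach I'd like to avoid would look something like the sketch below (rdd, all_keys, and PATH are placeholder names), which triggers a separate job and a full pass over the data for every key:

# Hypothetical filter-per-key loop: one Spark job (one full scan) per key
for key in all_keys:
    (rdd
        .filter(lambda kv, key=key: kv[0] == key)  # keep only records for this key
        .saveAsTextFile(os.path.join(PATH, key)))  # one output directory per key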

Similar to this question: Write to multiple outputs by key Spark - one Spark job

However, the answer to the above question is in Scala. I'm looking for a how-to using Python. Here is my attempt:

import json
import os

PATH = os.path.join("s3://asdf/hjkl", "temp_date", "intermediate_data/")

global current_sport
current_sport = ''

def format_for_output(x):
    current_sport = x[0]  # intended to record the key for the save path below
    return json.dumps(x[1])

recommendation2.map(format_for_output).saveAsTextFile(os.path.join(PATH, current_sport))
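(This doesn't work as intended: format_for_output runs on the executors, so the assignment to current_sport there only creates a local variable and never updates the driver's copy; saveAsTextFile therefore always sees current_sport == ''.)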

1 Answer:

Answer 0 (score: 1)

If you want a plain Python solution, you can simply partition the RDD by key. First, let's create some dummy data:

import numpy as np
np.random.seed(1)

keys = [chr(x) for x in xrange(65, 91)]
rdd = sc.parallelize(
    (np.random.choice(keys), np.random.randint(0, 100)) for _ in xrange(10000))

Now let's pretend we don't know anything about the keys. We have to create a mapping from key to partition id:

mapping = sc.broadcast(
    rdd.keys(). # Get keys
        distinct(). # Find unique
        sortBy(lambda x: x). # Sort
        zipWithIndex(). # Add index
        collectAsMap()) # Create dict
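With the alphabet keys above, 10,000 samples almost certainly draw every letter at least once, so the broadcast dict should map each letter to its sorted position. A sanity check (assuming all 26 letters appear):

assert mapping.value['A'] == 0
assert mapping.value['Z'] == 25
assert len(mapping.value) == 26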

Finally, we can partition using the above mapping and save to a text file:

(rdd.
    partitionBy(
        len(mapping.value), # Number of partitions
        partitionFunc=lambda x: mapping.value.get(x) # Mapping
    ).saveAsTextFile("foo"))
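Each part-* file now holds records for exactly one key. Note that saveAsTextFile writes the string form of each element, so the lines are tuple reprs; foo/part-00000 should look something like this (illustrative values):

('A', 37)
('A', 12)
...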

Let's check if everything works as expected:

import glob

cnts = rdd.countByKey() # Count values by key
fs = sorted(glob.glob("foo/part-*")) # Get output names

assert len(fs) == len(mapping.value) # One output file per key

for (k, v) in sorted(mapping.value.items()):
    with open(fs[v]) as fr:
        lines = fr.readlines()
        assert len(lines) == cnts[k] # Number of records as expected
        assert all(k in line for line in lines) # All with the same key
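If you want only the values in each file (like json.dumps(x[1]) in the question) instead of the full tuples, you can drop the keys after partitioning. A plain map keeps each record in its partition, and preservesPartitioning=True additionally tells Spark the partitioner still applies. A sketch (the output path "foo_values" is illustrative):

import json

(rdd.
    partitionBy(
        len(mapping.value),
        partitionFunc=lambda x: mapping.value.get(x)
    ).
    map(lambda kv: json.dumps(kv[1]), preservesPartitioning=True). # values only
    saveAsTextFile("foo_values"))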