How can one write to multiple outputs, one per key, for an RDD using Python and Spark in a single job? I know I could call .filter once for every possible key, but that is a lot of code and launches a separate job per key (sketched below).
Similar to this question: Write to multiple outputs by key Spark - one Spark job
However, the answer to that question is in Scala. I'm looking for a how-to using Python.
Here is my current attempt:
import os
import json

PATH = os.path.join("s3://asdf/hjkl", 'temp_date', "intermediate_data/")

global current_sport
current_sport = ''

def format_for_output(x):
    current_sport = x[0]
    return json.dumps(x[1])

recommendation2.map(format_for_output).saveAsTextFile(os.path.join(PATH, current_sport))
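For reference, the filter-per-key workaround mentioned above would look roughly like this; it triggers one saveAsTextFile action, and therefore one job, per key (a sketch only: all_sports is a placeholder for the list of known keys):

for sport in all_sports:
    (recommendation2
        .filter(lambda kv, sport=sport: kv[0] == sport)  # Bind sport per iteration
        .map(lambda kv: json.dumps(kv[1]))
        .saveAsTextFile(os.path.join(PATH, sport)))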
Answer (score: 1):
If you want a plain Python solution, you can simply partition the RDD by key. First, let's create some dummy data:
import numpy as np
np.random.seed(1)
keys = [chr(x) for x in xrange(65, 91)]
rdd = sc.parallelize(
    (np.random.choice(keys), np.random.randint(0, 100)) for _ in xrange(10000))
Now let's pretend we don't know anything about the keys. We have to create a mapping from each key to a partition id:
mapping = sc.broadcast(
    rdd.keys().           # Get keys
    distinct().           # Find unique
    sortBy(lambda x: x).  # Sort
    zipWithIndex().       # Add index
    collectAsMap())       # Create dict
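The broadcast value is an ordinary Python dict mapping each distinct key to a contiguous index, so every key gets its own partition and, after saving, its own output file.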
Finally, we can partition the RDD using the above mapping and save it to text files:
(rdd
    .partitionBy(
        len(mapping.value),                           # Number of partitions
        partitionFunc=lambda x: mapping.value.get(x)  # Mapping
    )
    .saveAsTextFile("foo"))
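If, as in the question, you only want the JSON-encoded values in the output files, a map can be applied after partitionBy. This is a sketch rather than part of the original answer; map is a narrow transformation, so each record stays in the partition chosen above, and "foo_json" is just an illustrative output directory:

import json

(rdd
    .partitionBy(
        len(mapping.value),
        partitionFunc=lambda x: mapping.value.get(x))
    .map(lambda kv: json.dumps(int(kv[1])))  # int() because the dummy values are numpy integers
    .saveAsTextFile("foo_json"))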
Let's check that everything works as expected:
import glob

cnts = rdd.countByKey()               # Count values by key
fs = sorted(glob.glob("foo/part-*"))  # Get output file names

assert len(fs) == len(mapping.value)  # All keys present

for (k, v) in sorted(mapping.value.items()):
    with open(fs[v]) as fr:
        lines = fr.readlines()
        assert len(lines) == cnts[k]             # Number of records as expected
        assert all(k in line for line in lines)  # All with the same key
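As an optional follow-up (not part of the original answer), the same mapping can be used to rename each part file after the key it contains. A sketch for local output in "foo"; on S3 you would copy the objects instead, and the ".txt" suffix is just illustrative:

import os

for key, part_id in sorted(mapping.value.items()):
    src = os.path.join("foo", "part-%05d" % part_id)  # Spark's default part file name
    dst = os.path.join("foo", "%s.txt" % key)
    if os.path.exists(src):
        os.rename(src, dst)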