Question

Due to resource constraints, I need to split a large RDD into n smaller RDDs and submit each of them to spark-submit as a separate job. The code looks like this:

def split_rdd_by_key(input_rdd, distinct_key_count, num_splits=10):

    # Calc. chunk indexes - lower and upper bounds for each smaller rdd
    chunk_gtor = __chunk_points(distinct_key_count, num_splits)

    smaller_rdds = []
    sort_key = "data_key"

    # Create sets of smaller rdds by filtering on indexes
    for item in chunk_gtor:
        lbval, ubval = item[0], item[1]

        print("lbval=%s ubval=%s" % (lbval, ubval))
        filt_rdd = input_rdd.filter(lambda x: x.key.entity >= lbval
                                    and x.key.entity <= ubval)
        filt_count = filt_rdd.count()
        print("filt_count=%s" % filt_count)
        smaller_rdds.append(filt_rdd)

    return smaller_rdds

The code above prints the size of each smaller RDD as it is created and appends that RDD to the smaller_rdds list.

However, if I run the function above:

filtered_rdds=split_rdd_by_key(rdd, distinct_key_count)

and then do the following:

# See the size of the 1st smaller rdd:
filtered_rdds[0].count()

it returns 31, which is much smaller than the count printed while split_rdd_by_key was running!

Can anyone help explain this? I must be missing something.

Answer 1

I figured out the problem. I needed to call cache():

# input_rdd is the RDD passed to split_rdd_by_key in the question
filt_rdd = input_rdd.filter(lambda x: x.key.entity >= lbval
                            and x.key.entity <= ubval)
filt_rdd.cache()

It works now :-)
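A likely explanation, which the answer above does not spell out: RDD transformations are lazy, and a Python lambda looks up lbval and ubval when it is called, not when it is defined. By the time filtered_rdds[0].count() runs outside the function, the loop has finished, so every filter would see the bounds of the last chunk. cache() appears to help because the count() inside the loop materializes each filtered RDD while the bounds still hold the right values. The sketch below reproduces the effect in plain Python without Spark; the data, the bounds, and the helper make_range_pred are hypothetical and only serve the illustration.

```python
# Pure-Python sketch: predicates are built in a loop but only
# evaluated after the loop has finished, like lazy RDD filters.
data = list(range(100))
bounds = [(0, 49), (50, 59)]  # hypothetical (lower, upper) chunk bounds

# Late binding: each lambda reads lbval/ubval when it is called,
# so after the loop every predicate sees the last pair (50, 59).
late_preds = []
for lbval, ubval in bounds:
    late_preds.append(lambda x: lbval <= x <= ubval)
late_counts = [sum(1 for x in data if pred(x)) for pred in late_preds]


def make_range_pred(lb, ub):
    """Return a predicate whose bounds are fixed at creation time."""
    return lambda x: lb <= x <= ub


# Early binding: each predicate keeps its own lb/ub.
bound_preds = [make_range_pred(lb, ub) for lb, ub in bounds]
bound_counts = [sum(1 for x in data if pred(x)) for pred in bound_preds]

print(late_counts)   # [10, 10]
print(bound_counts)  # [50, 10]
```

If this is indeed the cause, binding the bounds explicitly in the loop, for example with default arguments as in lambda x, lb=lbval, ub=ubval: lb <= x.key.entity <= ub, should give each smaller RDD its own range without depending on caching.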

Strange behavior when splitting an RDD into smaller RDDs and storing them in a list
