Spark RDD: lookup from another RDD

Date: 2018-08-20 01:04:37

Tags: apache-spark pyspark rdd

As part of some exercises for a "roll my own association rules" module, I am trying to perform the fastest possible lookups in Spark. Please note that I know PySpark already supports the metric below (confidence); this is only an example. Another metric (lift) is not supported, and I intend to use the outcome of this discussion to develop that one.
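For reference, the standard definitions I am working from are the usual ones. The small helpers below are only there to make the arithmetic explicit and are not part of the code further down; support(X) is taken to be the fraction of transactions containing the itemset X.

def confidence(support_antecedent_and_consequent, support_antecedent):
    # confidence(A => B) = support(A and B together) / support(A)
    return support_antecedent_and_consequent / support_antecedent

def lift(support_antecedent_and_consequent, support_antecedent, support_consequent):
    # lift(A => B) = confidence(A => B) / support(B)
    return confidence(support_antecedent_and_consequent, support_antecedent) / support_consequent

Both metrics therefore reduce to the same kind of lookup: how often certain itemsets occur in the transaction set.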

As part of calculating the confidence of a rule, I need to look at how often the antecedent and the consequent occur together, as well as how often the antecedent occurs across the whole transaction set (in this case, rdd):

from itertools import combinations, chain

def powerset(iterable, no_empty=True):
    ''' Produce the powerset for a given iterable '''
    s = list(iterable)
    combos = (combinations(s, r) for r in range(len(s)+1))
    powerset = chain.from_iterable(combos)
    return (el for el in powerset if el) if no_empty else powerset

# Set-up transaction set
rdd = sc.parallelize(
    [
        ('a',),
        ('a', 'b'),
        ('a', 'b'),
        ('b', 'c'),
        ('a', 'c'),
        ('a', 'b'),
        ('b', 'c'),
        ('c',),
        ('b',),
    ]
)

# Create an RDD with the counts of each
# possible itemset
counts = (
    rdd
    .flatMap(lambda x: powerset(x))
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: x + y)
    .map(lambda x: (frozenset(x[0]), x[1]))
)

# Function to calculate confidence of a rule
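# NOTE: counts.lookup(...) is an RDD operation; calling it inside rdd.map below
# is what triggers the SPARK-5063 exception shown further down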
confidence = lambda x: counts.lookup(frozenset(x)) / counts.lookup((frozenset(x[1]),))

confidence_result = (
    rdd
    # Must be applied to length-two and greater itemsets
    .filter(lambda x: len(x) > 1)
    .map(confidence)
)

For those of you familiar with this type of lookup problem, you will know that it raises this type of exception:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

One way to get around this exception is to convert counts to a dictionary:

counts = dict(counts.collect())

confidence = lambda x: (x, counts[frozenset(x)] / counts[frozenset(x[1])])

confidence_result = (
    rdd
    # Must be applied to length-two and greater itemsets
    .filter(lambda x: len(x) > 1)
    .map(confidence)
)

This gives me my result, but running the collect in this way is very expensive, since in reality I have a dataset with 50m+ records. Is there a better option for performing this type of lookup?

1 Answer:

Answer 0 (score: 0)

If the target metric can be computed independently on each RDD partition and then combined to obtain the final result, you can use mapPartitions instead of map when computing the metric.

The general flow should look something like this:

from functools import reduce

partial_results = (
    rdd
    # apply your metric calculation independently on each partition
    .mapPartitions(confidence_partial)
    # collect results from the partitions into a single list of results
    .collect()
)

# reduce the list to combine the metrics calculated on each partition
metric_result = reduce(confidence_combine, partial_results)

Both confidence_partial and confidence_combine are regular Python functions that take an iterator/list as input.
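As an illustration, here is one minimal way these two placeholders might be written for the itemset-counting part of the problem; the Counter-based approach is my own assumption and not part of the original answer.

from collections import Counter
from itertools import combinations

def confidence_partial(transactions):
    ''' Count every non-empty itemset within one partition. '''
    local_counts = Counter()
    for transaction in transactions:
        items = list(transaction)
        for r in range(1, len(items) + 1):
            for itemset in combinations(items, r):
                local_counts[frozenset(itemset)] += 1
    # mapPartitions expects an iterable back, so yield the single Counter
    yield local_counts

def confidence_combine(left, right):
    ''' Merge two per-partition Counters into one. '''
    left.update(right)
    return left

With those in place, the combined itemset counts end up in a single Counter on the driver, and the confidence of a rule can be looked up from plain Python instead of from an RDD.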

As an aside, you may get a substantial performance boost by using the DataFrame API and native expression functions to compute the metric.
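A rough sketch of that DataFrame route might look like the following; the session setup, the column names, and the restriction to single-item counts are assumptions on my part, and counting full itemsets would still need the powerset expansion from the question.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical layout: one array column holding each transaction's items
transactions_df = spark.createDataFrame(
    [(['a'],), (['a', 'b'],), (['b', 'c'],)],
    ['items'],
)

# native expressions: explode the arrays and count each single item
item_counts = (
    transactions_df
    .select(F.explode('items').alias('item'))
    .groupBy('item')
    .count()
)

item_counts.show()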