Question

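I am comparing two programs (a sequential one in Python and a distributed one in PySpark) that are supposed to produce the same result. The goal is to find the number of triplets of connected components in a graph G(V, E). The results of the PySpark implementation are only partially correct, because some triplets are not detected.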

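I compared the input data (Python version: a two-column pandas dataframe; PySpark version: an RDD of tuples and a broadcast dict) just before the final counting operation, and they are consistent. Here is the PySpark script: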

col_a = sc.parallelize(df.a).zipWithIndex().map(lambda xi: (xi[1], xi[0]))  # df.a: a series of int (nodes)
col_b = sc.parallelize(df.b).zipWithIndex().map(lambda xi: (xi[1], xi[0]))  # df.b: a series of int (nodes)
# remove the index and move the smaller value to the left side
edges = col_a.join(col_b).map(lambda kv: tuple(sorted(kv[1])))

# for each node: sorted list of its neighbors, with the node itself removed if present
neighborhood = edges.groupByKey().sortByKey().mapValues(set).map(lambda kv: (kv[0], sorted(kv[1] - {kv[0]})))
par_neighborhood = neighborhood.cache()
br_neighborhood = sc.broadcast(neighborhood.collectAsMap())

def f(vertex):
    search = vertex[1] 
    for i in search:
        increment = set(search).intersection(br_neighborhood.value.get(i, []))
        if len(increment) > 0 :
            return vertex[0], i, increment
        else:
            return 0

triplets = par_neighborhood.map(f).collect()

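The sequential version is based on the same algorithm, and the neighborhood RDD, when fetched with collect() to check consistency, holds the same values as in the sequential code. However, when I evaluate the triplets produced by the PySpark implementation, some of them are missing.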
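The sequential logic described here can be sketched in plain Python roughly as follows. This is only an illustration under assumptions: the original sequential script is not shown, and the names `build_neighborhood` and `count_triplets` as well as the sample edge list are hypothetical.

```python
from collections import defaultdict


def build_neighborhood(edges):
    """Map each node to the sorted list of its larger-numbered neighbors."""
    neighborhood = defaultdict(set)
    for a, b in edges:
        lo, hi = (a, b) if a < b else (b, a)  # smaller node on the left
        if lo != hi:                          # assumption: self-loops are ignored
            neighborhood[lo].add(hi)
    return {node: sorted(nbrs) for node, nbrs in neighborhood.items()}


def count_triplets(neighborhood):
    """Count triplets (v, i, w): i and w are neighbors of v, and w is a neighbor of i."""
    total = 0
    for v, nbrs in neighborhood.items():
        for i in nbrs:
            # every neighbor of v is examined, none is skipped
            total += len(set(nbrs) & set(neighborhood.get(i, [])))
    return total


# hypothetical sample: triangle 1-2-3 plus a pendant edge 3-4
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
neighborhood = build_neighborhood(edges)
print(neighborhood)
print(count_triplets(neighborhood))
```

Comparing the count from such a reference with the number of matches found by the distributed version is one way to see how many triplets are lost.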

So far I have not been able to spot my mistake.

Edit:

I changed the function f and called it with foreachPartition(f); the results are now consistent and the computation is faster.

count = sc.accumulator(0)

def f(iterator):  # new f
    global count 
    for vertex_n in iterator:
        for i in vertex_n[1]:
            increment = set(vertex_n[1]).intersection(br_neighborhood.value.get(i, []))
            if len(increment) > 0:
                count.add(len(increment))

par_neighborhood.foreachPartition(f)  # call
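The two versions of f differ in more than how they are called, which may account for the missing triplets: the first one reaches a `return` during the first loop iteration, so at most one neighbor per vertex is ever examined, while the new one loops over every neighbor. The plain-Python sketch below mimics both behaviours without Spark. The names `first_f` and `all_matches` and the sample `neighborhood` dict are hypothetical stand-ins, with the dict playing the role of `br_neighborhood.value`.

```python
# hypothetical stand-in for br_neighborhood.value: node -> sorted larger neighbors
neighborhood = {1: [2, 3, 4], 2: [5], 3: [4], 4: []}


def first_f(vertex):
    """Same control flow as the first f: returns during the first loop iteration."""
    search = vertex[1]
    for i in search:
        increment = set(search).intersection(neighborhood.get(i, []))
        if len(increment) > 0:
            return vertex[0], i, increment
        else:
            return 0  # leaves the loop before the remaining neighbors are checked


def all_matches(vertex):
    """Generator variant that examines every neighbor, as the new f does."""
    search = vertex[1]
    for i in search:
        increment = set(search).intersection(neighborhood.get(i, []))
        if increment:
            yield vertex[0], i, increment


vertex = (1, [2, 3, 4])
print(first_f(vertex))            # neighbor 2 shares nothing with vertex 1, so this is 0
print(list(all_matches(vertex)))  # neighbor 3 is reached and contributes (1, 3, {4})
```

If the triplets themselves are needed rather than only their count, a generator like `all_matches` could be applied with `flatMap` instead of `map`, so that each vertex may contribute zero or more results.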

Inconsistency in PySpark collect operation - graph algorithm

0 answers: