Graph cycle detection with Spark (or any parallel algorithm)?

Date: 2018-05-21 08:59:43

Tags: apache-spark graph mapreduce bigdata

I have a problem where I have to ingest a large amount of data in which each item is a person with one or more identifiers. The point of the ingestion is to find people that share a common identifier and join them together as the same person; this can lead to long chains being merged into a single person. Basically it is all about finding the connected components of a huge set of vertices. The thing is, while I know how to do this with a store that I can query iteratively, processing the initial items one by one and gradually merging them together, I would like to do it in Spark.

The problem is that I can only think of an iterative solution, one that processes the data in passes.

So in each pass I would:

  • group the people by shared attribute
  • if no group has size > 1, stop; otherwise
  • merge them into one person and repeat from the start (a rough sketch of one such pass is shown below)
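
A rough sketch of what one such pass could look like with PySpark DataFrames (the column names person/attr and the sample rows are only assumptions for illustration):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# hypothetical input: one row per (person, identifier) pair
df = spark.createDataFrame(
    [("P1", "a1"), ("P1", "a2"), ("P2", "a2"), ("P2", "a3")],
    ["person", "attr"])

# one pass: group people that share the same identifier ...
groups = df.groupBy("attr").agg(f.collect_set("person").alias("people"))

# ... stop if no group has size > 1, otherwise merge those people and repeat
needs_another_pass = groups.where(f.size("people") > 1).limit(1).count() > 0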

Now, the number of passes depends entirely on the data. For instance:

P1 a1 a2
P2 a2 a3
P3 a3 a4
...
P10000 a10000 a1

In this case the expected output would be (since they are all connected):

P1 a1 a2 a3 ... a10000

With the layout above, where P is a person and a an attribute, this would take.... maybe log(N) passes, since each pass keeps reducing the number of people as they get joined, until the last one is finally connected with the first? Is there a way to solve this in a parallel and faster manner?

1 Answer:

Answer 0: (Score: 0)

You can use aggregate messages from the GraphX/GraphFrames packages, or some variant of the Pregel algorithm.

I used the following to detect cycles in a huge graph:

import time
from datetime import datetime

start_time = datetime.now()


# load pyspark (spark) and graphframes (graphx) modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql import Row

import pyspark.sql.functions as f
from graphframes import GraphFrame
from graphframes.lib import *
AM=AggregateMessages


def find_cycles(spark,sc,edges,max_iter=100000):

    # Cycle detection via message aggregation
    """
    The basic idea:
    ===============
    We propose a general algorithm for detecting cycles in a directed graph G by message passing among its vertices, 
    based on the bulk synchronous message passing abstraction. This is a vertex-centric approach in which the vertices 
    of the graph work together for detecting cycles. The bulk synchronous parallel model consists of a sequence of iterations, 
    in each of which a vertex can receive messages sent by other vertices in the previous iteration, and send messages to other 
    vertices.
    In each pass, each active vertex of G sends a set of sequences of vertices to its out-neighbours as described next.
    In the first pass, each vertex v sends the message (v) to all its out-neighbours. In subsequent iterations, each active vertex v 
    appends v to each sequence it received in the previous iteration. It then sends all the updated sequences to its out-neighbours. 
    If v has not received any message in the previous iteration, then v deactivates itself. The algorithm terminates when all the 
    vertices have been deactivated.
    For a sequence (v1, v2, . . . , vk) received by vertex v, the appended sequence is not forwarded in two cases: (i) if v = v1, 
    then v has detected a cycle, which is reported (see line 9 of Algorithm 1); (ii) if v = vi for some i ∈ {2, 3, . . . , k}, 
    then v has detected a sequence that contains the cycle (v = vi, vi+1, . . . , vk, vk+1 = v); in this case, 
    the sequence is discarded, since the cycle must have been detected in an earlier iteration (see line 11 of Algorithm 1); 
    to be precise, this cycle must have been detected in iteration k − i + 1. Every cycle (v1, v2, . . . , vk, vk+1 = v1) 
    is detected by all vi,i = 1 to k in the same iteration; it is reported by the vertex min{v1,...,vk} (see line 9 of Algorithm 1).
    The total number of iterations of the algorithm is the number of vertices in the longest path in the graph, plus a few more steps 
    for deactivating the final vertices. During the analysis of the total number of iterations, we ignore the few extra iterations 
    needed for deactivating the final vertices and detecting the end of the computation, since it is O(1).
    
    Pseudocode of the algorithm:
    ============================
    M(v): Message received from vertex v
    N+(v): all dst vertices of v

    function COMPUTE(M(v)):
        if i=0 then:
            for each w ∈ N+(v) do:
                send (v) to w 
        else if M(v) = ∅ then:
                deactivate v and halt 
        else:
            for each (v1,v2,...,vk) ∈ M(v) do:
                if v1 = v and min{v1,v2,...,vk} = v then:
                    report (v1 = v,v2,...,vk,vk+1 = v)
                else if v not ∈ {v2,...,vk} then:
                    for each w ∈ N+(v) do:
                        send (v1,v2,...,vk,v) to w

    
    Scalability of the algorithm:
    =============================
    the number of iterations depends on the length of the longest path/cycle in the graph;
    it scales between O(log(n)) and at most O(n), where n = number of vertices,
    so the number of iterations is at most linear in the number of vertices;
    additional edges (parallel edges etc.) do not affect the runtime


    for more details please refer to the original publication
    """

    print("+++ find_cycles(): starting cycle search ...")   

    
    # initialize the message column with own source id 
    init_vertices=(
        edges.select("src").union(edges.select("dst")).distinct().withColumnRenamed('src', 'id')
        .withColumn("message",f.array(f.col("id")))
        )
    
    init_edges=(
        edges
        .where(f.col("src")!=f.col("dst"))
        .select("src","dst")
        )
    
    # create empty dataframe to collect all cycles
    cycles = spark.createDataFrame(sc.emptyRDD(),StructType([StructField("cycle",ArrayType(StringType()),True)]))

    # create the graph object that will be updated each iteration
    gx = GraphFrame(init_vertices, init_edges)

    # iterate until max_iter
    # max_iter is a safety net in case the break condition is never reached
    # default value = 100,000
    loop_start_time =time.time()
    loop_iter_end_time=loop_start_time
    loop_iter_start_time=loop_start_time


    for iter_ in range(max_iter):

        print("+++ find_cycles(): iteration step= " + str(iter_) + " with loop time = " + str(round(time.time()-loop_start_time)) + " seconds (Delta = " +str(round(loop_iter_end_time-loop_iter_start_time))+")")
       

        loop_iter_start_time=round(time.time())

        # message that should be sent to the destination for aggregation
        msgToDst = AM.src["message"]
        # aggregate all messages that were received into an array (collect_set drops duplicates)
        agg = gx.aggregateMessages(
            f.collect_set(AM.msg).alias("aggMess"),
            sendToSrc=None,
            sendToDst=msgToDst)
        
        # BREAK condition: if no more messages were received, all cycles were found
        # and we can quit the loop

        if agg.rdd.isEmpty():
            print("THE END: All cycles found in " + str(iter_) + " iterations")
            break
        
        # aggMessages=agg.count()
        # apply the algorithm logic
        # filter for cycles that should be reported as found
        # compose the new message to be sent in the next iteration
        # _column name stands for temporary columns that are only used in the algo and then dropped again
        checkVerties=(
            agg
            # flatten the aggregated message from [[1]] to [] in order to have proper 1D arrays
            .withColumn("_flatten1",f.explode(f.col("aggMess")))
            # take first element of the array
            .withColumn("_first_element_agg",f.element_at(f.col("_flatten1"), 1))
            # take the minimum element of the array
            .withColumn("_min_agg",f.array_min(f.col("_flatten1")))
            # check if it is a cycle
            # it is a cycle when v1 = v and min{v1,v2,...,vk} = v
            .withColumn("_is_cycle",f.when(
                (f.col("id")==f.col("_first_element_agg")) &
                (f.col("id")==f.col("_min_agg"))
                 ,True)
                .otherwise(False)
            )
            # pick the cycle that should be reported (appended to the cycle list)
            .withColumn("_cycle_to_report",f.when(f.col("_is_cycle")==True,f.col("_flatten1")).otherwise(None))
            # sort array to have duplicates the same
            #.withColumn("_cycle_to_report",f.sort_array("_cycle_to_report"))
            # create a column where the first element is removed, to check if the current vertex is part of (v2,...,vk)
            .withColumn("_slice",f.array_except(f.col("_flatten1"), f.array(f.element_at(f.col("_flatten1"), 1)))) 
            # check if the vertex is part of the slice and set a True/False column
            .withColumn("_is_cycle2",f.lit(f.size(f.array_except(f.array(f.col("id")), f.col("_slice"))) == 0))
           )
        
        #print("checked Vertices")
        #checkVerties.show(truncate=False)
        # append found cycles to result dataframe via union
        # cache the new cycles using the workaround for SPARK-1334
        cachedCycles = AM.getCachedDataFrame(cycles)
        cachedCycles.count() # materialize it

        cycles=(
            # take existing cycles dataframe
            cachedCycles
            .union(
                # union = append all cycles that are in the current reporting column
                checkVerties
                .where(f.col("_cycle_to_report").isNotNull())
                .select("_cycle_to_report")
                )
        )

        # create the list of new messages that will be sent to the vertices in the next iteration
        newVertices=(
            checkVerties
            # append the current vertex id to the end of the received sequence
            .withColumn("message",f.concat(
                f.coalesce(f.col("_flatten1"), f.array()),
                f.coalesce(f.array(f.col("id")), f.array())
            ))
            # only forward sequences that are not duplicates of an already-detected cycle
            .where(f.col("_is_cycle2")==False)
            .select("id","message")
        )

        #print("vertics to send forward")
        #newVertices.sort("id").show(truncate=False)
       

        # cache new vertices using workaround for SPARK-1334
        cachedNewVertices = AM.getCachedDataFrame(newVertices)
        cachedNewVertices.count() # materialize it

        # update graphframe object for next round
        gx = GraphFrame(cachedNewVertices, gx.edges)

        loop_iter_end_time =time.time()


    

    end_time = datetime.now()

    return cycles
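
A minimal usage sketch (hedged: the sample edges are made up, and this assumes the graphframes package is available on the Spark classpath, e.g. started via pyspark --packages graphframes:graphframes:<version>):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cycle-detection").getOrCreate()
sc = spark.sparkContext

# toy directed graph: one cycle a -> b -> c -> a plus a dead end c -> d
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")],
    ["src", "dst"])

found = find_cycles(spark, sc, edges, max_iter=100)
found.show(truncate=False)   # expected: one row with the cycle [a, b, c]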