Using object methods inside a pyspark UDF

Date: 2020-04-12 12:49:14

Tags: apache-spark pyspark apache-spark-sql

I am trying to update a column of a dataframe based on a complex computation performed by a method inside a class. From what I have seen so far, you can update a dataframe column with a user-defined function (UDF). Unfortunately, the UDF has to be static. Is there any workaround?

Here is the relevant part of my code:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

'''
Louvain Community Detection Algorithm
'''
class LouvainCommunityDetection():

    def __init__(self, graph):

        self.graph = graph
        self.changeInModularity = True
        self.changeCommunityIdUDF = udf(LouvainCommunityDetection.changeCommunityId, IntegerType())


    @staticmethod
    def changeCommunityId(col):

        newCommunityId = 123
        # here I should compute the newCommunityId using complex operations
        # involving other methods in this class
        # like self.computeModularityGain
        # but since this is a static method... I can't use those
        return newCommunityId


    def louvain(self):

        oldModularity = 0 # since initially each node represents a community

        # retrieve graph vertices and edges dataframes
        verticesDf = self.graph.vertices
        edgesDf = self.graph.edges

        canOptimize = True

        while canOptimize:

            while self.changeInModularity:

                self.changeInModularity = False
                verticesDf = verticesDf.select('id', 'tweetCreated', 'userId', 'userName', 'parentId', self.changeCommunityIdUDF('communityId').alias('udfResult'))

                verticesDf.show()

                self.changeInModularity = False

            canOptimize = False

1 Answer:

Answer 0: (score: 1)

I have found a workaround; here there is a great and clear explanation.

The problem is that when any member of the object (such as self.changeInModularity) appears inside a udf that is applied to a pyspark dataframe, the whole object has to be serialized along with the function, and the object cannot be serialized.
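The failure mode can be reproduced with plain pickle, no Spark required. The Detector class below is hypothetical, with a threading.Lock standing in for the unserializable state (such as a graph or SparkContext reference) that the real object holds:

```python
import pickle
import threading

class Detector:
    def __init__(self):
        self.changeInModularity = True   # a plain, picklable value
        self.lock = threading.Lock()     # unpicklable, like a SparkContext

detector = Detector()

# Serializing the whole object fails because of the unpicklable member.
try:
    pickle.dumps(detector)
    whole_object_ok = True
except TypeError:
    whole_object_ok = False

# Serializing just the extracted member value works fine.
flag = detector.changeInModularity
flag_roundtrip = pickle.loads(pickle.dumps(flag))

print(whole_object_ok)   # False
print(flag_roundtrip)    # True
```

This is exactly what Spark runs into when a UDF closure references self: shipping the function to the executors drags the entire object into the pickle.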

One (very simple) workaround is to create a reference to the member rather than to the object:

changeInModularity = self.changeInModularity
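Applied to the class from the question, the pattern is to copy the needed members into local variables inside a method that builds the function; the closure then captures only those picklable values, never self. The make_change_community_id helper name is my own, and the Spark-specific wrapping is omitted so the sketch runs standalone:

```python
class LouvainCommunityDetection():

    def __init__(self, graph):
        self.graph = graph
        self.changeInModularity = True

    def make_change_community_id(self):
        # Local copy: the closure below captures this value,
        # not the (unserializable) self.
        changeInModularity = self.changeInModularity

        def changeCommunityId(col):
            # the complex computation would go here, using only local copies
            newCommunityId = 123 if changeInModularity else col
            return newCommunityId

        return changeCommunityId

detection = LouvainCommunityDetection(graph=None)
fn = detection.make_change_community_id()

# The closure is self-free: only the copied value is captured.
print(fn.__code__.co_freevars)   # ('changeInModularity',)
print(fn(42))                    # 123
```

In Spark the returned function would then be registered with udf(detection.make_change_community_id(), IntegerType()) and applied to the column exactly as in the question.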