Using object methods inside a pyspark UDF

Date: 2020-04-12 12:49:14

Tags: apache-spark pyspark apache-spark-sql

I am trying to update a column of a dataframe based on a complex computation performed by a method inside a class. From what I have seen so far, you can update a dataframe column with a user-defined function (UDF). Unfortunately, the UDF has to be static. Is there any workaround?

Here is the relevant part of my code:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

'''
Louvain Community Detection Algorithm
'''
class LouvainCommunityDetection():

    def __init__(self, graph):

        self.graph = graph
        self.changeInModularity = True
        self.changeCommunityIdUDF = udf(LouvainCommunityDetection.changeCommunityId, IntegerType())


    @staticmethod
    def changeCommunityId(col):

        newCommunityId = 123
        # here I should compute the newCommunityId using complex operations
        # involving other methods in this class
        # like self.computeModularityGain
        # but since this is a static method... I can't use those
        return newCommunityId


    def louvain(self):

        oldModularity = 0 # since initially each node represents a community

        # retrieve graph vertices and edges dataframes
        verticesDf = self.graph.vertices
        edgesDf = self.graph.edges

        canOptimize = True

        while canOptimize:

            while self.changeInModularity:

                self.changeInModularity = False
                verticesDf = verticesDf.select('id', 'tweetCreated', 'userId', 'userName', 'parentId', self.changeCommunityIdUDF('communityId').alias('udfResult'))

                verticesDf.show()

                self.changeInModularity = False

            canOptimize = False

1 Answer:

Answer 0: (score: 1)

I have found a workaround; here there is a great and clear explanation.

The problem is that when any member of the object (such as self.changeInModularity) appears inside a udf that is applied to a pyspark dataframe, the whole object has to be serialized along with the function, and the object cannot be serialized.
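The failure mode can be reproduced with plain pickle, no Spark required. The Detector class below is hypothetical, with a threading.Lock standing in for the unserializable state (such as a graph or SparkContext reference) that the real object holds:

```python
import pickle
import threading

class Detector:
    def __init__(self):
        self.changeInModularity = True   # a plain, picklable value
        self.lock = threading.Lock()     # unpicklable, like a SparkContext

detector = Detector()

# Serializing the whole object fails because of the unpicklable member.
try:
    pickle.dumps(detector)
    whole_object_ok = True
except TypeError:
    whole_object_ok = False

# Serializing just the extracted member value works fine.
flag = detector.changeInModularity
flag_roundtrip = pickle.loads(pickle.dumps(flag))

print(whole_object_ok)   # False
print(flag_roundtrip)    # True
```

This is exactly what Spark runs into when a UDF closure references self: shipping the function to the executors drags the entire object into the pickle.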

One (very simple) workaround is to create a reference to the member rather than to the object:

changeInModularity = self.changeInModularity
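Applied to the class from the question, the pattern is to copy the needed members into local variables inside a method that builds the function; the closure then captures only those picklable values, never self. The make_change_community_id helper name is my own, and the Spark-specific wrapping is omitted so the sketch runs standalone:

```python
class LouvainCommunityDetection():

    def __init__(self, graph):
        self.graph = graph
        self.changeInModularity = True

    def make_change_community_id(self):
        # Local copy: the closure below captures this value,
        # not the (unserializable) self.
        changeInModularity = self.changeInModularity

        def changeCommunityId(col):
            # the complex computation would go here, using only local copies
            newCommunityId = 123 if changeInModularity else col
            return newCommunityId

        return changeCommunityId

detection = LouvainCommunityDetection(graph=None)
fn = detection.make_change_community_id()

# The closure is self-free: only the copied value is captured.
print(fn.__code__.co_freevars)   # ('changeInModularity',)
print(fn(42))                    # 123
```

In Spark the returned function would then be registered with udf(detection.make_change_community_id(), IntegerType()) and applied to the column exactly as in the question.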