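I am trying to update a DataFrame column based on a complex computation (performed in methods inside a class). From what I have seen so far, you can use a user-defined function (UDF) to update a column in a DataFrame. Unfortunately, the user-defined function has to be static. Is there any workaround?
Here is the relevant part of my code: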
'''
Louvain Community Detection Algorithm
'''
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

class LouvainCommunityDetection():
    def __init__(self, graph):
        self.graph = graph
        self.changeInModularity = True
        # wrap the static method as a UDF that returns an integer
        self.changeCommunityIdUDF = udf(LouvainCommunityDetection.changeCommunityId, IntegerType())

    @staticmethod
    def changeCommunityId(col):
        newCommunityId = 123
        # here I should compute the newCommunityId using complex operations
        # involving other methods in this class
        # like self.computeModularityGain
        # but since this is a static method... I can't use those
        return newCommunityId

    def louvain(self):
        oldModularity = 0 # since initially each node represents a community
        # retrieve graph vertices and edges dataframes
        verticesDf = self.graph.vertices
        edgesDf = self.graph.edges
        canOptimize = True
        while canOptimize:
            while self.changeInModularity:
                self.changeInModularity = False
                verticesDf = verticesDf.select('id', 'tweetCreated', 'userId', 'userName', 'parentId', self.changeCommunityIdUDF('communityId').alias('udfResult'))
                verticesDf.show()
            self.changeInModularity = False
            canOptimize = False
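Answer 0 (score: 1)
I have found a workaround; there is a great, clear explanation here.
The problem is that when any member of the object (e.g. self.changeInModularity) appears inside the udf function, applying that function to a PySpark DataFrame requires the object itself to be serialized, and the object cannot be serialized.
One (very simple) way around this is to create a reference to the member instead of a reference to the object: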
changeInModularity = self.changeInModularity
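To make the idea concrete, here is a minimal sketch in plain Python (no Spark needed) of copying a member into a local variable before defining the function. The names LouvainSketch, gain and make_change_community_id are hypothetical and not part of the original code, and the arithmetic is only a placeholder for the real community update.

```python
class LouvainSketch:
    """Hypothetical stand-in for the class in the question."""

    def __init__(self, graph, gain):
        self.graph = graph  # may hold objects that cannot be serialized
        self.gain = gain    # hypothetical plain value the function needs

    def make_change_community_id(self):
        # Copy the member into a local variable so the inner function
        # closes over the value, not over self.
        gain = self.gain

        def change_community_id(community_id):
            # Placeholder arithmetic, not the real Louvain update.
            return community_id + gain

        return change_community_id


detector = LouvainSketch(graph=object(), gain=5)
change_community_id = detector.make_change_community_id()

# The closure cells hold only the copied value, never the instance.
captured = [cell.cell_contents for cell in change_community_id.__closure__]
print(captured)                # [5]
print(change_community_id(1))  # 6
```

In PySpark, the returned function could then be wrapped with udf(change_community_id, IntegerType()). Because it only refers to the copied value, the whole object should not need to be shipped to the workers along with it.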