Accessing broadcast variables in a user-defined function (UDF) defined in a separate file

Time: 2019-04-23 15:18:34

Tags: python apache-spark pyspark user-defined-functions broadcast

I have a broadcast variable set up in a separate .py file, which I import into the file that contains my UDFs. But when I try to use this variable inside a UDF, the broadcast variable is not initialized (it is None) when the UDF runs as part of a DataFrame transformation. Here is the supporting code.

The broadcast wrapper lives in utils.py and is defined as follows:

class Broadcaster(object):
    _map = {}
    _bv = None

    @staticmethod
    def set_item(k, v):
        Broadcaster._map[k] = v

    @staticmethod
    def broadcast(sc):
        Broadcaster._bv = sc.broadcast(Broadcaster._map)

    @staticmethod
    def get_item(k):
        val = Broadcaster._bv.value
        return val.get(k)

The reason for this design is to provide an interface where multiple k, v pairs can be set before broadcasting. In my main.py, I can call Broadcaster.set_item(k, v) multiple times and then call Broadcaster.broadcast(sc), which works fine. But now I want to use this broadcast variable in a UDF that lives in a separate file (say udfs.py). Note that these UDFs are used in my DataFrame processing. Below is a sample UDF:

from utils import Broadcaster

def my_udf(col):
    bv = Broadcaster._bv.value    # this throws an exception :-(
    # more code

In my udfs.py file, outside of a UDF, I have no trouble accessing Broadcaster._bv.value. Only when it is used inside a UDF, and that UDF is called from a DataFrame operation, do I get an error saying that NoneType has no attribute 'value'. Basically, the worker nodes are unable to access the broadcast variable. Why can't I use the broadcast variable across files? I have seen examples where people define the UDF in the same file as the broadcast variable, and that seems to work fine. But I need to keep these in separate files because of the volume of code. What are my options?

EDIT: I don't want to serialize the object, pass it to the UDF on each call, and deserialize it inside the UDF. I believe that defeats the purpose of a broadcast variable.
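
For reference, one arrangement often suggested for this kind of setup (an assumption here, not something confirmed in this thread) is to stop reading the handle from the class attribute inside the UDF and instead pass the broadcast handle to a small factory function in udfs.py. The likely cause of the None is that Broadcaster._bv is assigned at runtime on the driver, while each worker re-imports utils and sees the class as written, with _bv = None. A closure carries only the small broadcast handle, not the dictionary itself, so this is not the same as serializing the data on every call. In the minimal sketch below, make_lookup and FakeBroadcast are hypothetical names; FakeBroadcast only stands in for the object returned by sc.broadcast so the snippet runs without Spark.

```python
# udfs.py (sketch): build the UDF body from an explicit broadcast handle
# instead of reading Broadcaster._bv inside the function.

def make_lookup(bv):
    """Return a plain function that closes over the broadcast handle `bv`."""
    def lookup(col):
        # bv.value is resolved wherever the function actually runs
        return bv.value.get(col)
    return lookup


class FakeBroadcast(object):
    """Hypothetical stand-in for a Spark broadcast handle (local runs only)."""
    def __init__(self, value):
        self.value = value


# Local check with the stand-in handle; no Spark required.
lookup = make_lookup(FakeBroadcast({"a": 1, "b": 2}))
found = lookup("a")
missing = lookup("zzz")
```

In a real job, main.py would pass the actual handle (for example the Broadcaster._bv value created by Broadcaster.broadcast(sc)) to make_lookup and wrap the result with pyspark.sql.functions.udf, supplying whatever return type the lookup needs.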

0 answers:

No answers