I have a broadcast variable set up in a separate .py file, and I import it into the file that contains my UDFs. But when I try to use the variable inside a UDF, the broadcast variable is not initialized (NoneType) when the UDF runs as part of a DataFrame transformation. Here is the supporting code.
The Broadcaster class is in utils.py and is defined as below:
class Broadcaster(object):
    _map = {}
    _bv = None

    @staticmethod
    def set_item(k, v):
        Broadcaster._map[k] = v

    @staticmethod
    def broadcast(sc):
        Broadcaster._bv = sc.broadcast(Broadcaster._map)

    @staticmethod
    def get_item(k):
        val = Broadcaster._bv.value
        return val.get(k)
The reason for doing this is to provide an interface where multiple k, v combinations can be set before broadcasting. This means that in my main.py I can call Broadcaster.set_item(k, v) multiple times and then eventually call Broadcaster.broadcast(sc), which works fine. But now I want to use this broadcast variable in a UDF that lives in a separate file (say udfs.py). Note that these UDFs are used in my DataFrame processing. Below is a sample UDF:
def my_udf(col):
    bv = Broadcaster._bv.value  # this throws exception :-(
    # more code
In my udfs.py file, I have no trouble accessing Broadcaster._bv.value. It is only when it is used within a UDF, and that UDF is called from within a DataFrame, that I get a "NoneType doesn't have value" exception. Basically, the worker nodes are unable to access the broadcast variable. Why can't I use the broadcast variable across files? I have seen examples where people define the UDF in the same file as the broadcast variable, and it seems to work fine. But I need to keep these in separate files due to the volume of code. What are my options?
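To make the suspected failure mode concrete, here is a minimal Spark-free sketch. It assumes that a Python worker imports the module afresh instead of inheriting class attributes that were assigned on the driver, which is my understanding of how importable modules are handled, not something confirmed for this setup. The module name bc_utils_demo and the helper function are hypothetical, and a plain string stands in for the broadcast handle.

```python
import importlib
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical module name, chosen so it cannot clash with a real utils.py.
MODULE_NAME = "bc_utils_demo"
MODULE_SRC = "class Broadcaster(object):\n    _bv = None\n"


def bv_seen_by_fresh_interpreter(module_dir):
    """Import the module in a brand-new Python process and report _bv."""
    code = f"import {MODULE_NAME} as m; print(m.Broadcaster._bv)"
    env = {**os.environ, "PYTHONPATH": module_dir}
    proc = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return proc.stdout.strip()


with tempfile.TemporaryDirectory() as module_dir:
    Path(module_dir, MODULE_NAME + ".py").write_text(MODULE_SRC)
    sys.path.insert(0, module_dir)
    demo = importlib.import_module(MODULE_NAME)

    # "Driver" side: assign the class attribute after import,
    # the way Broadcaster.broadcast(sc) does.
    demo.Broadcaster._bv = "set on the driver"
    driver_view = demo.Broadcaster._bv

    # "Worker" side: a separate interpreter only sees the class as written in the file.
    worker_view = bv_seen_by_fresh_interpreter(module_dir)

print(driver_view, "|", worker_view)
```

If that assumption holds, it would also fit the same-file examples: as far as I know, functions and classes defined in the main script are shipped to the workers by value rather than re-imported.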
EDIT: I don't want to serialize the object, pass it to the UDF on each call, and deserialize it within the UDF. I believe that defeats the purpose of a broadcast variable.
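One option that seems compatible with this constraint is to pass the broadcast handle itself (not its value) into a factory function in udfs.py, so the UDF closes over the handle instead of reading class-level state. The sketch below is only an illustration under that assumption: make_my_udf and FakeBroadcast are hypothetical names, and FakeBroadcast merely stands in for the object returned by sc.broadcast(...) so the snippet runs without Spark.

```python
class FakeBroadcast:
    """Hypothetical stand-in for the handle returned by sc.broadcast(...)."""

    def __init__(self, value):
        self.value = value


def make_my_udf(bv):
    """Build the UDF body around an explicit broadcast handle."""
    def my_udf(col):
        # bv is captured by the closure, so no class attribute is consulted here.
        lookup = bv.value
        return lookup.get(col)
    return my_udf


# In main.py this would be the real handle, e.g. bv = sc.broadcast(some_dict).
bv = FakeBroadcast({"a": 1, "b": 2})
my_udf = make_my_udf(bv)

# With Spark, the plain function would then be wrapped, e.g.
# pyspark.sql.functions.udf(make_my_udf(bv), IntegerType()).
result = [my_udf(key) for key in ("a", "b", "missing")]
print(result)
```

This still keeps the UDF code in its own file; only the call that builds the UDF needs access to the handle created in main.py.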