I have a function that tries to pass a broadcast variable to a UDF.
The function is shown below.
My only intent is to pass the broadcast variable to the UDF, but the following code raises an error:
def generate_lookup_code(self, lookup_map):
    lookup_map_broadcast = spark_session.sparkContext.broadcast(lookup_map)
    print("lookup_map has been broadcasted")

    # The UDF only returns a constant string
    def _generate_code(bc_reasoncode_lookup_map):
        reasoncode_lookup_map = bc_reasoncode_lookup_map.value
        return "hello"

    udfGenerateCode = F.udf(_generate_code, StringType())
    input_df = input_df.withColumn('code', udfGenerateCode(lookup_map_broadcast))
    input_df.show()
I can't tell what is wrong here.
Answer 0 (score: 0)
You don't need to pass the broadcast variable as a UDF argument; just reference it from inside the function:
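from pyspark.sql import functions as F
from pyspark.sql.types import StringType

lookup_map_broadcast = spark_session.sparkContext.broadcast(lookup_map)
def _generate_code():
    # Read the broadcast value through the closure, not through a UDF argument
    reasoncode_lookup_map = lookup_map_broadcast.value
    return "hello"

udfGenerateCode = F.udf(_generate_code, StringType())
input_df = input_df.withColumn('code', udfGenerateCode())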
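A UDF is called once per row, and its arguments must be columns or literals. A Broadcast object is neither, so it cannot be passed as an argument.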
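To make the closure pattern concrete, here is a minimal sketch of a UDF that actually uses the broadcast map to translate a column value. The names `make_code_lookup`, `run_with_spark`, the `reason` column, and the sample data are all made up for illustration and are not from the question. The Spark wiring sits in a function that is not called automatically, so the closure itself can be checked in plain Python with a stand-in object that only has a `.value` attribute.

```python
from types import SimpleNamespace


def make_code_lookup(bc_lookup, default="unknown"):
    """Return a one-argument function that closes over a broadcast handle.

    bc_lookup only needs a `.value` attribute holding a dict, which is what
    a PySpark Broadcast object exposes.
    """
    def _lookup(reason):
        # .value is read each time the function runs for a row.
        return bc_lookup.value.get(reason, default)
    return _lookup


def run_with_spark():
    """Hypothetical wiring; needs pyspark installed and is not called here."""
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    lookup_map = {"R1": "refund", "R2": "return"}  # made-up sample data
    bc = spark.sparkContext.broadcast(lookup_map)

    input_df = spark.createDataFrame([("R1",), ("R9",)], ["reason"])
    lookup_udf = F.udf(make_code_lookup(bc), StringType())
    # Only the column is passed as an argument; the broadcast comes
    # from the closure captured by make_code_lookup.
    input_df.withColumn("code", lookup_udf(F.col("reason"))).show()
    spark.stop()


# Plain-Python check of the closure, using a stand-in for the Broadcast object.
fake_bc = SimpleNamespace(value={"R1": "refund"})
lookup = make_code_lookup(fake_bc)
print(lookup("R1"), lookup("R9"))  # prints: refund unknown
```

Calling `run_with_spark()` on a machine with PySpark installed should show the same lookup applied per row. The point of the design is that the column goes through the UDF's argument list while the broadcast handle travels with the function itself.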