将广播变量传递到UDF,Pyspark时出错

时间:2018-08-30 18:07:25

标签: apache-spark pyspark broadcast

我有一个函数,该函数试图将广播变量传递给UDF。

函数如下:

{
  "stores": [
        {
        "storeName": "Master Bistro",
        "storeId": "3046",
        "attendants": [
            {
            "attendantName": "Janis Joplin",
            "attendantId": "9784526",
            "total": 2000,
            "tenderTotal": {
                "Cash": 500,
                "TC": 0,
                "UOD": 500,
                "MC": 250,
                "VI": 250,
                "AX": 250,
                "DI": 250,
                "JC": 0,
                "DC": 0,
                "UOP": 0,
                "GN": 0,
                "UOGC": 0,
                "HOTEL": 0,
                "NCTNCG": 0
                }
            },
            {
            "attendantName": "David Bowie",
            "attendantId": "2589456",
            "total": 14675,
            "tenderTotal": {
                "Cash": 175,
                "TC": 0,
                "UOD": 100,
                "MC": 9500,
                "VI": 3500,
                "AX": 550,
                "DI": 850,
                "JC": 0,
                "DC": 0,
                "UOP": 0,
                "GN": 0,
                "UOGC": 0,
                "HOTEL": 0,
                "NCTNCG": 0
                }
            },
            {
            "attendantName": "Michael Jackson",
            "attendantId": "5478264",
            "total": 15599,
                "tenderTotal": {
                    "Cash": 250,
                    "TC": 0,
                    "UOD": 80,
                    "MC": 5624,
                    "VI": 6895,
                    "AX": 2500,
                    "DI": 250,
                    "JC": 0,
                    "DC": 0,
                    "UOP": 0,
                    "GN": 0,
                    "UOGC": 0,
                    "HOTEL": 0,
                    "NCTNCG": 0
                }
            }
        ],
            "message": "Store totals for 08/20/2018",
            "date":"08/20/2018"
    },{

        "storeName": "The Master  Marketplace",
        "storeId": "3047",
        "attendants": [
            {
                "attendantName": "Dirk Novitski",
                "attendantId": "9784527",
                "total": 2000,
                "tenderTotal": {
                    "Cash": 500,
                    "TC": 0,
                    "UOD": 500,
                    "MC": 250,
                    "VI": 250,
                    "AX": 250,
                    "DI": 250,
                    "JC": 0,
                    "DC": 0,
                    "UOP": 0,
                    "GN": 0,
                    "UOGC": 0,
                    "HOTEL": 0,
                    "NCTNCG": 0
                }
            },
            {
                "attendantName": "Carmello Anthony",
                "attendantId": "2589458",
                "total": 14675,
                "tenderTotal": {
                    "Cash": 175,
                    "TC": 0,
                    "UOD": 100,
                    "MC": 9500,
                    "VI": 3500,
                    "AX": 550,
                    "DI": 850,
                    "JC": 0,
                    "DC": 0,
                    "UOP": 0,
                    "GN": 0,
                    "UOGC": 0,
                    "HOTEL": 0,
                    "NCTNCG": 0
                }
            },
            {
                "attendantName": "Stevie Wonder",
                "attendantId": "5478266",
                "total": 15599,
                "tenderTotal": {
                    "Cash": 250,
                    "TC": 0,
                    "UOD": 80,
                    "MC": 5624,
                    "VI": 6895,
                    "AX": 2500,
                    "DI": 250,
                    "JC": 0,
                    "DC": 0,
                    "UOP": 0,
                    "GN": 0,
                    "UOGC": 0,
                    "HOTEL": 0,
                    "NCTNCG": 0
                }

            }
        ],
            "message": "Store totals for 08/22/2018",
            "date":"08/21/2018"
        }
    ]    
}

我的意图只是尝试将广播变量传递给UDF,但是,我得到了错误:

def generate_lookup_code(self, lookup_map):

    lookup_map_broadcast = spark_session.sparkContext.broadcast(lookup_map)
    print("lookup_map has been broadcasted")

    #### UDF function only return a constant string###
    def _generate_code(bc_reasoncode_lookup_map):

        reasoncode_lookup_map = bc_reasoncode_lookup_map.value
        return "hello"


    udfGenerateCode = F.udf(_generate_code, StringType())

    input_df = input_df.withColumn('code', udfGenerateCode(lookup_map_broadcast))

    input_df.show()

我不知道哪里错了?

1 个答案:

答案 0 :(得分:0)

您不需要将广播变量作为UDF参数传递,只需从函数中引用它即可:

lookup_map_broadcast = spark_session.sparkContext.broadcast(lookup_map)

def _generate_code():
    reasoncode_lookup_map = lookup_map_broadcast.value
    return "hello"

udfGenerateCode = F.udf(_generate_code, StringType())
input_df = input_df.withColumn('code', udfGenerateCode())

为每行调用一个UDF,它可以接受列或文字。