Question

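In PySpark, I want to pass a dict as the second argument of a UDF, which calls a function that is applied to one column of my DataFrame. To be precise, I don't want to pass a constant but a dictionary, so I don't know how to build the StructType for the UDF's second argument.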

Here is the function I want to apply to the column:

def enr_sex_code(column, insee_names):
    """
    Return the sex code for the given first name from insee_names
    (built from the INSEE birth file), or 0 if the name is not found.
    """
    return insee_names.get(column.upper(), 0)

"insee_names" is the dictionary.

Here is my current UDF:

# Create udf function to enrich sex code
enr_sex_code_udf = F.udf(
    normalize.enr_sex_code,
    ArrayType(
        StringType(),
        IntegerType()
    )
)

# Apply sex code enrichment udf
insee_sex_codes = {
    'George': 2,
    'Sarah': 1
}
enr_sex_code_udf(df['norm_first_name'], insee_sex_codes)

But it doesn't work: the structure of the second argument in my UDF is incorrect.

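So I would like to know how to define a dictionary structure in the UDF so that I can pass a Python dict. I don't know which data type has to be given as the second argument of the UDF in order to call my "enr_sex_code" function.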

Can you help me?

Thank you very much.

In the end, I found the solution in this post: PySpark create new column with mapping from a dict
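The approach from that post relies on F.create_map accepting alternating key and value columns, so the dict first has to be flattened into a key, value, key, value sequence. A minimal plain-Python sketch of that flattening step, using hypothetical sample data rather than the real INSEE file:

```python
from itertools import chain

# Hypothetical sample data; keys are upper-case to match the normalized names.
insee_sex_codes = {"GEORGE": 2, "SARAH": 1}

# dict.items() yields (key, value) pairs; chain(*pairs) flattens them
# into a single key, value, key, value, ... sequence.
flat = list(chain(*insee_sex_codes.items()))
print(flat)  # ['GEORGE', 2, 'SARAH', 1]
```

Each element of the flattened list is then wrapped in F.lit so that create_map receives literal columns.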

My final code is:

from itertools import chain

import pyspark.sql.functions as F

# Retrieve insee sex codes
insee_sex_codes = get_insee_sex_codes(client)
# Create a mapping with first names and sex codes to create a new column
mapping_sex_codes = F.create_map([F.lit(x) for x in chain(*insee_sex_codes.items())])
# Create sex code enrichment column from the mapping created compared to
# the normalized first name column
df = df.withColumn(
    'enr_sex_code',
    mapping_sex_codes.getItem(
        F.upper(F.col('norm_first_name'))
    )
)
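If a UDF is still preferred over the map column, one option that might work is to bind the dict in a closure instead of passing it as a call argument, since a UDF is called with columns only and the second argument of F.udf is the return type. The sketch below is an assumption-laden illustration, not the code from the linked post: make_sex_code_lookup and add_sex_code_column are hypothetical helper names, the dict keys are assumed to be upper-case, and returning 0 for null names is my own choice.

```python
def make_sex_code_lookup(insee_names):
    """Return a one-argument function with insee_names bound in a closure."""
    def lookup(first_name):
        # Null column values arrive as None inside a UDF (assumed to map to 0).
        if first_name is None:
            return 0
        return insee_names.get(first_name.upper(), 0)
    return lookup


def add_sex_code_column(df, insee_names):
    """Add 'enr_sex_code' to df using a UDF built from the closure above."""
    # Imported here so the plain-Python helper can be exercised without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    # The second argument of F.udf is the return type, not an argument type.
    enr_sex_code_udf = F.udf(make_sex_code_lookup(insee_names), IntegerType())
    return df.withColumn("enr_sex_code", enr_sex_code_udf(F.col("norm_first_name")))


# Plain-Python check of the lookup, using hypothetical upper-case keys.
lookup = make_sex_code_lookup({"GEORGE": 2, "SARAH": 1})
print(lookup("George"), lookup("unknown"), lookup(None))  # 2 0 0
```

Unlike the map-column version, this keeps the default of 0 for unknown names, at the cost of running a Python UDF for every row.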

Creating a UDF with an extra dict parameter in PySpark (Apache Spark 2.4)

0 answers: