pyspark - PySpark error when combining low-frequency categories

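I am now running into another error while creating a UDF in PySpark. The "Merchant Category Code" field in my data has a very large number of distinct categories (high cardinality), and I want to reduce them. My approach: every category of this field whose count is less than 1000 should be assigned to a new category (MCC_lowcount).

However, the last line of the code below throws an error: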

Error: TypeError: Invalid argument, not a string or column: 1000 of type <class 'int'>. 
For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

Code:

    def cut_levels(x, threshold, new_value):
        value_counts = x1.value_counts()
        labels = value_counts.index[value_counts < threshold]
        x[np.in1d(x, labels)] = new_value

    udf_cut_levels=udf(cut_levels,StringType())
    udf_cut_levels(df1['Merchant Category Code'], 1000, 'MCC_lowcount')


0 answers: