Question

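I'm looking for a transformer equivalent to sklearn's MultiLabelBinarizer.

So far the only thing I've found is Binarizer, which isn't really what I need.

I've also looked through this documentation, but I don't see anything that does what I want.

My input is a single column in which each element is a list of labels: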

labels    
['a', 'b']
['a']
['c', 'b']
['a', 'c']

The output should be:

labels
[1, 1, 0]
[1, 0, 0]
[0, 1, 1]
[1, 0, 1]

What is the PySpark equivalent of this?

Answer 1

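The solution below may not be particularly optimized, but I think it is quite simple and gets the job done quickly.
We basically create a function that collects all the distinct values contained in the labels column, dynamically creates a 0/1 column for each of those values, and then packs those columns into a single array column.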

import pyspark.sql.functions as F


def multi_label_binarizer(df, labels_col='labels', output_col='new_labels'):
    """
    Function that takes as input:
    - `df`, pyspark.sql.dataframe 
    - `labels_col`, string that indicates an array column containing labels
    - `output_col`, string that indicates the name of the new labels column
    
    and returns a multi-label binarized column.
    """
    
    # get set of unique labels and sort them
    labels_set = df\
        .withColumn('exploded', F.explode(labels_col))\
        .agg(F.collect_set('exploded'))\
        .collect()[0][0]
    labels_set = sorted(labels_set)
    
    # dynamically create columns for each value in `labels_set`
    for i in labels_set:
        df = df.withColumn(i, F.when(F.array_contains(labels_col, i), 1).otherwise(0))
        
    # create new, multi-label binarized array column
    df = df.withColumn(output_col, F.array(*labels_set))
    
    return df


multi_label_binarizer(df).show()

+------+---+---+---+----------+
|labels|  a|  b|  c|new_labels|
+------+---+---+---+----------+
|[a, b]|  1|  1|  0| [1, 1, 0]|
|   [a]|  1|  0|  0| [1, 0, 0]|
|[c, b]|  0|  1|  1| [0, 1, 1]|
|[a, c]|  1|  0|  1| [1, 0, 1]|
+------+---+---+---+----------+
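
As a quick cross-check of the table above, here is a minimal plain-Python sketch of the same logic (collect the distinct labels, sort them, emit one 0/1 entry per label). It does not use Spark at all, and `binarize_labels` is a hypothetical helper name introduced only for illustration; it is meant for sanity-checking small samples, not as a replacement for the function above.

```python
def binarize_labels(rows):
    """Plain-Python reference for multi-label binarization (illustrative helper)."""
    # Collect the distinct labels and sort them so vector positions are stable.
    classes = sorted({label for row in rows for label in row})
    # Build one 0/1 entry per class, in sorted class order, for every row.
    vectors = [[1 if c in row else 0 for c in classes] for row in rows]
    return classes, vectors


classes, vectors = binarize_labels([['a', 'b'], ['a'], ['c', 'b'], ['a', 'c']])
print(classes)  # ['a', 'b', 'c']
print(vectors)  # [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 1]]
```

Sorting the labels mirrors the `sorted(labels_set)` step in the Spark version, so the positions in each vector line up with the a, b, c columns shown in the output.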
