Question

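The VectorIndexer in Spark indexes categorical features according to the frequency of the variables. But I want to index the categorical features in a different way.

For example, with the dataset below, "a", "b", "c" will be indexed as 0, 1, 2 if I use the VectorIndexer in Spark. But I want to index them according to the label. There are 4 rows with label 1; of these, 3 rows have feature 'a' and 1 row has feature 'c'. So here I would index 'a' as 0, 'c' as 1 and 'b' as 2.

Is there a convenient way to implement this?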

 label|feature
-----------------
    1 | a
    1 | c
    0 | a
    0 | b
    1 | a
    0 | b
    0 | b
    0 | c
    1 | a

Answer 1

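If I understand your question correctly, you want to replicate the behaviour of StringIndexer() on grouped data. To do so (in pySpark), we first define a udf that operates on a List column containing all the values of each group. Note that elements with equal counts will be ordered arbitrarily.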

from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def encoder(col):
  # Generate count per letter
  x = Counter(col)

  # Create a dictionary, mapping each letter to its rank
  # (rank 0 is the most frequent letter)
  ranking = {pair[0]: rank
             for rank, pair in enumerate(x.most_common())}

  # Use dictionary to replace letters by rank
  new_list = [ranking[i] for i in col]

  return new_list

encoder_udf = udf(encoder, ArrayType(IntegerType()))
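
To check the ranking logic without a Spark session, the same function can be run on plain Python lists. The sketch below repeats the body of encoder under a hypothetical name, rank_by_frequency, and applies it to the two label groups from the question. The order of the input lists is an assumption, since collect_list() does not guarantee row order.

```python
from collections import Counter

def rank_by_frequency(values):
    # Hypothetical stand-alone copy of the encoder logic, for illustration only
    counts = Counter(values)
    # most_common() sorts by descending count; rank 0 is the most frequent value
    ranking = {value: rank
               for rank, (value, _) in enumerate(counts.most_common())}
    return [ranking[v] for v in values]

# Features of the rows with label 1 and label 0, in the order listed in the question
label_1 = ["a", "c", "a", "a"]
label_0 = ["a", "b", "b", "b", "c"]

ranks_1 = rank_by_frequency(label_1)
ranks_0 = rank_by_frequency(label_0)
print(ranks_1)
print(ranks_0)
```

For the label 1 group this yields [0, 1, 0, 0]: 'a' is the most frequent value and gets rank 0, 'c' gets rank 1. In the label 0 group 'b' gets rank 0, while 'a' and 'c' are tied with one occurrence each.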

Now we can aggregate the feature column into a list, grouped by the label column, using collect_list(), and apply our udf row by row:

from pyspark.sql.functions import collect_list, explode

df1 = (df.groupBy("label")
       .agg(collect_list("feature")
            .alias("features"))
       .withColumn("index", 
                   encoder_udf("features")))

You can then explode the index column to get the encoded values instead of the letters:

df1.select("label", explode(df1.index).alias("index")).show()
+-----+-----+
|label|index|
+-----+-----+
|    0|    1|
|    0|    0|
|    0|    0|
|    0|    0|
|    0|    2|
|    1|    0|
|    1|    1|
|    1|    0|
|    1|    0|
+-----+-----+
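
The output above ranks values separately within each label group. If a single mapping for the whole column is wanted instead, as in the question's example ('a' as 0, 'c' as 1, 'b' as 2), one possible approach is to rank values by their count among the label 1 rows and place the values that never occur with label 1 afterwards. The plain-Python sketch below illustrates that idea on the question's data. The function name label_based_index, the choice of label 1 as the reference, and the alphabetical tie-breaking are all assumptions, not part of the original answer.

```python
from collections import Counter

def label_based_index(rows, positive_label=1):
    # rows: iterable of (label, feature) pairs; hypothetical helper for illustration
    rows = list(rows)
    # Count each feature among the rows carrying the reference label only
    positive_counts = Counter(f for label, f in rows if label == positive_label)
    all_features = {f for _, f in rows}
    # Sort by descending count among reference rows, then alphabetically
    # (assumed tie-break); features absent from those rows have count 0
    ordered = sorted(all_features, key=lambda f: (-positive_counts[f], f))
    return {feature: index for index, feature in enumerate(ordered)}

# The nine (label, feature) rows from the question
rows = [(1, "a"), (1, "c"), (0, "a"), (0, "b"), (1, "a"),
        (0, "b"), (0, "b"), (0, "c"), (1, "a")]

mapping = label_based_index(rows)
print(mapping)
```

Since such a mapping has only one entry per distinct category, it would usually be small enough to build on the driver and then apply to the feature column.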
