One-hot encoding multiple string categorical features using Spark DataFrames

Date: 2019-11-22 13:26:37

Tags: python apache-spark pyspark apache-spark-sql bigdata

My goal is to one-hot encode a list of categorical columns using Spark DataFrames, similar to what the get_dummies() function does in Pandas.
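For comparison, the Pandas version is essentially a one-liner (a rough sketch for illustration only; it assumes bureau.csv fits in memory, which is exactly what I want to avoid):

import pandas as pd

# Pandas reference: get_dummies() creates one 0/1 column per category of each string column
pdf = pd.read_csv("bureau.csv")
dummies = pd.get_dummies(pdf[['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE']])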

The dataset bureau.csv was originally taken from the Kaggle competition Home Credit Default Risk. Here is a sample of my entry table, entryData, filtered so that only KEY = 100001 is kept.

# primary key
KEY = 'SK_ID_CURR'
data = spark.read.csv("bureau.csv", header=True, inferSchema=True)
# sample data from bureau.csv of 1716428 rows
entryData = data.select([KEY] + columnList).where(F.col(KEY) == 100001)
entryData.show()
+----------+-------------+---------------+---------------+
|SK_ID_CURR|CREDIT_ACTIVE|CREDIT_CURRENCY|    CREDIT_TYPE|
+----------+-------------+---------------+---------------+
|    100001|       Closed|     currency 1|Consumer credit|
|    100001|       Closed|     currency 1|Consumer credit|
|    100001|       Closed|     currency 1|Consumer credit|
|    100001|       Closed|     currency 1|Consumer credit|
|    100001|       Active|     currency 1|Consumer credit|
|    100001|       Active|     currency 1|Consumer credit|
|    100001|       Active|     currency 1|Consumer credit|
+----------+-------------+---------------+---------------+

I am one-hot encoding the columns in the list columnList by building a function catg_encode(entryData, columnList), where

columnList = cols_type(entryData, obj=True)[1:]
print(columnList)
['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE']

Note: cols_type() is a function that returns a list of columns, either the categorical ones (if obj=True) or the numeric ones (if obj=False).
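The helper itself is not shown in the question; a minimal sketch of what it might look like, assuming string-typed columns are the categorical ones:

# hypothetical sketch of cols_type(); the real helper is not shown in the question
def cols_type(df, obj=True):
    # treat string columns as categorical, everything else as numeric
    if obj:
        return [name for name, dtype in df.dtypes if dtype == 'string']
    return [name for name, dtype in df.dtypes if dtype != 'string']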

I have successfully one-hot encoded the first column, 'CREDIT_ACTIVE', but I could not do it for the whole list of columns at once, i.e. build the function catg_encode.

# import necessary modules
from pyspark.sql import functions as F

# look for all distinct categories within a given feature (here 'CREDIT_ACTIVE')
categories = entryData.select(columnList[0]).distinct().rdd.flatMap(lambda x: x).collect()
# one-hot encode the categories
exprs = [F.when(F.col(columnList[0]) == category, 1).otherwise(0).alias(category) for category in categories]
# nice table with encoded feature 'CREDIT_ACTIVE'
oneHotEncode = entryData.select(KEY, *exprs)
oneHotEncode.show()
+----------+--------+----+------+------+
|SK_ID_CURR|Bad debt|Sold|Active|Closed|
+----------+--------+----+------+------+
|    100001|       0|   0|     0|     1|
|    100001|       0|   0|     0|     1|
|    100001|       0|   0|     0|     1|
|    100001|       0|   0|     0|     1|
|    100001|       0|   0|     1|     0|
|    100001|       0|   0|     1|     0|
|    100001|       0|   0|     1|     0|
+----------+--------+----+------+------+

The feature 'CREDIT_ACTIVE' has 4 distinct categories here: ['Bad debt', 'Sold', 'Active', 'Closed'].

Note that I even tried IndexToString and OneHotEncoderEstimator, but they did not help with this specific task.

I would like to obtain the following output,

+----------+--------+----+------+------+----------+----------+----------+----------+----------+---
|SK_ID_CURR|Bad debt|Sold|Active|Closed|currency 1|currency 2|currency 3|currency 4|..........|...
+----------+--------+----+------+------+----------+----------+----------+----------+----------+---
|    100001|       0|   0|     0|     1|         1|         0|         0|         0|        ..|   
|    100001|       0|   0|     0|     1|         1|         0|         0|         0|        ..|
|    100001|       0|   0|     0|     1|         1|         0|         0|         0|        ..|
|    100001|       0|   0|     0|     1|         1|         0|         0|         0|        ..|
|    100001|       0|   0|     1|     0|         1|         0|         0|         0|        ..|
|    100001|       0|   0|     1|     0|         1|         0|         0|         0|        ..|
|    100001|       0|   0|     1|     0|         1|         0|         0|         0|        ..|
+----------+--------+----+------+------+----------+----------+----------+----------+----------+--- 

where the trailing dots ... stand for the remaining categories of the feature 'CREDIT_TYPE', which are

['Loan for the purchase of equipment', 'Cash loan (non-earmarked)', 'Microloan', 'Consumer credit', 'Mobile operator loan', 'Another type of loan', 'Mortgage', 'Interbank credit', 'Loan for working capital replenishment', 'Car loan', 'Real estate loan', 'Unknown type of loan', 'Loan for business development', 'Credit card', 'Loan for purchase of shares (margin lending)']

Remark: I have seen the post E-num / get Dummies in pyspark, but it does not automate the process for many columns (the big-data case). That post gives a solution that writes separate code for each categorical feature, which is not my problem here.

2 Answers:

Answer 0: (score: 0)

There are two ways to squeeze this lemon. Let's look at both of them.

  1. Pivot and join

import pyspark.sql.functions as f

df1 = spark._sc.parallelize([
    [100001, 'Closed', 'currency 1', 'Consumer credit'],
    [100001, 'Closed', 'currency 1', 'Consumer credit'],
    [100001, 'Closed', 'currency 1', 'Consumer credit'],
    [100001, 'Closed', 'currency 1', 'Consumer credit'],
    [100001, 'Active', 'currency 1', 'Consumer credit'],
    [100001, 'Active', 'currency 1', 'Consumer credit'],
    [100001, 'Active', 'currency 1', 'Consumer credit'],
    [100002, 'Active', 'currency 2', 'Consumer credit'],
]).toDF(['SK_ID_CURR', 'CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE'])

# this can be done dynamically, but I don't have all categories
categories = ['Active', 'Closed', 'Bad debt', 'Sold']

# we need to pivot without aggregation, so I need to add an `id` column and group by it as well
credit_groups = (
    df1.withColumn('id', f.monotonically_increasing_id())
       .groupBy('SK_ID_CURR', 'id')
       .pivot('CREDIT_ACTIVE', values=categories)
       .agg(f.lit(1))
       .drop('id')
)

# currency groups are just a 1 for each currency and ID, as per the example data
# if this is not the case, something more clever needs to be here
currency_groups = df1.groupBy('SK_ID_CURR').pivot('CREDIT_CURRENCY').agg(f.lit(1))

# join the two pivoted tables on the ID and fill nulls to zeroes
credit_groups.join(currency_groups, on=['SK_ID_CURR'], how='inner').na.fill(0).show()

+----------+------+------+--------+----+----------+----------+
|SK_ID_CURR|Active|Closed|Bad debt|Sold|currency 1|currency 2|
+----------+------+------+--------+----+----------+----------+
|    100002|     1|     0|       0|   0|         0|         1|
|    100001|     0|     1|       0|   0|         1|         0|
|    100001|     1|     0|       0|   0|         1|         0|
|    100001|     1|     0|       0|   0|         1|         0|
|    100001|     0|     1|       0|   0|         1|         0|
|    100001|     0|     1|       0|   0|         1|         0|
|    100001|     1|     0|       0|   0|         1|         0|
|    100001|     0|     1|       0|   0|         1|         0|
+----------+------+------+--------+----+----------+----------+
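The comment in the code above notes that the categories could be gathered dynamically instead of hard-coded; a minimal sketch of that (my addition, not part of the original answer, reusing the df1 defined above):

# sketch: collect the distinct CREDIT_ACTIVE values from the data itself
categories = [row['CREDIT_ACTIVE']
              for row in df1.select('CREDIT_ACTIVE').distinct().collect()]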
  2. Use StringIndexer and OneHotEncoderEstimator, for example:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column + "_NUMERIC").fit(df1)
            for column in ['CREDIT_ACTIVE', 'CREDIT_CURRENCY']]

pipeline = Pipeline(stages=indexers)
df_indexed = pipeline.fit(df1).transform(df1)
df_indexed.show()

+----------+-------------+---------------+---------------+---------------------+-----------------------+
|SK_ID_CURR|CREDIT_ACTIVE|CREDIT_CURRENCY|    CREDIT_TYPE|CREDIT_ACTIVE_NUMERIC|CREDIT_CURRENCY_NUMERIC|
+----------+-------------+---------------+---------------+---------------------+-----------------------+
|    100001|       Closed|     currency 1|Consumer credit|                  0.0|                    0.0|
|    100001|       Closed|     currency 1|Consumer credit|                  0.0|                    0.0|
|    100001|       Closed|     currency 1|Consumer credit|                  0.0|                    0.0|
|    100001|       Closed|     currency 1|Consumer credit|                  0.0|                    0.0|
|    100001|       Active|     currency 1|Consumer credit|                  1.0|                    0.0|
|    100001|       Active|     currency 1|Consumer credit|                  1.0|                    0.0|
|    100001|       Active|     currency 1|Consumer credit|                  1.0|                    0.0|
|    100002|       Active|     currency 2|Consumer credit|                  1.0|                    1.0|
+----------+-------------+---------------+---------------+---------------------+-----------------------+

From here on, you would apply one-hot encoding to the newly created numeric columns. I personally recommend route 1, as it is more readable. Route 2, however, lets you also chain the OneHotEncoderEstimator into the declared Pipeline, so the code can be run in a single line once everything is declared. Hope this helps.
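A rough sketch of that chaining (my addition, not the answerer's code; it assumes Spark 2.3/2.4, where OneHotEncoderEstimator exists, and note it produces vector-valued columns rather than separate 0/1 columns):

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer

cols = ['CREDIT_ACTIVE', 'CREDIT_CURRENCY']
# index the string columns, then one-hot encode the indices within the same pipeline
indexers = [StringIndexer(inputCol=c, outputCol=c + "_NUMERIC") for c in cols]
encoder = OneHotEncoderEstimator(inputCols=[c + "_NUMERIC" for c in cols],
                                 outputCols=[c + "_OHE" for c in cols])
pipeline = Pipeline(stages=indexers + [encoder])
pipeline.fit(df1).transform(df1).show()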

Answer 1: (score: 0)

The OHE defined in SparkML handles only one column at a time, which may be suboptimal here. You can implement this multi-column OHE yourself; you are actually on the right track.

import pyspark.sql.functions as F

# let's define some data
l = [('a', 1), ('b', 2), ('c', 1), ('a', 1)]
df = spark.createDataFrame(l, ['c1', 'c2'])
# the list of column we want to encode
cols = ['c1', 'c2']

# defining a struct that associates each column name to its value
col_struct = [
  F.struct(F.lit(c).alias('key'),
           F.col(c).cast('string').alias('value')) for c in cols
]

# Then we explode these struct, group by column name and collect the
# distinct values. Finally, we collect everything to the driver.
ohe_rows = df.distinct()\
  .select(*cols).select(F.explode(F.array(*col_struct)).alias("x"))\
  .groupBy("x.key")\
  .agg(F.collect_set(F.col("x.value")).alias("values"))\
  .collect()
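# (added note) ohe_rows is now a small list of Row objects on the driver, roughly:
# [Row(key='c1', values=['c', 'b', 'a']), Row(key='c2', values=['1', '2'])]
# (the order of rows, and of values inside collect_set, is not deterministic)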

# then we build one spark column per column and per value of that column
# so as to encode the values
ohe = [
          [
              F.when(F.col(row['key']) == value, 1)
               .otherwise(0)
               .alias(row['key']+'_'+value) for value in row['values']
          ] for row in ohe_rows
      ]

# ohe is a list of lists so we use itertools to flatten it
import itertools
ohe_list = list(itertools.chain(*ohe))

# and voila
df.select(*([df.c1, df.c2] + ohe_list)).show()
+---+---+----+----+----+----+----+
| c1| c2|c1_c|c1_b|c1_a|c2_1|c2_2|
+---+---+----+----+----+----+----+
|  a|  1|   0|   0|   1|   1|   0|
|  b|  2|   0|   1|   0|   0|   1|
|  c|  1|   1|   0|   0|   1|   0|
|  a|  1|   0|   0|   1|   1|   0|
+---+---+----+----+----+----+----+
# or simply df.select(*ohe_list)
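
To get the catg_encode(entryData, columnList) function asked for in the question, the same logic can be wrapped up roughly like this (a sketch built on the answer above, not the answerer's code; the default key column name is taken from the question):

import itertools
from pyspark.sql import functions as F

def catg_encode(df, columns, key='SK_ID_CURR'):
    # build (column name, value) structs for every requested column
    col_struct = [F.struct(F.lit(c).alias('key'),
                           F.col(c).cast('string').alias('value')) for c in columns]
    # collect the distinct values of each column to the driver
    ohe_rows = (df.select(*columns).distinct()
                  .select(F.explode(F.array(*col_struct)).alias('x'))
                  .groupBy('x.key')
                  .agg(F.collect_set(F.col('x.value')).alias('values'))
                  .collect())
    # one 0/1 column per (column, value) pair
    ohe_list = list(itertools.chain(*[
        [F.when(F.col(row['key']) == value, 1).otherwise(0)
          .alias(row['key'] + '_' + value) for value in row['values']]
        for row in ohe_rows
    ]))
    return df.select(key, *ohe_list)

# usage on the question's data:
# catg_encode(entryData, columnList).show()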