Use VectorIndexer or OneHotEncoder for categorical variables?

Asked: 2020-09-14 08:55:18

Tags: pyspark

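When passing categorical variables as input to ML algorithms in Spark, I am a bit confused about when to use VectorIndexer and when to use OneHotEncoder. Is it the case that I need OneHotEncoder when I want to see the effect of each category level in the ML output, and that VectorIndexer is fine in all other cases?

An example is shown below: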

from pyspark.ml.feature import OneHotEncoder, VectorAssembler, VectorIndexer

df = spark.createDataFrame([
    (0.0, 3.0, 3.8),
    (1.0, 0.0, 6.7),
    (2.0, 3.0, 3.3),
    (0.0, 2.0, 1.2),
    (0.0, 1.0, 7.8),
    (2.0, 0.0, 4.4)
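], ["category1", "category2", "readings"])

# One-hot encode both index columns; dropLast=True omits the last category,
# which is then represented by an all-zero vector.
encoder = OneHotEncoder(dropLast=True, inputCols=["category1", "category2"],
                        outputCols=["categoryVec1", "categoryVec2"])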
model = encoder.fit(df)
encoded = model.transform(df)
encoded.show()


+---------+---------+--------+-------------+-------------+
|category1|category2|readings| categoryVec1| categoryVec2|
+---------+---------+--------+-------------+-------------+
|      0.0|      3.0|     3.8|(2,[0],[1.0])|    (3,[],[])|
|      1.0|      0.0|     6.7|(2,[1],[1.0])|(3,[0],[1.0])|
|      2.0|      3.0|     3.3|    (2,[],[])|    (3,[],[])|
|      0.0|      2.0|     1.2|(2,[0],[1.0])|(3,[2],[1.0])|
|      0.0|      1.0|     7.8|(2,[0],[1.0])|(3,[1],[1.0])|
|      2.0|      0.0|     4.4|    (2,[],[])|(3,[0],[1.0])|
+---------+---------+--------+-------------+-------------+


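# Assemble all columns into one vector, then let VectorIndexer treat every
# feature with at most maxCategories distinct values as categorical.
va = VectorAssembler(inputCols=df.columns, outputCol='features')
assembled = va.transform(df)
idx = VectorIndexer(inputCol='features', outputCol='features_indexed', maxCategories=4)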
idx_model = idx.fit(assembled)
transformed = idx_model.transform(assembled)
transformed.show()

+---------+---------+--------+-------------+----------------+
|category1|category2|readings|     features|features_indexed|
+---------+---------+--------+-------------+----------------+
|      0.0|      3.0|     3.8|[0.0,3.0,3.8]|   [0.0,3.0,3.8]|
|      1.0|      0.0|     6.7|[1.0,0.0,6.7]|   [1.0,0.0,6.7]|
|      2.0|      3.0|     3.3|[2.0,3.0,3.3]|   [2.0,3.0,3.3]|
|      0.0|      2.0|     1.2|[0.0,2.0,1.2]|   [0.0,2.0,1.2]|
|      0.0|      1.0|     7.8|[0.0,1.0,7.8]|   [0.0,1.0,7.8]|
|      2.0|      0.0|     4.4|[2.0,0.0,4.4]|   [2.0,0.0,4.4]|
+---------+---------+--------+-------------+----------------+

idx_model.categoryMaps

{0: {0.0: 0, 1.0: 1, 2.0: 2}, 1: {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3}}

1 Answer:

Answer 0: (score: 1)

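As far as I know, OneHotEncoder works only on numeric columns. If your categorical variable is a StringType, you need to pass it through StringIndexer first before you can apply OneHotEncoder.
StringIndexer converts the labels to numeric indices, and OneHotEncoder then creates an encoded vector column from those indices.
The way Spark outputs OneHotEncoder results is unintuitive; the docs say in the notes section:

"This is different from scikit-learn's OneHotEncoder, which keeps all categories. The output vectors are sparse."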
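
To make the sparse output in the question easier to read, here is a minimal plain-Python sketch of what a one-hot encoding with dropLast does. It does not call Spark; `one_hot_sparse` and `to_dense` are hypothetical helper names invented for illustration, and the `(size, indices, values)` tuple only mimics the way Spark prints a sparse vector.

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Mimic a one-hot encoded value as (size, indices, values).

    Illustration only, not Spark's implementation. With drop_last=True the
    last category gets no slot, so it comes out as an all-zero vector.
    """
    i = int(index)
    if not 0 <= i < num_categories:
        raise ValueError("index out of range for this sketch")
    size = num_categories - 1 if drop_last else num_categories
    if i < size:
        return (size, [i], [1.0])
    return (size, [], [])


def to_dense(sparse):
    """Expand a (size, indices, values) tuple into a plain list."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# category1 in the question has 3 levels: 0.0, 1.0 and 2.0
for level in (0.0, 1.0, 2.0):
    encoded = one_hot_sparse(level, 3)
    print(level, encoded, to_dense(encoded))
```

Read this way, `(2,[],[])` in the question's output is simply the last level (2.0) of category1, and `(3,[],[])` is the last level (3.0) of category2.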

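If your categorical columns have already been assembled into a vector column, you would use VectorIndexer, followed by OneHotEncoder if you need one-hot vectors. Note that VectorIndexer operates on a Vector column of numeric values (such as the output of VectorAssembler), not on raw strings. Specifically, you can apply VectorIndexer to the "features" column. Here is a similar question.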
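
The rule VectorIndexer uses to decide which features are categorical can be sketched in plain Python. This is only an approximation under stated assumptions: `build_category_maps` is a hypothetical name, and the sketch assigns indices by sorting the distinct values in ascending order, which may not match Spark's ordering in every case (for example with negative values).

```python
def build_category_maps(rows, max_categories):
    """Return {feature_index: {value: category_index}} for every feature
    that has at most max_categories distinct values.

    Illustration only; features with more distinct values are left out,
    meaning they would be treated as continuous.
    """
    category_maps = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            category_maps[j] = {value: idx for idx, value in enumerate(distinct)}
    return category_maps


# The rows from the question: category1, category2, readings
rows = [
    (0.0, 3.0, 3.8),
    (1.0, 0.0, 6.7),
    (2.0, 3.0, 3.3),
    (0.0, 2.0, 1.2),
    (0.0, 1.0, 7.8),
    (2.0, 0.0, 4.4),
]

maps = build_category_maps(rows, max_categories=4)
print(maps)
```

With maxCategories = 4, features 0 and 1 have 3 and 4 distinct values and are treated as categorical, while readings has 6 distinct values and stays continuous. Because the category values in the question are already 0, 1, 2, ..., the indexed vector is identical to the input vector, which is why features and features_indexed look the same there.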

You need to fill the null values in the categorical columns first.
In PySpark, that is df.na.fill("value", subset=["col1","col2",...])
In Scala, it is df.na.fill("value", Seq("col1","col2",...))

Here is a complete worked example:

dummydata= [
  (1,"John","B.A.",20,"Male"),
  (2,"Martha","B.Com.",None,"Female"),
  (3,"Mona","B.Com.",21,"Female"),
  (4,"Harish","B.Sc.",22,"Male"),
  (5,"Sam",None,35,"Male"),
  (6,"Jonny","B.A.",22,"Male"),
  (7,"Maria","B.A.",None,"Female"),
  (8,None,"B.A.",25,"Male"),
  (9,"Monalisa","B.A.",21,"Female")
]

toydf = spark.createDataFrame(data=dummydata, schema=["id", "name", "qualification", "age", "gender"])

toydf.show()
+---+--------+-------------+----+------+
| id|    name|qualification| age|gender|
+---+--------+-------------+----+------+
|  1|    John|         B.A.|  20|  Male|
|  2|  Martha|       B.Com.|null|Female|
|  3|    Mona|       B.Com.|  21|Female|
|  4|  Harish|        B.Sc.|  22|  Male|
|  5|     Sam|         null|  35|  Male|
|  6|   Jonny|         B.A.|  22|  Male|
|  7|   Maria|         B.A.|null|Female|
|  8|    null|         B.A.|  25|  Male|
|  9|Monalisa|         B.A.|  21|Female|
+---+--------+-------------+----+------+

# Replace nulls in the string columns with the placeholder label "NA".
toydf = toydf.na.fill("NA", subset=["name", "qualification"])

toydf.show()
+---+--------+-------------+----+------+
| id|    name|qualification| age|gender|
+---+--------+-------------+----+------+
|  1|    John|         B.A.|  20|  Male|
|  2|  Martha|       B.Com.|null|Female|
|  3|    Mona|       B.Com.|  21|Female|
|  4|  Harish|        B.Sc.|  22|  Male|
|  5|     Sam|           NA|  35|  Male|
|  6|   Jonny|         B.A.|  22|  Male|
|  7|   Maria|         B.A.|null|Female|
|  8|      NA|         B.A.|  25|  Male|
|  9|Monalisa|         B.A.|  21|Female|
+---+--------+-------------+----+------+

from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Map each qualification string to a numeric index, most frequent label first.
indexer_1 = StringIndexer(inputCols=["qualification"], outputCols=["qual_index"],
                          handleInvalid='keep', stringOrderType='frequencyDesc')

# One-hot encode the numeric index into a sparse vector column.
ohe_1 = OneHotEncoder(inputCols=["qual_index"], outputCols=["qual_coded"],
                      handleInvalid='keep', dropLast=True)

toydf = indexer_1.fit(toydf).transform(toydf)
toydf = ohe_1.fit(toydf).transform(toydf)

toydf.show()
+---+--------+-------------+----+------+----------+-------------+
| id|    name|qualification| age|gender|qual_index|   qual_coded|
+---+--------+-------------+----+------+----------+-------------+
|  1|    John|         B.A.|  20|  Male|       0.0|(5,[0],[1.0])|
|  2|  Martha|       B.Com.|null|Female|       1.0|(5,[1],[1.0])|
|  3|    Mona|       B.Com.|  21|Female|       1.0|(5,[1],[1.0])|
|  4|  Harish|        B.Sc.|  22|  Male|       2.0|(5,[2],[1.0])|
|  5|     Sam|           NA|  35|  Male|       3.0|(5,[3],[1.0])|
|  6|   Jonny|         B.A.|  22|  Male|       0.0|(5,[0],[1.0])|
|  7|   Maria|         B.A.|null|Female|       0.0|(5,[0],[1.0])|
|  8|      NA|         B.A.|  25|  Male|       0.0|(5,[0],[1.0])|
|  9|Monalisa|         B.A.|  21|Female|       0.0|(5,[0],[1.0])|
+---+--------+-------------+----+------+----------+-------------+
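
The qual_index values above follow from the frequencyDesc ordering. As a rough plain-Python sketch (not Spark code; `frequency_desc_index` is a hypothetical helper, and breaking ties alphabetically is an assumption that happens to match the output shown here), the mapping can be reproduced like this:

```python
from collections import Counter


def frequency_desc_index(labels):
    """Assign 0.0 to the most frequent label, 1.0 to the next, and so on.

    Illustration only. Ties are broken alphabetically, which is an
    assumption of this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# The qualification column after nulls were filled with "NA"
qualifications = ["B.A.", "B.Com.", "B.Com.", "B.Sc.", "NA",
                  "B.A.", "B.A.", "B.A.", "B.A."]

mapping = frequency_desc_index(qualifications)
print(mapping)
```

B.A. appears five times and gets index 0.0, B.Com. appears twice and gets 1.0, and the two single-occurrence labels B.Sc. and NA get 2.0 and 3.0, matching the qual_index column.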