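I'm a bit confused about when to use VectorIndexer versus OneHotEncoder when preparing categorical variables as input to ML algorithms in Spark. Do I need OneHotEncoder only when I want to know the effect of each category level in the model output, and can I use VectorIndexer in all other cases?
An example is shown below: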
from pyspark.ml.feature import OneHotEncoder, VectorAssembler, VectorIndexer

# Two categorical columns (already numeric indices) and one continuous column
df = sqlContext.createDataFrame([
    (0.0, 3.0, 3.8),
    (1.0, 0.0, 6.7),
    (2.0, 3.0, 3.3),
    (0.0, 2.0, 1.2),
    (0.0, 1.0, 7.8),
    (2.0, 0.0, 4.4)
], ["category1", "category2", "readings"])

# One-hot encode both categorical columns; dropLast=True omits the last category
encoder = OneHotEncoder(dropLast=True, inputCols=["category1", "category2"],
                        outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)
encoded = model.transform(df)
encoded.show()
+---------+---------+--------+-------------+-------------+
|category1|category2|readings| categoryVec1| categoryVec2|
+---------+---------+--------+-------------+-------------+
|      0.0|      3.0|     3.8|(2,[0],[1.0])|    (3,[],[])|
|      1.0|      0.0|     6.7|(2,[1],[1.0])|(3,[0],[1.0])|
|      2.0|      3.0|     3.3|    (2,[],[])|    (3,[],[])|
|      0.0|      2.0|     1.2|(2,[0],[1.0])|(3,[2],[1.0])|
|      0.0|      1.0|     7.8|(2,[0],[1.0])|(3,[1],[1.0])|
|      2.0|      0.0|     4.4|    (2,[],[])|(3,[0],[1.0])|
+---------+---------+--------+-------------+-------------+
# Assemble all columns into one vector, then let VectorIndexer decide which
# features are categorical (those with at most maxCategories distinct values)
va = VectorAssembler(inputCols=df.columns, outputCol='features')
assembled = va.transform(df)
idx = VectorIndexer(inputCol='features', outputCol='features_indexed', maxCategories=4)
idx_model = idx.fit(assembled)
transformed = idx_model.transform(assembled)
transformed.show()
+---------+---------+--------+-------------+----------------+
|category1|category2|readings|     features|features_indexed|
+---------+---------+--------+-------------+----------------+
|      0.0|      3.0|     3.8|[0.0,3.0,3.8]|   [0.0,3.0,3.8]|
|      1.0|      0.0|     6.7|[1.0,0.0,6.7]|   [1.0,0.0,6.7]|
|      2.0|      3.0|     3.3|[2.0,3.0,3.3]|   [2.0,3.0,3.3]|
|      0.0|      2.0|     1.2|[0.0,2.0,1.2]|   [0.0,2.0,1.2]|
|      0.0|      1.0|     7.8|[0.0,1.0,7.8]|   [0.0,1.0,7.8]|
|      2.0|      0.0|     4.4|[2.0,0.0,4.4]|   [2.0,0.0,4.4]|
+---------+---------+--------+-------------+----------------+
idx_model.categoryMaps
{0: {0.0: 0, 1.0: 1, 2.0: 2}, 1: {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3}}
Answer 0 (score: 1)
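As far as I know, OneHotEncoder only works on numeric columns. If your categorical variable is StringType, you need to pass it through StringIndexer first and only then apply OneHotEncoder.
StringIndexer converts the labels to numeric indices, and OneHotEncoder then turns each index into an encoded vector.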
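To make the indexing step concrete, here is a plain-Python sketch of how labels could be mapped to indices by descending frequency. It is an illustration only, not Spark's implementation; the helper name index_labels is hypothetical, and the alphabetical tie-break is an assumption based on what the Spark docs describe for stringOrderType='frequencyDesc'. The input is the qualification column from the example further down, after nulls are filled with "NA".

```python
from collections import Counter


def index_labels(values):
    """Map each label to a float index, most frequent label first.

    Assumption: labels with equal frequency are ordered alphabetically.
    """
    counts = Counter(values)
    # Sort by descending count, then by the label itself to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical stand-in for the "qualification" column with nulls replaced by "NA"
qualifications = ["B.A.", "B.Com.", "B.Com.", "B.Sc.", "NA",
                  "B.A.", "B.A.", "B.A.", "B.A."]
label_to_index = index_labels(qualifications)
print(label_to_index)
```

The resulting mapping can be compared with the qual_index column in the full example to check that the ordering assumption holds for that data.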
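The way Spark prints OneHotEncoder results is not intuitive. The docs say in the notes section:
This is different from scikit-learn's OneHotEncoder, which keeps all categories. The output vectors are sparse.
In other words, with dropLast=True the last category is encoded as an all-zeros vector, which is why entries such as (2,[],[]) show up in the question's output.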
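If the sparse notation is the confusing part, the following plain-Python sketch may help. It mimics the (size, [indices], [values]) form that Spark prints and expands it into a dense list. The function names one_hot and to_dense are hypothetical helpers for illustration and are not part of the Spark API.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return (size, indices, values), mirroring Spark's printed sparse form."""
    idx = int(index)
    if not 0 <= idx < num_categories:
        raise ValueError("index out of range for the given number of categories")
    # Dropping the last category shortens the vector by one slot
    size = num_categories - 1 if drop_last else num_categories
    if idx < size:
        return (size, [idx], [1.0])
    # The dropped (last) category is represented by an all-zeros vector
    return (size, [], [])


def to_dense(sparse):
    """Expand a (size, indices, values) triple into a plain list of floats."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# category1 in the question has three levels: 0.0, 1.0 and 2.0
for category in (0.0, 1.0, 2.0):
    sparse = one_hot(category, num_categories=3)
    print(category, sparse, to_dense(sparse))
```

Passing drop_last=False keeps one slot per category, which is closer to what scikit-learn produces by default.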
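If your categorical column is a vector or an array of strings, you would use VectorIndexer first and then OneHotEncoder. Specifically, you can apply VectorIndexer to the "features" column. Here is a similar question.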
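As a rough illustration of how VectorIndexer decides which features to treat as categorical, here is a plain-Python sketch using the rows from the question. It assumes the rule is "a feature is categorical when it has at most maxCategories distinct values" and that category values are indexed in ascending order. The function fit_category_maps is a hypothetical helper, not Spark code.

```python
def fit_category_maps(rows, max_categories):
    """Return {feature_index: {value: category_index}} for categorical features."""
    category_maps = {}
    num_features = len(rows[0])
    for j in range(num_features):
        distinct = sorted({row[j] for row in rows})
        # Features with too many distinct values are treated as continuous
        if len(distinct) <= max_categories:
            category_maps[j] = {value: i for i, value in enumerate(distinct)}
    return category_maps


# Rows from the question: (category1, category2, readings)
rows = [
    (0.0, 3.0, 3.8),
    (1.0, 0.0, 6.7),
    (2.0, 3.0, 3.3),
    (0.0, 2.0, 1.2),
    (0.0, 1.0, 7.8),
    (2.0, 0.0, 4.4),
]
maps = fit_category_maps(rows, max_categories=4)
print(maps)
```

Under these assumptions the first two features come out as categorical and readings (six distinct values) is left as continuous, which agrees with the categoryMaps shown in the question.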
You first need to fill the null values in your categorical columns.
In PySpark, that is df.na.fill("value", subset=["col1","col2",...]).
In Scala, it is df.na.fill("value", Seq("col1","col2",...)).
Here is a complete worked example:
dummydata= [
(1,"John","B.A.",20,"Male"),
(2,"Martha","B.Com.",None,"Female"),
(3,"Mona","B.Com.",21,"Female"),
(4,"Harish","B.Sc.",22,"Male"),
(5,"Sam",None,35,"Male"),
(6,"Jonny","B.A.",22,"Male"),
(7,"Maria","B.A.",None,"Female"),
(8,None,"B.A.",25,"Male"),
(9,"Monalisa","B.A.",21,"Female")
]
toydf= spark.createDataFrame(data = dummydata, schema = ["id", "name", "qualification", "age", "gender"])
toydf.show()
+---+--------+-------------+----+------+
| id|    name|qualification| age|gender|
+---+--------+-------------+----+------+
|  1|    John|         B.A.|  20|  Male|
|  2|  Martha|       B.Com.|null|Female|
|  3|    Mona|       B.Com.|  21|Female|
|  4|  Harish|        B.Sc.|  22|  Male|
|  5|     Sam|         null|  35|  Male|
|  6|   Jonny|         B.A.|  22|  Male|
|  7|   Maria|         B.A.|null|Female|
|  8|    null|         B.A.|  25|  Male|
|  9|Monalisa|         B.A.|  21|Female|
+---+--------+-------------+----+------+
# Replace nulls in the string columns with the placeholder "NA"
toydf = toydf.na.fill("NA", subset=["name", "qualification"])
toydf.show()
+---+--------+-------------+----+------+
| id|    name|qualification| age|gender|
+---+--------+-------------+----+------+
|  1|    John|         B.A.|  20|  Male|
|  2|  Martha|       B.Com.|null|Female|
|  3|    Mona|       B.Com.|  21|Female|
|  4|  Harish|        B.Sc.|  22|  Male|
|  5|     Sam|           NA|  35|  Male|
|  6|   Jonny|         B.A.|  22|  Male|
|  7|   Maria|         B.A.|null|Female|
|  8|      NA|         B.A.|  25|  Male|
|  9|Monalisa|         B.A.|  21|Female|
+---+--------+-------------+----+------+
from pyspark.ml.feature import OneHotEncoder, VectorAssembler, StringIndexer, VectorIndexer

# Index the strings by descending frequency; handleInvalid='keep' reserves an index for unseen labels
indexer_1 = StringIndexer(inputCols=["qualification"], outputCols=["qual_index"],
                          handleInvalid='keep', stringOrderType='frequencyDesc')
# One-hot encode the resulting numeric index
ohe_1 = OneHotEncoder(inputCols=["qual_index"], outputCols=["qual_coded"],
                      handleInvalid='keep', dropLast=True)
toydf = indexer_1.fit(toydf).transform(toydf)
toydf = ohe_1.fit(toydf).transform(toydf)
toydf.show()
+---+--------+-------------+----+------+----------+-------------+
| id|    name|qualification| age|gender|qual_index|   qual_coded|
+---+--------+-------------+----+------+----------+-------------+
|  1|    John|         B.A.|  20|  Male|       0.0|(5,[0],[1.0])|
|  2|  Martha|       B.Com.|null|Female|       1.0|(5,[1],[1.0])|
|  3|    Mona|       B.Com.|  21|Female|       1.0|(5,[1],[1.0])|
|  4|  Harish|        B.Sc.|  22|  Male|       2.0|(5,[2],[1.0])|
|  5|     Sam|           NA|  35|  Male|       3.0|(5,[3],[1.0])|
|  6|   Jonny|         B.A.|  22|  Male|       0.0|(5,[0],[1.0])|
|  7|   Maria|         B.A.|null|Female|       0.0|(5,[0],[1.0])|
|  8|      NA|         B.A.|  25|  Male|       0.0|(5,[0],[1.0])|
|  9|Monalisa|         B.A.|  21|Female|       0.0|(5,[0],[1.0])|
+---+--------+-------------+----+------+----------+-------------+
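You may wonder why qual_coded has size 5 when the column holds only four distinct values. The arithmetic below is my reading of how the options combine, stated as an assumption rather than taken from the docs: each handleInvalid='keep' adds one slot and dropLast=True removes one. The helper encoded_size is hypothetical.

```python
def encoded_size(num_labels, indexer_keep, encoder_keep, drop_last):
    """Estimate the one-hot vector length under the assumptions stated above."""
    size = num_labels
    if indexer_keep:
        size += 1  # assumed: StringIndexer reserves an index for unseen labels
    if encoder_keep:
        size += 1  # assumed: OneHotEncoder adds a slot for invalid indices
    if drop_last:
        size -= 1  # the last slot is dropped
    return size


# Four labels (B.A., B.Com., B.Sc., NA) with the settings used in the example
size = encoded_size(num_labels=4, indexer_keep=True, encoder_keep=True, drop_last=True)
print(size)
```

For the settings used here this gives 5, matching the vectors printed above. Treat it as a sanity check on the output, not as a specification.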