我正在使用apache Spark ML lib通过一种热编码来处理分类功能。编写下面的代码后,我得到了向量c_idx_vec
作为一种热编码的输出。我确实了解如何解释此输出向量,但无法弄清楚如何将该向量转换为列,以便获得新的转换后的数据帧。例如,以该数据集为例:
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
默认情况下,OneHotEncoder将删除最后一个类别:
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(2,[0],[1.0])|
| 1.5| a| 0.0|(2,[0],[1.0])|
|10.0| b| 1.0|(2,[1],[1.0])|
| 3.2| c| 2.0| (2,[],[])|
+----+---+-----+-------------+
当然,可以更改此行为:
>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(3,[0],[1.0])|
| 1.5| a| 0.0|(3,[0],[1.0])|
|10.0| b| 1.0|(3,[1],[1.0])|
| 3.2| c| 2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
因此,我想知道如何将c_idx_vec
向量转换为新的数据帧,如下所示:
答案 0 :(得分:3)
这是您可以做的:
>>> from pyspark.ml.feature import StringIndexer
>>> from pyspark.ml.feature import OneHotEncoder, StringIndexer
>>>
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
>>>
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> oe.setDropLast(False)
OneHotEncoder_49e58b281387d8dc0c6b
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(3,[0],[1.0])|
| 1.5| a| 0.0|(3,[0],[1.0])|
|10.0| b| 1.0|(3,[1],[1.0])|
| 3.2| c| 2.0|(3,[2],[1.0])|
+----+---+-----+-------------+
// Get c and its repective index. One hot encoder will put those on same index in vector
>>> colIdx = fl.select("c","c_idx").distinct().rdd.collectAsMap()
>>> colIdx
{'c': 2.0, 'b': 1.0, 'a': 0.0}
>>>
>>> colIdx = sorted((value, "ls_" + key) for (key, value) in colIdx.items())
>>> colIdx
[(0.0, 'ls_a'), (1.0, 'ls_b'), (2.0, 'ls_c')]
>>>
>>> newCols = list(map(lambda x: x[1], colIdx))
>>> actualCol = fl.columns
>>> actualCol
['x', 'c', 'c_idx', 'c_idx_vec']
>>> allColNames = actualCol + newCols
>>> allColNames
['x', 'c', 'c_idx', 'c_idx_vec', 'ls_a', 'ls_b', 'ls_c']
>>>
>>> def extract(row):
... return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())
...
>>> result = fl.rdd.map(extract).toDF(allColNames)
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x |c |c_idx|c_idx_vec |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a |0.0 |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|1.5 |a |0.0 |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|10.0|b |1.0 |(3,[1],[1.0])|0.0 |1.0 |0.0 |
|3.2 |c |2.0 |(3,[2],[1.0])|0.0 |0.0 |1.0 |
+----+---+-----+-------------+----+----+----+
// Typecast new columns to int
>>> for col in newCols:
... result = result.withColumn(col, result[col].cast("int"))
...
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x |c |c_idx|c_idx_vec |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a |0.0 |(3,[0],[1.0])|1 |0 |0 |
|1.5 |a |0.0 |(3,[0],[1.0])|1 |0 |0 |
|10.0|b |1.0 |(3,[1],[1.0])|0 |1 |0 |
|3.2 |c |2.0 |(3,[2],[1.0])|0 |0 |1 |
+----+---+-----+-------------+----+----+----+
希望这会有所帮助!
答案 1 :(得分:2)
不确定这是最有效或最简单的方法,但是您可以使用udf来完成;从您的fl
数据帧开始:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
try:
return float(v[i])
except ValueError:
return None
ith = udf(ith_, DoubleType())
(fl.withColumn('is_a', ith("c_idx_vec", lit(0)))
.withColumn('is_b', ith("c_idx_vec", lit(1)))
.withColumn('is_c', ith("c_idx_vec", lit(2))).show())
结果是:
+----+---+-----+-------------+----+----+----+
| x| c|c_idx| c_idx_vec|is_a|is_b|is_c|
+----+---+-----+-------------+----+----+----+
| 1.0| a| 0.0|(3,[0],[1.0])| 1.0| 0.0| 0.0|
| 1.5| a| 0.0|(3,[0],[1.0])| 1.0| 0.0| 0.0|
|10.0| b| 1.0|(3,[1],[1.0])| 0.0| 1.0| 0.0|
| 3.2| c| 2.0|(3,[2],[1.0])| 0.0| 0.0| 1.0|
+----+---+-----+-------------+----+----+----+
即完全按照要求。
提供udf的HT(和+1)到this answer。
答案 2 :(得分:1)
我找不到用数据帧访问稀疏向量的方法,并将其转换为rdd。
from pyspark.sql import Row
# column names
labels = ['a', 'b', 'c']
extract_f = lambda row: Row(**row.asDict(), **dict(zip(labels, row.c_idx_vec.toArray())))
fe.rdd.map(extract_f).collect()
答案 3 :(得分:1)
鉴于在使用StringIndexer
生成索引号的情况下指定了情况,然后使用OneHotEncoderEstimator
生成了One-hot编码。从头到尾的整个代码应该像这样:
StringIndexerModel
对象被“保存” >>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>>
>>> # need to save the indexer model object for indexing label info to be used later
>>> ss_fit = ss.fit(fd)
>>> ss_fit.labels # to be used later
['a', 'b', 'c']
>>> ff = ss_fit.transform(fd)
>>> ff.show()
+----+---+-----+
| x| c|c_idx|
+----+---+-----+
| 1.0| a| 0.0|
| 1.5| a| 0.0|
|10.0| b| 1.0|
| 3.2| c| 2.0|
+----+---+-----+
OneHotEncoderEstimator
是deprecating,所以使用OneHotEncoder
类进行一次热编码>>> oe = OneHotEncoderEstimator(inputCols=["c_idx"],outputCols=["c_idx_vec"])
>>> oe_fit = oe.fit(ff)
>>> fe = oe_fit.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
| x| c|c_idx| c_idx_vec|
+----+---+-----+-------------+
| 1.0| a| 0.0|(2,[0],[1.0])|
| 1.5| a| 0.0|(2,[0],[1.0])|
|10.0| b| 1.0|(2,[1],[1.0])|
| 3.2| c| 2.0| (2,[],[])|
+----+---+-----+-------------+
0.0
或1.0
。>>> from pyspark.sql.types dimport FloatType, IntegerType
>>> from pyspark.sql.functions import lit, udf
>>> ith = udf(lambda v, i: float(v[i]), FloatType())
>>> fx = fe
>>> for sidx, oe_col in zip([ss_fit], oe.getOutputCols()):
...
... # iterate over string values and ignore the last one
... for ii, val in list(enumerate(sidx.labels))[:-1]:
... fx = fx.withColumn(
... sidx.getInputCol() + '_' + val,
... ith(oe_col, lit(ii)).astype(IntegerType())
... )
>>> fx.show()
+----+---+-----+-------------+---+---+
| x| c|c_idx| c_idx_vec|c_a|c_b|
+----+---+-----+-------------+---+---+
| 1.0| a| 0.0|(2,[0],[1.0])| 1| 0|
| 1.5| a| 0.0|(2,[0],[1.0])| 1| 0|
|10.0| b| 1.0|(2,[1],[1.0])| 0| 1|
| 3.2| c| 2.0| (2,[],[])| 0| 0|
+----+---+-----+-------------+---+---+
请注意,默认情况下,Spark会删除最后一个类别。因此,按照行为,此处c_c
列不是必需的。