例如,我在name
中具有带有分类功能的DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("example")
.config("spark.some.config.option", "some-value").getOrCreate()
features = [(['a', 'b', 'c'], 1),
(['a', 'c'], 2),
(['d'], 3),
(['b', 'c'], 4),
(['a', 'b', 'd'], 5)]
df = spark.createDataFrame(features, ['name','id'])
df.show()
出局:
+---------+----+
| name| id |
+---------+----+
|[a, b, c]| 1|
| [a, c]| 2|
| [d]| 3|
| [b, c]| 4|
|[a, b, d]| 5|
+---------+----+
我想要得到什么:
+--------+--------+--------+--------+----+
| name_a | name_b | name_c | name_d | id |
+--------+--------+--------+--------+----+
| 1 | 1 | 1 | 0 | 1 |
+--------+--------+--------+--------+----+
| 1 | 0 | 1 | 0 | 2 |
+--------+--------+--------+--------+----+
| 0 | 0 | 0 | 1 | 3 |
+--------+--------+--------+--------+----+
| 0 | 1 | 1 | 0 | 4 |
+--------+--------+--------+--------+----+
| 1 | 1 | 0 | 1 | 5 |
+--------+--------+--------+--------+----+
我找到了same queston,但没有任何帮助。
我尝试使用VectorIndexer
中的PySpark.ML
,但是将name
字段转换为vector type
时遇到了一些问题。
from pyspark.ml.feature import VectorIndexer
indexer = VectorIndexer(inputCol="name", outputCol="indexed", maxCategories=5)
indexerModel = indexer.fit(df)
我收到以下错误:
Column name must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType
我找到了一个解决方案here,但看起来太复杂了。但是,我不确定是否只能使用VectorIndexer
来完成。
答案 0 :(得分:2)
如果要在Spark ML中使用输出,最好使用CountVectorizer
:
from pyspark.ml.feature import CountVectorizer
# Add binary=True if needed
df_enc = (CountVectorizer(inputCol="name", outputCol="name_vector")
.fit(df)
.transform(df))
df_enc.show(truncate=False)
+---------+---+-------------------------+
|name |id |name_vector |
+---------+---+-------------------------+
|[a, b, c]|1 |(4,[0,1,2],[1.0,1.0,1.0])|
|[a, c] |2 |(4,[0,1],[1.0,1.0]) |
|[d] |3 |(4,[3],[1.0]) |
|[b, c] |4 |(4,[1,2],[1.0,1.0]) |
|[a, b, d]|5 |(4,[0,2,3],[1.0,1.0,1.0])|
+---------+---+-------------------------+
否则,收集不同的值:
from pyspark.sql.functions import array_contains, col, explode
names = [
x[0] for x in
df.select(explode("name").alias("name")).distinct().orderBy("name").collect()]
并选择带有array_contains
的列:
df_sep = df.select("*", *[
array_contains("name", name).alias("name_{}".format(name)).cast("integer")
for name in names]
)
df_sep.show()
+---------+---+------+------+------+------+
| name| id|name_a|name_b|name_c|name_d|
+---------+---+------+------+------+------+
|[a, b, c]| 1| 1| 1| 1| 0|
| [a, c]| 2| 1| 0| 1| 0|
| [d]| 3| 0| 0| 0| 1|
| [b, c]| 4| 0| 1| 1| 0|
|[a, b, d]| 5| 1| 1| 0| 1|
+---------+---+------+------+------+------+
答案 1 :(得分:0)
使用pyspark.sql.functions
和this question中的explode
:
from pyspark.sql import functions as F
features = [(['a', 'b', 'c'], 1),
(['a', 'c'], 2),
(['d'], 3),
(['b', 'c'], 4),
(['a', 'b', 'd'], 5)]
df = spark.createDataFrame(features, ['name','id'])
df.show()
+---------+---+
| name| id|
+---------+---+
|[a, b, c]| 1|
| [a, c]| 2|
| [d]| 3|
| [b, c]| 4|
|[a, b, d]| 5|
+---------+---+
df = df.withColumn('exploded', F.explode('name'))
df.drop('name').groupby('id').pivot('exploded').count().show()
+---+----+----+----+----+
| id| a| b| c| d|
+---+----+----+----+----+
| 5| 1| 1|null| 1|
| 1| 1| 1| 1|null|
| 3|null|null|null| 1|
| 2| 1|null| 1|null|
| 4|null| 1| 1|null|
+---+----+----+----+----+
按id
排序,然后将null
转换为0
df.drop('name').groupby('id').pivot('exploded').count().na.fill(0).sort(F.col('id').asc()).show()
+---+---+---+---+---+
| id| a| b| c| d|
+---+---+---+---+---+
| 1| 1| 1| 1| 0|
| 2| 1| 0| 1| 0|
| 3| 0| 0| 0| 1|
| 4| 0| 1| 1| 0|
| 5| 1| 1| 0| 1|
+---+---+---+---+---+
explode
为给定数组或映射中的每个元素返回新行。然后,您可以使用pivot
来“转置”新列。