What is the correct way to use pyspark VectorAssembler?

Asked: 2019-12-24 03:55:45

Tags: pyspark

I am trying to combine all of my feature columns into a single column.

So:

assembler = VectorAssembler(
    inputCols=feature_list,
    outputCol='features')

where:

feature_list is a Python list containing all of the feature column names.

Then:

trainingData = assembler.transform(df)

But when I do this, I get an error:

[error screenshot omitted]

What is the correct way to use VectorAssembler?

Many thanks

1 Answer:

Answer 0 (score: 0)

Without a stack trace or a sample of your df, it is hard to understand your problem.

But I will still answer, based on the documentation:

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

# Toy DataFrame with numeric columns and an existing vector column
dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])

dataset.show()

# +---+----+------+--------------+-------+
# | id|hour|mobile|  userFeatures|clicked|
# +---+----+------+--------------+-------+
# |  0|  18|   1.0|[0.0,10.0,0.5]|    1.0|
# +---+----+------+--------------+-------+

# Combine the numeric and vector input columns into a single vector column
assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")

output = assembler.transform(dataset)

print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")

output.select("features", "clicked").show(truncate=False)

# +-----------------------+-------+
# |features               |clicked|
# +-----------------------+-------+
# |[18.0,1.0,0.0,10.0,0.5]|1.0    |
# +-----------------------+-------+

Example Source Code
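
Applied back to the question's own setup, a minimal sketch might look like this. The "label" column name and the way feature_list is built are assumptions about your schema; handleInvalid="skip" (available since Spark 2.4) drops rows containing null/NaN values, which is a common cause of VectorAssembler errors:

from pyspark.ml.feature import VectorAssembler

# Assumption: every column of df except a hypothetical 'label' column
# is a numeric feature; adjust to your actual schema.
feature_list = [c for c in df.columns if c != "label"]

assembler = VectorAssembler(
    inputCols=feature_list,
    outputCol="features",
    handleInvalid="skip")  # Spark 2.4+: drop rows with null/NaN inputs

trainingData = assembler.transform(df)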