I am trying to combine all of the feature columns into a single column, so I do:
assembler = VectorAssembler(
    inputCols=feature_list,
    outputCol='features')
where feature_list is a Python list containing the names of all the feature columns.
Then:
trainingData = assembler.transform(df)
But when I do this:
What is the correct way to use VectorAssembler?
Many thanks
Answer 0 (score: 0)
Without a stack trace or a sample df it is hard to understand your problem, but I will still answer based on the documentation:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

# Create (or reuse) a SparkSession; the documentation snippet assumes
# a `spark` object already exists.
spark = SparkSession.builder.getOrCreate()
dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])
dataset.show()
# +---+----+------+--------------+-------+
# | id|hour|mobile| userFeatures|clicked|
# +---+----+------+--------------+-------+
# | 0| 18| 1.0|[0.0,10.0,0.5]| 1.0|
# +---+----+------+--------------+-------+
assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")
output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(truncate=False)
# +-----------------------+-------+
# |features |clicked|
# +-----------------------+-------+
# |[18.0,1.0,0.0,10.0,0.5]|1.0 |
# +-----------------------+-------+