Add a vector column to a PySpark DataFrame

Time: 2018-04-14 15:06:57

Tags: apache-spark dataframe pyspark apache-spark-ml

How do I add a Vectors.dense column to a PySpark DataFrame?

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.linalg import DenseVector

py_df = pd.DataFrame.from_dict({"time": [59., 115., 156., 421.], "event": [1, 1, 1, 0]})

sc = SparkContext(master="local")
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(py_df)
sdf.withColumn("features", DenseVector(1))

This raises an error at line 1848 of anaconda3/lib/python3.6/site-packages/pyspark/sql/dataframe.py:

AssertionError: col should be Column

It doesn't accept the DenseVector type as a column. Essentially, I have a pandas DataFrame that I want to convert to a PySpark DataFrame and then add a column of type Vectors.dense. Is there another way to do this?

1 answer:

Answer 0 (score: 5)

Constant Vectors cannot be added as a literal. You have to use a udf:

from pyspark.sql.functions import udf
from pyspark.ml.linalg import DenseVector, VectorUDT

one = udf(lambda: DenseVector([1]), VectorUDT())
sdf.withColumn("features", one()).show()

But I'm not sure why you would want to do this in the first place. If you want to transform existing columns into Vectors, use the appropriate pyspark.ml tool, such as VectorAssembler - Encode and assemble multiple features in PySpark:

from pyspark.ml.feature import VectorAssembler

VectorAssembler(inputCols=["time"], outputCol="features").transform(sdf)