Element-wise subtraction of dense vectors in PySpark

Asked: 2018-08-27 10:59:17

标签: pyspark apache-spark-ml

I have a Spark DataFrame in which two columns are dense vectors. For each row of the DataFrame, I want to compute the element-wise difference between the two vectors.

How can I do this?

1 answer:

Answer 0 (score: 2)

Even though you can subtract two dense vectors u and v with u - v, you cannot subtract two columns of dense vectors with col1 - col2.

Therefore, I would use a udf:

from pyspark.sql import functions as F
from pyspark.ml.linalg import DenseVector, VectorUDT

# `sqlContext` is the pre-2.0 entry point; in Spark 2+,
# `spark.createDataFrame` works the same way.
df = sqlContext.createDataFrame([
        [DenseVector([1., 1.]), DenseVector([0., 0.])],
        [DenseVector([1., 1.]), DenseVector([1., 0.])],
        [DenseVector([1., 1.]), DenseVector([1., 1.])]
    ], ['u', 'v'])

# Wrap both vector columns in a single array column, then subtract the
# plain DenseVectors inside the udf; VectorUDT declares the return type.
subtract_vector_udf = F.udf(lambda arr: arr[0] - arr[1], VectorUDT())

df2 = df.select('*', subtract_vector_udf(F.array('u', 'v')).alias('diff'))
df2.show()
>>>
+---------+---------+---------+
|        u|        v|     diff|
+---------+---------+---------+
|[1.0,1.0]|[0.0,0.0]|[1.0,1.0]|
|[1.0,1.0]|[1.0,0.0]|[0.0,1.0]|
|[1.0,1.0]|[1.0,1.0]|[0.0,0.0]|
+---------+---------+---------+