How to integrate / compute a dot product in PySpark 1.6.1?

Time: 2017-07-28 12:46:51

Tags: pyspark integral dot-product

I have the following table in PySpark 1.6.1:

+--------+-----+--------------------+
|     key|carid|                data|
+--------+-----+--------------------+
|    time|    1|[0.2, 0.4, 0.5, 0...|
|velocity|    1|[2.0, 2.1, 2.3, 0...|
|    time|    2|[0.1, 0.35, 0.4, ...|
|velocity|    2|[1.0, 1.1, 3.3, 0...|
|    time|    3|[0.3, 0.6, 0.7, 0...|
|velocity|    3|[2.3, 2.1, 2.3, 0...|
+--------+-----+--------------------+

That is, I have many cars, and for each car an array of non-equidistant timestamps and an array of velocity values. I want to compute the distance travelled by each car:

+-----+--------+
|carid|distance|
+-----+--------+
|    1|     100|
|    2|     102|
|    3|      85|
+-----+--------+

I want to compute this via trapezoidal numerical integration (or, put simply, scalar_product(diff(timestamp), velocity)). How can I do this in PySpark 1.6.1?
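
For reference, the per-car computation described above reduces to a few NumPy calls. A minimal sketch using car 1's sample values (the variable names t and v are illustrative):

import numpy as np

t = np.array([0.2, 0.4, 0.5])  # non-equidistant timestamps of car 1
v = np.array([2.0, 2.1, 2.3])  # velocities sampled at those timestamps

# Trapezoidal rule: sum over i of (t[i+1] - t[i]) * (v[i] + v[i+1]) / 2
distance_trapz = np.trapz(v, x=t)          # 0.2*2.05 + 0.1*2.2 = 0.63

# The scalar-product formulation pairs diff(t) with one velocity endpoint
# per interval (a left Riemann sum here):
distance_dot = np.dot(np.diff(t), v[:-1])  # 0.2*2.0 + 0.1*2.1 = 0.61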

1 Answer:

Answer 0 (score: -1)

Could you try this code on your real data and let us know whether it solves your problem?

import numpy as np
import pyspark.sql.functions as f
from pyspark.sql.types import FloatType

# Assumes a PySpark shell where sc (SparkContext) and a SQLContext are already available
df = sc.parallelize([
    ['time',     1, [0.2, 0.4, 0.5 ]],
    ['velocity', 1, [2.0, 2.1, 2.3 ]],
    ['time',     2, [0.1, 0.35, 0.4]],
    ['velocity', 2, [1.0, 1.1, 3.3 ]]
]).toDF(('key', 'carid', 'data'))
df.show()

# Sorting by key puts 'time' before 'velocity' for each car, so after the
# groupBy each collected list is [time_array, velocity_array]
df1 = (df.sort('carid', 'key')
         .groupby('carid')
         .agg(f.collect_list('data').alias('timeVelocityPair')))

def modify_values(l):
    # l[0] = timestamps, l[1] = velocities; integrate v dt with the trapezoid rule
    val = np.trapz(l[1], x=l[0])
    return float(val)

modified_val = f.udf(modify_values, FloatType())
final_df = df1.withColumn('distance', modified_val('timeVelocityPair')).drop('timeVelocityPair')
final_df.show()
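
One caveat: collect_list gives no ordering guarantee after a groupBy, so relying on the earlier sort to keep 'time' ahead of 'velocity' inside each list can be fragile. A sketch of an order-independent alternative (same df as above; the names times, velocities, and trapz_udf are illustrative) joins the two rows per car explicitly:

import numpy as np
import pyspark.sql.functions as f
from pyspark.sql.types import FloatType

# UDF integrating velocity over time with the trapezoid rule
trapz_udf = f.udf(lambda t, v: float(np.trapz(v, x=t)), FloatType())

# Split the long table into one row per car for each quantity
times = df.filter(f.col('key') == 'time').select('carid', f.col('data').alias('t'))
velocities = df.filter(f.col('key') == 'velocity').select('carid', f.col('data').alias('v'))

# Equi-join on carid, then integrate each pair of arrays
distances = times.join(velocities, 'carid').select('carid', trapz_udf('t', 'v').alias('distance'))
distances.show()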