I have the following table in PySpark 1.6.1:
+--------+-----+--------------------+
| key|carid| data|
+--------+-----+--------------------+
| time| 1|[0.2, 0.4, 0.5, 0...|
|velocity| 1|[2.0, 2.1, 2.3, 0...|
| time| 2|[0.1, 0.35, 0.4, 0..|
|velocity| 2|[1.0, 1.1, 3.3, 0...|
| time| 3|[0.3, 0.6, 0.7, 0...|
|velocity| 3|[2.3, 2.1, 2.3, 0...|
+--------+-----+--------------------+
That is, I have many cars, and for each car I have an array of non-equidistant timestamps and an array of velocity values. I want to compute the distance traveled by each car:
+-----+--------+
|carid|distance|
+-----+--------+
| 1| 100|
| 2| 102|
| 3| 85|
+-----+--------+
I want to compute this via trapezoidal numerical integration (or, more simply, scalar_product(diff(timestamp), velocity)). How can I do this in PySpark 1.6.1?
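For reference, the trapezoidal rule described above can be sketched in plain NumPy, using the (truncated) sample values for car 1 from the table; the helper name trapezoid_distance is mine:

```python
import numpy as np

def trapezoid_distance(t, v):
    # Trapezoidal rule: sum over intervals of (t[i+1] - t[i]) * (v[i] + v[i+1]) / 2,
    # which handles non-equidistant timestamps naturally.
    t, v = np.asarray(t, dtype=float), np.asarray(v, dtype=float)
    return float(np.sum(np.diff(t) * (v[:-1] + v[1:]) / 2.0))

# Sample values for car 1 (first three entries shown in the table above)
print(trapezoid_distance([0.2, 0.4, 0.5], [2.0, 2.1, 2.3]))  # ≈ 0.63
```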
Answer 0 (score: -1)
Could you try this code on your real data and tell us whether it solves your problem?
import numpy as np
import pyspark.sql.functions as f
from pyspark.sql.types import FloatType
df = sc.parallelize([
['time', 1, [0.2, 0.4, 0.5 ]],
['velocity',1, [2.0, 2.1, 2.3 ]],
['time', 2, [0.1, 0.35, 0.4]],
['velocity',2, [1.0, 1.1, 3.3 ]]
]).toDF(('key', 'carid', 'data'))
df.show()
# Sort so that for each carid the 'time' row precedes the 'velocity' row
# ('time' < 'velocity' alphabetically), then collect both arrays per car.
df1 = df.sort('carid', 'key').groupby('carid').agg(f.collect_list('data').alias('timeVelocityPair'))

def modify_values(l):
    # l[0] is the time array, l[1] the velocity array;
    # np.trapz integrates velocity over time with the trapezoidal rule.
    val = np.trapz(l[1], x=l[0])
    return float(val)

modified_val = f.udf(modify_values, FloatType())
final_df = df1.withColumn("distance", modified_val("timeVelocityPair")).drop("timeVelocityPair")
final_df.show()
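Without a Spark cluster at hand, the per-car result this pipeline should produce can be sanity-checked in plain Python. This sketch mimics the groupby/collect step with a dict and applies the same trapezoidal rule; it assumes collect_list sees the rows in the sorted order, which Spark does not strictly guarantee after a groupby, so verifying against a check like this on real data is worthwhile:

```python
rows = [
    ('time', 1, [0.2, 0.4, 0.5]),
    ('velocity', 1, [2.0, 2.1, 2.3]),
    ('time', 2, [0.1, 0.35, 0.4]),
    ('velocity', 2, [1.0, 1.1, 3.3]),
]

def trapz(v, t):
    # Same trapezoidal rule as np.trapz(v, x=t), written out explicitly.
    return sum((t[i + 1] - t[i]) * (v[i] + v[i + 1]) / 2.0
               for i in range(len(t) - 1))

# Group the time/velocity arrays per carid, mimicking collect_list.
pairs = {}
for key, carid, data in rows:
    pairs.setdefault(carid, {})[key] = data

distances = {carid: trapz(d['velocity'], d['time'])
             for carid, d in pairs.items()}
print(distances)  # car 1 ≈ 0.63, car 2 ≈ 0.3725
```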