I am currently trying to get the maximum and minimum values out of a timestamp-difference column in a PySpark DataFrame.
So far I have been able to compute the timestamp difference between two rows (see this link for more details, which may be useful to some of you).
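For reference, the diff itself was computed roughly along these lines; a minimal reconstruction, assuming a lag() over a window partitioned by id and ordered by timestamp (the window spec and the source name tmp_df are my assumptions, the exact code is in the linked post):
from pyspark.sql import functions as F
from pyspark.sql import Window

# Reconstructed sketch -- the exact code is in the linked post.
# Partition by id and order by timestamp so lag() picks up the previous row.
w = Window.partitionBy("id").orderBy("timestamp")
tmp_df2 = (tmp_df  # tmp_df: hypothetical name for the source DataFrame
           .withColumn("prev_timestamp", F.lag("timestamp").over(w))
           .withColumn("time_diff",
                       (F.col("timestamp") - F.col("prev_timestamp")).cast("double")))
# Note: this cast yields null for the first row rather than the NaN shown
# below, so the original expression handled that edge case differently.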
# The DataFrame tmp_df2 is as follows:
tmp_df2.show()
+----+---------+---+----------+--------------+---------+
| id| value|_c3| timestamp|prev_timestamp|time_diff|
+----+---------+---+----------+--------------+---------+
|7564|2.70412E7| 0|1498867200| null| NaN|
|7564|2.70412E7| 0|1498867800| 1498867200| 600.0|
|7564|2.70404E7| 0|1498868400| 1498867800| 600.0|
|7564|2.70405E7| 0|1498869000| 1498868400| 600.0|
|7564|2.70404E7| 0|1498869600| 1498869000| 600.0|
|7564|2.70403E7| 0|1498870200| 1498869600| 600.0|
|7564|2.70403E7| 0|1498870800| 1498870200| 600.0|
+----+---------+---+----------+--------------+---------+
# Checking the column types gives the following result:
tmp_df2.dtypes
[('id', 'int'), ('value', 'double'), ('_c3', 'int'), ('timestamp', 'int'),
 ('prev_timestamp', 'int'), ('time_diff', 'double')]
I then tried to compute the min and max of the time_diff column using the following:
from pyspark.sql import functions as F
tmp_max = F.max(tmp_df2.time_diff)
print(tmp_max)
Column<max(time_diff)>  # I am expecting the actual max value instead
tmp_min = F.min(tmp_df2.time_diff)
print(tmp_min)
Column<min(time_diff)>  # I am expecting the actual min value instead
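As far as I can tell, F.max and F.min only build unevaluated Column expressions, so presumably they have to run inside an aggregation before a plain Python number comes back; the sketch below is what I have in mind (the agg/collect combination and the NaN filtering are my assumptions):
# Evaluate the expressions through an aggregation to materialize the values.
# Spark sorts NaN above every other double, so the NaN in the first row
# would otherwise win max(); filter it out first.
row = (tmp_df2
       .filter(~F.isnan("time_diff"))
       .agg(F.max("time_diff").alias("max_diff"),
            F.min("time_diff").alias("min_diff"))
       .collect()[0])
print(row["max_diff"], row["min_diff"])  # 600.0 600.0 for the rows shown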
Any help on this topic would be greatly appreciated.
Thanks :)
# And last but not least, the Spark version:
spark.version
'2.3.0.2.6.5.0-292'