PySpark DataFrame - cannot compute max and min values from an int column

Asked: 2018-07-18 13:51:39

Tags: python dataframe pyspark

I am currently trying to get the maximum and minimum values from a timestamp-difference column of a PySpark DataFrame.

So far, I have been able to compute the timestamp difference between two consecutive rows (see this link for more details; it may be useful to some of us).
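For context, here is a minimal sketch of how such a difference column can be built with a window function. This is an assumption, not the exact code behind that link; tmp_df stands for a hypothetical source DataFrame containing the id and timestamp columns shown below:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Hypothetical reconstruction: partition by id, order by timestamp,
# and subtract the previous row's timestamp obtained via lag().
w = Window.partitionBy("id").orderBy("timestamp")
tmp_df2 = (tmp_df
           .withColumn("prev_timestamp", F.lag("timestamp").over(w))
           .withColumn("time_diff",
                       (F.col("timestamp") - F.col("prev_timestamp")).cast("double")))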

# The DataFrame tmp_df2 is as follows:
tmp_df2.show()

+----+---------+---+----------+--------------+---------+
|  id|    value|_c3| timestamp|prev_timestamp|time_diff|
+----+---------+---+----------+--------------+---------+
|7564|2.70412E7|  0|1498867200|          null|      NaN|
|7564|2.70412E7|  0|1498867800|    1498867200|    600.0|
|7564|2.70404E7|  0|1498868400|    1498867800|    600.0|
|7564|2.70405E7|  0|1498869000|    1498868400|    600.0|
|7564|2.70404E7|  0|1498869600|    1498869000|    600.0|
|7564|2.70403E7|  0|1498870200|    1498869600|    600.0|
|7564|2.70403E7|  0|1498870800|    1498870200|    600.0|
+----+---------+---+----------+--------------+---------+

# Checking the column types gives the following result:
tmp_df2.dtypes

[('id', 'int'), ('value', 'double'), ('_c3', 'int'), ('timestamp', 'int'),
 ('prev_timestamp', 'int'), ('time_diff', 'double')]

I then tried to compute the minimum and maximum of the time_diff column as follows:

from pyspark.sql import functions as F

tmp_max = F.max(tmp_df2.time_diff)
print(tmp_max)

Column<max(time_diff)> #I am expecting the actual max value instead

tmp_min = F.min(tmp_df2.time_diff) 
print(tmp_min)

Column<min(time_diff)> #I am expecting the actual min value instead
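Note: F.max and F.min return unevaluated Column expressions; they only yield concrete values when executed inside an aggregation. A minimal sketch of that pattern (the max_diff/min_diff aliases are illustrative):

from pyspark.sql import functions as F

# Run the expressions inside an aggregation to materialize the values.
row = tmp_df2.agg(
    F.max("time_diff").alias("max_diff"),
    F.min("time_diff").alias("min_diff")
).first()
print(row["max_diff"], row["min_diff"])

# Caveat: Spark orders NaN above every other double, so max() returns NaN
# for this data unless the NaN row is filtered out first, e.g. with
# tmp_df2.where(~F.isnan("time_diff")).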

Any help on this topic would be greatly appreciated.

Thanks :)

# & Last but not least
spark.version
'2.3.0.2.6.5.0-292'

0 Answers:

No answers yet.