Newly created column shows null values in a PySpark DataFrame

Time: 2020-10-14 20:21:26

Tags: pyspark apache-spark-sql

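I want to add a column that computes the time difference between two timestamp values. To do that, I first define the current date and time, here called current_datetime: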

import datetime
#define current datetime
now = datetime.datetime.now()
#Getting Current date and time
current_datetime=now.strftime("%Y-%m-%d %H:%M:%S")
print(now)

Then I want to add current_datetime to df as a column value and calculate the difference:

import pyspark.sql.functions as F

productsDF = productsDF\
.withColumn('current_time', when(col('Quantity')>1, current_datetime))\
.withColumn('time_diff',\
    (F.unix_timestamp(F.to_timestamp(F.col('current_time')))) - 
    (F.unix_timestamp(F.to_timestamp(F.col('Created_datetime'))))/F.lit(3600)
)

But the output contains only null values.

productsDF.select('current_time','Created_datetime','time_diff').show()

+------------+-------------------+---------+
|current_time|   Created_datetime|time_diff|
+------------+-------------------+---------+
|        null|2019-10-12 17:09:18|     null|
|        null|2019-12-03 07:02:07|     null|
|        null|2020-01-16 23:10:08|     null|
|        null|2020-01-21 15:38:39|     null|
|        null|2020-01-21 15:14:55|     null|

The new columns are created with string and double types:

 |-- current_time: string (nullable = true)
 |-- diff: double (nullable = true)
 |-- time_diff: double (nullable = true)

I also tried creating columns with string and literal values as a test, but the output is always null. What am I missing?

1 answer:

Answer 0 (score: 1)

To fill a column with current_datetime, you are missing the lit() function:

from pyspark.sql.functions import lit

current_datetime = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
productsDF = productsDF.withColumn("current_time", lit(current_datetime))

To calculate the time difference between two timestamp columns, you can do the following:

productsDF.withColumn('time_diff',(F.unix_timestamp('current_time') -
    F.unix_timestamp('Created_datetime'))/3600).show()

Edit:

For the time difference in hours, days, months, and years, you can do the following:

from pyspark.sql.functions import col, datediff, months_between, year

df.withColumn('time_diff_hours',(F.unix_timestamp('current_time') - F.unix_timestamp('Created_datetime'))/3600)\
    .withColumn("time_diff_days", datediff(col("current_time"),col("Created_datetime")))\
    .withColumn("time_diff_months", months_between(col("current_time"),col("Created_datetime")))\
    .withColumn("time_diff_years", year(col("current_time")) - year(col("Created_datetime"))).show()

+-------------------+-------------------+------------------+--------------+----------------+---------------+
|   Created_datetime|       current_time|   time_diff_hours|time_diff_days|time_diff_months|time_diff_years|
+-------------------+-------------------+------------------+--------------+----------------+---------------+
|2019-10-12 17:09:18|2020-10-15 02:45:49|  8841.60861111111|           369|     12.07743093|              1|
|2019-12-03 07:02:07|2020-10-15 02:45:49|7602.7283333333335|           317|     10.38135529|              1|
|2020-01-16 23:10:08|2020-10-15 02:45:49| 6530.594722222222|           273|      8.94031549|              0|
+-------------------+-------------------+------------------+--------------+----------------+---------------+

If you want the exact time difference, then:
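For reference, the elapsed time behind the first row of the table can be checked outside Spark with plain Python datetime arithmetic. The sketch below is an illustration only and is not part of the original answer. The helper `to_epoch` is a hypothetical name, it treats the strings as UTC, and Spark's own time zone handling is not modeled. It also shows why the subtraction has to be parenthesized before dividing by 3600: in the question's expression the division applies only to the second operand.

```python
import calendar
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def to_epoch(ts: str) -> int:
    """Hypothetical helper: parse a timestamp string as UTC and return epoch seconds."""
    return calendar.timegm(datetime.strptime(ts, FMT).timetuple())


# Values taken from the first row of the output table.
created = "2019-10-12 17:09:18"
current = "2020-10-15 02:45:49"

# Exact elapsed time as a timedelta (whole days plus hours, minutes, seconds).
delta = datetime.strptime(current, FMT) - datetime.strptime(created, FMT)
print(delta)

# Subtract first, then divide: the difference expressed in hours.
hours = (to_epoch(current) - to_epoch(created)) / 3600
print(round(hours, 4))

# Dividing only the second operand (a - b / 3600) leaves a value close to
# the raw epoch seconds of `current`, which is not a number of hours.
mixed = to_epoch(current) - to_epoch(created) / 3600
print(mixed)
```

The hour value agrees with the time_diff_hours column. The timedelta reports 368 full days, while datediff in the table reports 369 because it counts calendar dates rather than elapsed 24-hour periods.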