Question

I want to create a new column based on a condition in pyspark. My dataframe:

    id          create_date                txn_date
    1           2019-02-23 23:27:42        2019-08-18 00:00:00
    2           2019-08-24 00:10:18        2019-08-24 00:00:00
    3           2019-09-16 17:47:56        2018-07-23 00:00:00
    4           2019-09-24 01:31:21        2018-05-13 00:00:00
    5           2018-12-26 23:28:09        2019-07-15 00:00:00

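All columns are strings. My condition is:

txn_date >= create_date. Based on this condition, I want to create a new column, "is_mem".

My final dataframe should look like this: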

    id          create_date                txn_date                    is_mem
    1           2019-02-23 23:27:42        2019-08-18 00:00:00           0
    2           2019-08-24 00:10:18        2019-08-24 00:00:00           1
    3           2019-09-16 17:47:56        2018-07-23 00:00:00           1
    4           2019-09-24 01:31:21        2018-05-13 00:00:00           1
    5           2018-12-26 23:28:09        2019-07-15 00:00:00           0

How can I do this in pyspark?

Answer 1

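    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType
    import pyspark.sql.functions as F

    # Reuse the active session (e.g. in the pyspark shell) or start a new one.
    spark = SparkSession.builder.getOrCreate()

    # Both date columns are plain strings, as in the question.
    schema1 = StructType([StructField('id', IntegerType(), True),
                          StructField('create_date', StringType(), True),
                          StructField('txn_date', StringType(), True)])

    data1 = [
        (1, '2019-02-23 23:27:42', '2019-08-18 00:00:00'),
        (2, '2019-08-24 00:10:18', '2019-08-24 00:00:00'),
        (3, '2019-09-16 17:47:56', '2018-07-23 00:00:00'),
        (4, '2019-09-24 01:31:21', '2018-05-13 00:00:00'),
        (5, '2018-12-26 23:28:09', '2019-07-15 00:00:00'),
    ]

    df = spark.createDataFrame(data1, schema1)

    # is_mem is '0' when txn_date >= create_date and '1' otherwise.
    # The columns are compared as strings, which works here because both
    # use the same zero-padded yyyy-MM-dd HH:mm:ss layout.
    df.withColumn("is_mem", F.when(df['txn_date'] >= df['create_date'], '0').otherwise('1')).show()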

    +---+-------------------+-------------------+----------+
    | id|        create_date|           txn_date|    is_mem|
    +---+-------------------+-------------------+----------+
    |  1|2019-02-23 23:27:42|2019-08-18 00:00:00|         0|
    |  2|2019-08-24 00:10:18|2019-08-24 00:00:00|         1|
    |  3|2019-09-16 17:47:56|2018-07-23 00:00:00|         1|
    |  4|2019-09-24 01:31:21|2018-05-13 00:00:00|         1|
    |  5|2018-12-26 23:28:09|2019-07-15 00:00:00|         0|
    +---+-------------------+-------------------+----------+
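
The comparison above is done on string columns, so it is worth checking that it agrees with a real date comparison. The following is only a minimal plain-Python sketch, not part of the original answer. It applies the same rule to the sample rows, once on the raw strings and once on parsed `datetime` values. The helper names `is_mem_from_strings` and `is_mem_from_datetimes` and the format constant `FMT` are hypothetical, and the format is assumed from the sample data.

```python
from datetime import datetime

# Sample rows copied from the question: (id, create_date, txn_date), all strings.
rows = [
    (1, "2019-02-23 23:27:42", "2019-08-18 00:00:00"),
    (2, "2019-08-24 00:10:18", "2019-08-24 00:00:00"),
    (3, "2019-09-16 17:47:56", "2018-07-23 00:00:00"),
    (4, "2019-09-24 01:31:21", "2018-05-13 00:00:00"),
    (5, "2018-12-26 23:28:09", "2019-07-15 00:00:00"),
]

# Assumption: every value follows this zero-padded layout, as in the sample data.
FMT = "%Y-%m-%d %H:%M:%S"


def is_mem_from_strings(create_date: str, txn_date: str) -> str:
    """Rule from the answer: '0' when txn_date >= create_date, else '1'."""
    return "0" if txn_date >= create_date else "1"


def is_mem_from_datetimes(create_date: str, txn_date: str) -> str:
    """Same rule, but comparing parsed datetime values instead of raw text."""
    created = datetime.strptime(create_date, FMT)
    txn = datetime.strptime(txn_date, FMT)
    return "0" if txn >= created else "1"


by_string = [is_mem_from_strings(c, t) for _, c, t in rows]
by_datetime = [is_mem_from_datetimes(c, t) for _, c, t in rows]

print(by_string)    # ['0', '1', '1', '1', '0']
print(by_datetime)  # ['0', '1', '1', '1', '0']
```

Both approaches give the same flags for this data because the strings share one fixed-width, most-significant-first layout. If the real data mixed formats or dropped the zero padding, the string comparison could give wrong results. In that case, converting both columns first (for example with `F.to_timestamp` in PySpark) would probably be the safer route.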
