I have a simple Python question. I have a DataFrame of magazine subscriptions that looks like this:
SubId UserID Created Expired
09483 2938 10/9/2018 N/A
03824 3899 10/13/2018 N/A
02853 0838 10/29/2017 10/1/2018
08992 2938 10/2/2018 10/8/2018
I want to create a new boolean column that checks whether the same UserID had an earlier subscription that ended shortly (< 5 days) before this new subscription started:
SubId UserID Created Expired Extension_of_Sub
09483 2938 10/9/2018 N/A 1
03824 3899 10/13/2018 N/A 0
02853 0838 10/29/2017 10/1/2018 0
08992 2938 10/2/2018 10/8/2018 0
How can I do this? In other words, I'm trying to get a more accurate churn count, since a user switching from one magazine to another may not really represent churn the way an outright cancellation does.
Thanks!
Answer 0 (score: 0)
You can achieve this with a join together with the pyspark.sql.functions datediff and when. Please find a commented example below:
from pyspark.sql import functions as F

# The example assumes an existing SparkSession named 'spark'
l = [('09483', '2938', '10/9/2018', None),
     ('03824', '3899', '10/13/2018', None),
     ('02853', '0838', '10/29/2017', '10/1/2018'),
     ('08992', '2938', '10/2/2018', '10/8/2018')]
df = spark.createDataFrame(l, ['SubId', 'UserID', 'Created', 'Expired'])
# This casts the Expired and Created columns to columns of type date
df = df.withColumn("Expired", F.to_date(df.Expired, 'MM/dd/yyyy'))
df = df.withColumn("Created", F.to_date(df.Created, 'MM/dd/yyyy'))
# Our join will create a dataframe with two columns named UserID. This list of new names
# lets us rename the duplicate columns so they can be dropped afterwards.
newNames = ['SubId', 'UserID', 'Created', 'Expired', 'dropUserID', 'dropExpired', 'Extension_of_Sub']
# We create a dataframe with a new column that contains the max Expired date per UserID
tmp = df.select(df.UserID, df.Expired).groupBy(df.UserID).agg(F.max("Expired").alias('maxExpired'))
# This is the join condition, which attaches maxExpired only to rows with running subscriptions
cond = [df.UserID == tmp.UserID, df.Expired.isNull()]
# We use a left join because we also want to keep the expired subscriptions in the dataframe
df = df.join(tmp, cond, 'left')
# ...and finally we can fill the column Extension_of_Sub when the difference between the creation
# date of the new subscription and the expiry date of a former subscription is less than 5 days.
df.withColumn('Extension_of_Sub',
              F.when(F.datediff(df.Created, df.maxExpired) < 5, 1).otherwise(0)) \
  .toDF(*newNames) \
  .drop('dropUserID', 'dropExpired') \
  .show()
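Assuming the dates parse with the MM/dd/yyyy pattern as shown (and allowing for the fact that a join does not guarantee row order), the final show() should reproduce the table from the question, with the dates rendered as date values:

+-----+------+----------+----------+----------------+
|SubId|UserID|   Created|   Expired|Extension_of_Sub|
+-----+------+----------+----------+----------------+
|09483|  2938|2018-10-09|      null|               1|
|03824|  3899|2018-10-13|      null|               0|
|02853|  0838|2017-10-29|2018-10-01|               0|
|08992|  2938|2018-10-02|2018-10-08|               0|
+-----+------+----------+----------+----------------+

Note that F.max("Expired") ignores nulls, and the join condition only attaches maxExpired to rows whose own Expired is null, so a running subscription is never compared against itself.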