Data:
Profit Amount Rate Account Status Yr
0.3065 56999 1 Acc3 S1 1
0.3956 57000 1 Acc3 S1 1
0.3065 57001 1 Acc3 S1 1
0.3956 57002 1 Acc3 S1 1
0.3065 57003 1 Acc3 S1 2
0.3065 57004 0.89655 Acc3 S1 3
0.3956 57005 0.89655 Acc3 S1 3
0.2984 57006 0.89655 Acc3 S1 3
0.3956 57007 1 Acc3 S2 2
0.3956 57008 1 Acc3 S2 2
0.2984 57009 1 Acc3 S2 2
0.2984 57010 1 Acc1 S1 1
0.3956 57011 1 Acc1 S1 1
0.3065 57012 1 Acc1 S1 1
0.3065 57013 1 Acc1 S1 1
0.3065 57013 1 Acc1 S1 1
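To make the example reproducible, here is a minimal sketch of loading this sample into a Spark DataFrame; the SparkSession setup, the `sample` name, and the inferred column types are assumptions for illustration, not part of the original pipeline:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# First few rows of the sample above; add the remaining rows the same way.
rows = [
    (0.3065, 56999, 1.0,     'Acc3', 'S1', 1),
    (0.3956, 57000, 1.0,     'Acc3', 'S1', 1),
    (0.3065, 57001, 1.0,     'Acc3', 'S1', 1),
    (0.3956, 57002, 1.0,     'Acc3', 'S1', 1),
    (0.3065, 57003, 1.0,     'Acc3', 'S1', 2),
    (0.3065, 57004, 0.89655, 'Acc3', 'S1', 3),
]
sample = spark.createDataFrame(rows, ['Profit', 'Amount', 'Rate', 'Account', 'Status', 'Yr'])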
Code:
from pyspark.sql.functions import col, concat, trim, when, sum

df = df1\
    .join(df12, (df12.code == df1.code), how='left').drop(df12.code).filter(col('Date') == '20Jan2019')\
    .join(df3, df1.id == df3.id, how='left').drop(df3.id)\
    .join(df4, df1.id == df4.id, how='left').drop(df4.id)\
    .join(df5, df1.id2 == df5.id2, how='left').drop(df5.id2)\
    .withColumn("Account", concat(trim(df3.name1), trim(df4.name1)))\
    .withColumn("Status", when(df1.FB_Ind == 1, "S1").otherwise("S2"))\
    .withColumn('Year', (df1['date'].substr(6, 4)) + df1['Year'])
df6 = df.distinct()
df7 = df6.groupBy('Yr', 'Status', 'Account')\
    .agg(sum((col('Profit') * col('Amount')) / col('Rate')).alias('output'))
The output I receive is a small decimal, e.g. 0.234, rather than a value in the thousands such as 23344.2.
How do I express Sum((Profit*amount)/Rate) as output code in PySpark?
Answer 0 (score: 0)
This is how you should write the code. Also, I don't understand why you are adding df1['Year']?
from pyspark.sql import functions as F

df = df1\
    .join(df12, "code", how='left') \
    .filter(F.col('Date') == '20Jan2019') \
    .join(df3, df1.id == df3.id, how='left') \
    .drop(df3.id) \
    .join(df4, "id", how='left') \
    .join(df5, "id2", how='left') \
    .withColumn("Account", F.concat(F.trim(df3.name1), F.trim(df4.name1))) \
    .withColumn("Status", F.when(df1.FB_Ind == 1, "S1").otherwise("S2")) \
    .withColumn('Year', F.col('date').substr(6, 4).cast('int') + F.col('Year'))
df6 = df.distinct()
# Keep the division by Rate inside the sum so each row is scaled before aggregating.
df7 = df6.groupBy('Yr', 'Status', 'Account') \
    .agg(F.sum((F.col('Profit') * F.col('Amount')) / F.col('Rate')).alias('output'))
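As a quick sanity check, the corrected aggregation can be run directly against the sample table at the top, using the hypothetical `sample` DataFrame sketched under the data. The key detail is that the division by Rate sits inside F.sum, so each row is scaled before aggregation:

from pyspark.sql import functions as F

out = sample.groupBy('Yr', 'Status', 'Account') \
    .agg(F.sum((F.col('Profit') * F.col('Amount')) / F.col('Rate')).alias('output'))
out.show()

# For Yr=1, Status='S1', Account='Acc3' this should give
# 0.3065*56999 + 0.3956*57000 + 0.3065*57001 + 0.3956*57002 (Rate is 1 for all four rows),
# roughly 80040.19 -- a value in the tens of thousands, not a small decimal.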
For details on how to apply groupBy, partitionBy, and other functions in PySpark, see: Analysis using Pyspark