pyspark groupBy - multiplication and division give the wrong result

Date: 2020-05-29 22:48:08

Tags: pyspark group-by apache-spark-sql aggregate-functions

Data:

Profit  Amount  Rate    Account Status  Yr
0.3065  56999   1   Acc3    S1  1
0.3956  57000   1   Acc3    S1  1
0.3065  57001   1   Acc3    S1  1
0.3956  57002   1   Acc3    S1  1
0.3065  57003   1   Acc3    S1  2
0.3065  57004   0.89655 Acc3    S1  3
0.3956  57005   0.89655 Acc3    S1  3
0.2984  57006   0.89655 Acc3    S1  3
0.3956  57007   1   Acc3    S2  2
0.3956  57008   1   Acc3    S2  2
0.2984  57009   1   Acc3    S2  2
0.2984  57010   1   Acc1    S1  1
0.3956  57011   1   Acc1    S1  1
0.3065  57012   1   Acc1    S1  1
0.3065  57013   1   Acc1    S1  1
0.3065  57013   1   Acc1    S1  1

Code:

df = df1\
.join(df12,(df12.code == df1.code),how = 'left').drop(df12.code).filter(col('Date') == '20Jan2019')\
.join(df3,df1.id== df3.id,how = 'left').drop(df3.id)\
.join(df4,df1.id == df4.id,how = 'left').drop(df4.id)\
.join(df5,df1.id2 == df5.id2,how ='left').drop(df5.id2)\
.withColumn("Account",concat(trim(df3.name1),trim(df4name1)))\
.withColumn("Status",when(df1.FB_Ind == 1,"S1").otherwise("S2"))\
.withColumn('Year',((df1['date'].substr(6, 4))+df1['Year']))

df6 = df.distinct()
df7 = df6.groupBy('Yr','Status','Account')\
.agg(sum((Profit * amount)/Rate).alias('output'))

The output I get is a small decimal, e.g. 0.234, instead of a value in the thousands such as 23344.2. How do I express Sum((Profit*amount)/Rate) in PySpark so that the aggregation produces the correct output column?
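
For example, a quick check in plain Python on the first four sample rows (the Acc3 / S1 / Yr=1 group), assuming the intended formula is the per-group sum of (Profit * Amount) / Rate, gives a value around 80040:

rows = [
    (0.3065, 56999, 1.0),
    (0.3956, 57000, 1.0),
    (0.3065, 57001, 1.0),
    (0.3956, 57002, 1.0),
]
expected = sum((profit * amount) / rate for profit, amount, rate in rows)
print(expected)  # ~80040.19, i.e. tens of thousands, not a decimal like 0.234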

1 Answer:

Answer 0 (score: 0)

This is how you should write the code. Also, I don't understand why you are adding df1['Year'].

from pyspark.sql import functions as F

df = df1\
.join(df12, "code", how='left') \
.filter(F.col('Date') == '20Jan2019') \
.join(df3, df1.id == df3.id, how='left') \
.drop(df3.id)\
.join(df4, "id", how='left') \
.join(df5, "id2", how='left') \
.withColumn("Account", F.concat(F.trim(df3.name1), F.trim(df4.name1)))\
.withColumn("Status", F.when(df1.FB_Ind == 1, "S1").otherwise("S2"))\
.withColumn('Year', F.substring(F.col('date'), 6, 4) + F.col('Year'))

df6 = df.distinct()
# The division by Rate must happen per row, inside the sum, not on the aggregated total
df7 = df6.groupBy('Yr', 'Status', 'Account')\
         .agg(F.sum((F.col("Profit") * F.col("Amount")) / F.col("Rate")).alias('output'))
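
For illustration, here is a minimal self-contained sketch (assuming the column names from the sample data above and a local SparkSession, not the asker's actual pipeline) showing that the division by Rate belongs inside F.sum, applied per row:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# a few rows copied from the sample data in the question
data = [
    (0.3065, 56999, 1.0,     "Acc3", "S1", 1),
    (0.3956, 57000, 1.0,     "Acc3", "S1", 1),
    (0.3065, 57004, 0.89655, "Acc3", "S1", 3),
    (0.3956, 57007, 1.0,     "Acc3", "S2", 2),
]
sample = spark.createDataFrame(data, ["Profit", "Amount", "Rate", "Account", "Status", "Yr"])

result = sample.groupBy("Yr", "Status", "Account")\
    .agg(F.sum((F.col("Profit") * F.col("Amount")) / F.col("Rate")).alias("output"))

result.show()
# output comes out in the tens of thousands (about 40019 for Yr=1 / S1 / Acc3 here),
# not a small decimal like 0.234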

For details on how to apply groupBy, partitionBy, and other functions in PySpark, see: Analysis using Pyspark