How to get monthly totals using Spark in Python

Date: 2018-09-28 16:17:20

Tags: apache-spark pyspark apache-spark-sql

I am looking for a way to aggregate my data by month. First, I want to keep only the month of each visit date. My DataFrame looks like this:

Row(visitdate = 1/1/2013,
    patientid = P1_Pt1959,
    amount = 200,
    note = jnut)

My goal is then to group by the visit date and compute the sum of the amounts. I tried this:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

file_path = "G:/Visit Data.csv"
patients = spark.read.csv(file_path, header=True)
patients.createOrReplaceTempView("visitdate")

sqlDF = spark.sql("SELECT visitdate, SUM(amount) AS totalamount FROM visitdate GROUP BY visitdate")
sqlDF.show()

Here is the result:

+----------+-----------+
| visitdate|totalamount|
+----------+-----------+
|  9/1/2013|    10800.0|
|25/04/2013|    12440.0|
|27/03/2014|    16930.0|
|26/03/2015|    18560.0|
|14/05/2013|    13770.0|
|30/06/2013|    13880.0|
+----------+-----------+
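
Note that spark.read.csv without inferSchema loads every column, including visitdate and amount, as a string, so the GROUP BY above groups on the raw day-level date strings rather than on months. A quick check against the patients DataFrame defined above (the column order here is taken from the sample row):

patients.printSchema()
# root
#  |-- visitdate: string (nullable = true)
#  |-- patientid: string (nullable = true)
#  |-- amount: string (nullable = true)
#  |-- note: string (nullable = true)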

My goal is to get something like this instead:

+----------+-----------+
| visitdate|totalamount|
+----------+-----------+
|  1/1/2013|    10800.0|
|  1/2/2013|    12440.0|
|  1/3/2013|    16930.0|
|  1/4/2014|    18560.0|
|  1/5/2015|    13770.0|
|  1/6/2015|    13880.0|
+----------+-----------+

1 Answer:

Answer 0 (score: 0)

You need to truncate the dates down to the month so that they group correctly, and then do the group/sum. There is a Spark function that does this for you, called date_trunc. For example:

from datetime import date
from pyspark.sql.functions import date_trunc, sum

data = [
    (date(2000, 1, 2), 1000),
    (date(2000, 1, 2), 2000),
    (date(2000, 2, 3), 3000),
    (date(2000, 2, 4), 4000),
]

# createDataFrame accepts a plain Python list directly; no sc.parallelize needed
df = spark.createDataFrame(data, ["date", "amount"])

# Truncate each date to the first of its month, then sum the amounts per month
df.groupBy(date_trunc("month", df.date)).agg(sum("amount")).show()

+-----------------------+-----------+
|date_trunc(month, date)|sum(amount)|
+-----------------------+-----------+
|    2000-01-01 00:00:00|       3000|
|    2000-02-01 00:00:00|       7000|
+-----------------------+-----------+
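
Applied to the question's data there is one extra step: visitdate is a string, so it has to be parsed into a real date with to_date before date_trunc can work on it. A sketch along those lines (the d/M/yyyy pattern is an assumption based on sample values such as 25/04/2013, and amount is cast explicitly because the CSV loads it as a string):

from pyspark.sql.functions import col, date_format, date_trunc, sum as sum_, to_date

monthly = (
    patients
    # Parse the raw strings into proper dates; "d/M/yyyy" is assumed
    # from sample values like 1/1/2013 and 25/04/2013.
    .withColumn("visitdate", to_date(col("visitdate"), "d/M/yyyy"))
    # Truncate every date to the first day of its month.
    .withColumn("month", date_trunc("month", col("visitdate")))
    .groupBy("month")
    # Cast explicitly: the CSV was read without inferSchema, so amount is a string.
    .agg(sum_(col("amount").cast("double")).alias("totalamount"))
    .orderBy("month")
)

# Format the month back into the d/M/yyyy style shown in the desired output.
monthly.select(
    date_format(col("month"), "d/M/yyyy").alias("visitdate"),
    "totalamount",
).show()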