How to compute a cumulative sum per group in Python Spark using the RDD API

Date: 2017-03-20 16:41:18

Tags: python apache-spark pyspark rdd

I am new to programming and need help with a Spark Python program. I have the input data shown below and want to compute a cumulative sum for each group. I would appreciate it if someone could guide me on this.

Input data:

11,1,1,100
11,1,2,150
12,1,1,50
12,2,1,70
12,2,2,20

The output data needs to look like this:

11,1,1,100
11,1,2,250 //(100 + 150)
12,1,1,50
12,2,1,70
12,2,2,90 //(70 + 20)

Code I have tried:

def parseline(line):
    fields = line.split(",")
    f1 = float(fields[0])
    f2 = float(fields[1])
    f3 = float(fields[2])
    f4 = float(fields[3])
    return (f1, f2, f3, f4)

input = sc.textFile("file:///...../a.dat")
line = input.map(parseline)
linesorted = line.sortBy(lambda x: (x[0], x[1], x[2]))
runningpremium = linesorted.map(lambda y: ((y[0], y[1]), y[3])).reduceByKey(lambda accum, num: accum + num)

for i in runningpremium.collect():
      print i

2 Answers:

Answer 0 (score: 2)

As noted in the comments, you can compute a cumulative sum on a Spark DataFrame using a window function. First, create an example DataFrame with dummy columns 'a', 'b', 'c', 'd':

ls = [(11,1,1,100), (11,1,2,150), (12,1,1,50), (12,2,1,70), (12,2,2,20)]
ls_rdd = spark.sparkContext.parallelize(ls)
df = spark.createDataFrame(ls_rdd, schema=['a', 'b', 'c', 'd'])

You can partition by columns a and b and order by column c, then apply the sum function to column d:

from pyspark.sql.window import Window
import pyspark.sql.functions as func

w = Window.partitionBy([df['a'], df['b']]).orderBy(df['c'].asc())
df_cumsum = df.select('a', 'b', 'c', func.sum(df.d).over(w).alias('cum_sum'))
df_cumsum.sort(['a', 'b', 'c']).show() # simple sort column

Output:

+---+---+---+-------+
|  a|  b|  c|cum_sum|
+---+---+---+-------+
| 11|  1|  1|    100|
| 11|  1|  2|    250|
| 12|  1|  1|     50|
| 12|  2|  1|     70|
| 12|  2|  2|     90|
+---+---+---+-------+
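
If the goal is to reproduce the comma-separated lines from the question, one option (a minimal sketch that reuses the df_cumsum DataFrame built above; output_lines is just an illustrative variable name) is to map each row of the result back into a string:

# Sketch: turn each row of df_cumsum back into an "a,b,c,cum_sum" line
output_lines = (df_cumsum.sort(['a', 'b', 'c'])
                .rdd
                .map(lambda r: ','.join(str(v) for v in [r['a'], r['b'], r['c'], r['cum_sum']])))

for line in output_lines.collect():
    print(line)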

Answer 1 (score: 0)

Using the DataFrame API:

from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# ids are strings to match the StringType field in the schema below
rdd = sc.parallelize([("11", 100), ("11", 150), ("12", 50), ("12", 70), ("12", 20)])

schema = StructType([
    StructField("id", StringType()),
    StructField("amount", LongType())
    ])

df = spark.createDataFrame(rdd, schema)

df.registerTempTable("amount_table")
df.show()
df2 = spark.sql("SELECT id,amount, sum(amount) OVER (PARTITION BY id ORDER BY amount) as cumulative_sum FROM amount_table")
df2.show()
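
The ORDER BY inside the OVER clause is what turns SUM into a running sum; without it the window would cover the whole partition, and with duplicate amounts the default RANGE frame adds tied rows to the total together. A sketch of the same query with an explicit row-by-row frame, assuming the amount_table view registered above (df3 is just a new variable name):

df3 = spark.sql("""
    SELECT id, amount,
           SUM(amount) OVER (PARTITION BY id
                             ORDER BY amount
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum
    FROM amount_table
""")
df3.show()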

Using the RDD API, try this:

rdd = sc.parallelize([(11, 1, 2, 100), (11, 2, 1, 150), (12, 1, 2, 50), (12, 1, 3, 70), (12, 3, 4, 20)])

def get_key_value(rec):
    # for grouping as key value
    return rec[0], rec[1:]

from itertools import accumulate

def cumsum(values):
    # keep the first two fields of each record and append the running total
    # of the third field
    return [k[0] + [k[1]] for k in zip([[i[0], i[1]] for i in values],
                                       accumulate([i[2] for i in values]))]

print(rdd.map(get_key_value).collect()) # output after get_key_value
print(rdd.map(get_key_value).groupByKey().mapValues(cumsum).flatMapValues(lambda x:x).map(lambda x: [x[0]]+x[1]).collect())

Output:

[(11, (1, 2, 100)), (11, (2, 1, 150)), (12, (1, 2, 50)), (12, (1, 3, 70)), (12, (3, 4, 20))]
[[11, 1, 2, 100], [11, 2, 1, 250], [12, 1, 2, 50], [12, 1, 3, 120], [12, 3, 4, 140]]
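
If the nested comprehension inside cumsum is hard to follow, an equivalent version written as an explicit loop (a sketch; cumsum_verbose is an illustrative name, and it relies on the same assumption that each group's records arrive in input order) looks like this:

def cumsum_verbose(values):
    # values: (col2, col3, amount) tuples for one key; keep the first two
    # fields and append the running total of the amount
    result, total = [], 0
    for col2, col3, amount in values:
        total += amount
        result.append([col2, col3, total])
    return result

Substituting cumsum_verbose for cumsum in the mapValues call above gives the same result.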

A simpler example involving only two columns (two values per record):

from itertools import accumulate

rdd = sc.parallelize([(11, 100), (11, 150), (12, 50), (12, 70), (12, 20)])

def cumsum(values):
    return list(accumulate(values))

print(rdd.groupByKey().mapValues(cumsum).collect())
print(rdd.groupByKey().mapValues(cumsum).flatMapValues(lambda x: x).collect())

Output:

[(11, [100, 250]), (12, [50, 120, 140])]
[(11, 100), (11, 250), (12, 50), (12, 120), (12, 140)]
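
Tying this back to the four-column input from the question, an end-to-end RDD version could look like the sketch below (the running_totals helper and variable names are illustrative; each group is sorted by its third column before accumulating, and the final sortBy only makes the collected output easier to read):

data = sc.parallelize([(11, 1, 1, 100), (11, 1, 2, 150),
                       (12, 1, 1, 50), (12, 2, 1, 70), (12, 2, 2, 20)])

def running_totals(pairs):
    # pairs: (c, d) tuples for one (a, b) group; sort by c, then accumulate d
    out, total = [], 0
    for c, d in sorted(pairs):
        total += d
        out.append((c, total))
    return out

result = (data.map(lambda r: ((r[0], r[1]), (r[2], r[3])))   # key by (a, b)
              .groupByKey()
              .mapValues(running_totals)
              .flatMapValues(lambda x: x)                    # back to one record per row
              .map(lambda kv: (kv[0][0], kv[0][1], kv[1][0], kv[1][1]))
              .sortBy(lambda r: (r[0], r[1], r[2])))

print(result.collect())
# expected: [(11, 1, 1, 100), (11, 1, 2, 250), (12, 1, 1, 50), (12, 2, 1, 70), (12, 2, 2, 90)]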