I'm new to programming and need help with a Spark Python program. Given the input data below, I want a running (cumulative) sum within each group. I'd appreciate some guidance on this.
Input:
11,1,1,100
11,1,2,150
12,1,1,50
12,2,1,70
12,2,2,20
Expected output:
11,1,1,100
11,1,2,250 //(100 + 150)
12,1,1,50
12,2,1,70
12,2,2,90 //(70 + 20)
The code I've tried:
def parseline(line):
    # parse one CSV line into four numeric fields
    fields = line.split(",")
    f1 = float(fields[0])
    f2 = float(fields[1])
    f3 = float(fields[2])
    f4 = float(fields[3])
    return (f1, f2, f3, f4)

input = sc.textFile("file:///...../a.dat")
line = input.map(parseline)
linesorted = line.sortBy(lambda x: (x[0], x[1], x[2]))
runningpremium = linesorted.map(lambda y: ((y[0], y[1]), y[3])).reduceByKey(lambda accum, num: accum + num)
for i in runningpremium.collect():
    print(i)
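For reference, the map/reduceByKey line above cannot produce a running sum on its own: reduceByKey collapses every (f1, f2) key into a single total. A minimal sketch of that behaviour, assuming the same five records held in memory (the in-memory list is only for illustration):

# hypothetical in-memory copy of the input above, just to show what reduceByKey returns
data = [(11, 1, 1, 100), (11, 1, 2, 150), (12, 1, 1, 50), (12, 2, 1, 70), (12, 2, 2, 20)]
totals = sc.parallelize(data).map(lambda y: ((y[0], y[1]), y[3])).reduceByKey(lambda a, b: a + b)
print(totals.collect())
# one total per group (ordering may vary), e.g. [((11, 1), 250), ((12, 1), 50), ((12, 2), 90)]
# whereas the expected output needs one running value per input record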
Answer 0 (score: 2)
As suggested in the comments, you can compute a cumulative sum on a Spark DataFrame with a window function. First, build a DataFrame with dummy columns 'a', 'b', 'c', 'd':
ls = [(11,1,1,100), (11,1,2,150), (12,1,1,50), (12,2,1,70), (12,2,2,20)]
ls_rdd = spark.sparkContext.parallelize(ls)
df = spark.createDataFrame(ls_rdd, schema=['a', 'b', 'c', 'd'])
You can partition by columns a and b and order by column c, then apply the sum function to the d column:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
w = Window.partitionBy([df['a'], df['b']]).orderBy(df['c'].asc())
df_cumsum = df.select('a', 'b', 'c', func.sum(df.d).over(w).alias('cum_sum'))
df_cumsum.sort(['a', 'b', 'c']).show()  # sort only so the output is easy to read
Output:
+---+---+---+-------+
| a| b| c|cum_sum|
+---+---+---+-------+
| 11| 1| 1| 100|
| 11| 1| 2| 250|
| 12| 1| 1| 50|
| 12| 2| 1| 70|
| 12| 2| 2| 90|
+---+---+---+-------+
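The same window can also be applied directly to a CSV file like the one in the question; a minimal sketch, assuming a placeholder file path and that the four columns map to 'a', 'b', 'c', 'd':

from pyspark.sql.window import Window
import pyspark.sql.functions as func

# "file:///path/to/a.dat" is a placeholder; substitute the real location of a.dat
raw = spark.read.csv("file:///path/to/a.dat", inferSchema=True).toDF('a', 'b', 'c', 'd')
w = Window.partitionBy('a', 'b').orderBy('c')
raw.select('a', 'b', 'c', func.sum('d').over(w).alias('cum_sum')).sort('a', 'b', 'c').show()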
Answer 1 (score: 0)
Using the DataFrame API:
from pyspark.sql.types import StructType, StructField, LongType
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([(11, 100), (11, 150), (12, 50), (12, 70), (12, 20)])
schema = StructType([
    StructField("id", LongType()),      # the ids are integers; a StringType field would fail schema validation here
    StructField("amount", LongType())
])
df = spark.createDataFrame(rdd, schema)
df.createOrReplaceTempView("amount_table")  # registerTempTable is deprecated
df.show()
df2 = spark.sql("SELECT id, amount, sum(amount) OVER (PARTITION BY id ORDER BY amount) AS cumulative_sum FROM amount_table")
df2.show()
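One caveat with the query above: when ORDER BY is present, Spark's default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows with the same amount inside one id would all receive the same cumulative value. A sketch of the same query with an explicit row-based frame, reusing the amount_table view from above:

df3 = spark.sql("""
    SELECT id, amount,
           sum(amount) OVER (PARTITION BY id ORDER BY amount
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum
    FROM amount_table
""")
df3.show()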
Using the RDD API, try this:
from itertools import accumulate

rdd = sc.parallelize([(11, 1, 2, 100), (11, 2, 1, 150), (12, 1, 2, 50), (12, 1, 3, 70), (12, 3, 4, 20)])

def get_key_value(rec):
    # split each record into (key, rest-of-record) for grouping
    return rec[0], rec[1:]

def cumsum(values):
    # keep the first two fields of every value and pair them with a running total of the third
    return [k[0] + [k[1]] for k in zip([[i[0], i[1]] for i in values],
                                       accumulate([i[2] for i in values]))]

print(rdd.map(get_key_value).collect())  # output after get_key_value
print(rdd.map(get_key_value).groupByKey().mapValues(cumsum)
         .flatMapValues(lambda x: x).map(lambda x: [x[0]] + x[1]).collect())
Output:
[(11, (1, 2, 100)), (11, (2, 1, 150)), (12, (1, 2, 50)), (12, (1, 3, 70)), (12, (3, 4, 20))]
[[11, 1, 2, 100], [11, 2, 1, 250], [12, 1, 2, 50], [12, 1, 3, 120], [12, 3, 4, 140]]
A simpler example involving only two columns (two values per record):
rdd = sc.parallelize([(11, 100), (11, 150), (12, 50), (12, 70), (12, 20)])

from itertools import accumulate

def cumsum(values):
    return list(accumulate(values))

print(rdd.groupByKey().mapValues(cumsum).collect())
print(rdd.groupByKey().mapValues(cumsum).flatMapValues(lambda x: x).collect())
Output:
[(11, [100, 250]), (12, [50, 120, 140])]
[(11, 100), (11, 250), (12, 50), (12, 120), (12, 140)]
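To reproduce the exact four-column output the question asks for with this RDD pattern, one option is to group on the composite key (f1, f2), sort each group by f3, and accumulate f4. A minimal sketch, assuming the same records as in the question (the helper name cumsum_group is illustrative):

records = sc.parallelize([(11, 1, 1, 100), (11, 1, 2, 150), (12, 1, 1, 50), (12, 2, 1, 70), (12, 2, 2, 20)])

def cumsum_group(values):
    # sort each group's (f3, f4) pairs by f3, then keep a running total of f4
    out, total = [], 0
    for f3, f4 in sorted(values):
        total += f4
        out.append((f3, total))
    return out

result = (records.map(lambda r: ((r[0], r[1]), (r[2], r[3])))
                 .groupByKey()
                 .mapValues(cumsum_group)
                 .flatMapValues(lambda x: x)
                 .map(lambda kv: (kv[0][0], kv[0][1], kv[1][0], kv[1][1])))
print(result.collect())
# e.g. (group order may vary): [(11, 1, 1, 100), (11, 1, 2, 250), (12, 1, 1, 50), (12, 2, 1, 70), (12, 2, 2, 90)]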