I know we can use window functions in PySpark to calculate a cumulative sum. But window functions are only supported in HiveContext, not in SQLContext. I need to use SQLContext, because HiveContext cannot be run in multiple processes.
Is there an efficient way to calculate a cumulative sum using SQLContext? A simple approach would be to load the data into the driver's memory and use numpy.cumsum, but the downside is that the data has to fit in memory.
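For reference, here is a minimal sketch of the collect-to-driver approach I mean (assuming a DataFrame df with a numeric column named revenue; the names are illustrative):
import numpy as np

# Naive approach: pull the whole column to the driver and cumsum locally.
# Only feasible when the data fits in driver memory.
values = [row.revenue for row in df.select("revenue").collect()]
cumulative = np.cumsum(values)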
Answer 0: (score: 9)
Not sure if this is what you are looking for, but here are two examples of how to calculate a cumulative sum using sqlContext:
First, when you want to partition it by some category:
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql import SQLContext

rdd = sc.parallelize([
    ("Tablet", 6500),
    ("Tablet", 5500),
    ("Cell Phone", 6000),
    ("Cell Phone", 6500),
    ("Cell Phone", 5500)
])
schema = StructType([
    StructField("category", StringType(), False),
    StructField("revenue", LongType(), False)
])
df = sqlContext.createDataFrame(rdd, schema)
df.registerTempTable("test_table")
df2 = sqlContext.sql("""
    SELECT
        category,
        revenue,
        sum(revenue) OVER (PARTITION BY category ORDER BY revenue) AS cumsum
    FROM
        test_table
""")
Output:
[Row(category='Tablet', revenue=5500, cumsum=5500),
Row(category='Tablet', revenue=6500, cumsum=12000),
Row(category='Cell Phone', revenue=5500, cumsum=5500),
Row(category='Cell Phone', revenue=6000, cumsum=11500),
Row(category='Cell Phone', revenue=6500, cumsum=18000)]
Second, when you only want the cumulative sum of a single variable, change df2 to:
df2 = sqlContext.sql("""
    SELECT
        category,
        revenue,
        sum(revenue) OVER (ORDER BY revenue, category) AS cumsum
    FROM
        test_table
""")
Output:
[Row(category='Cell Phone', revenue=5500, cumsum=5500),
Row(category='Tablet', revenue=5500, cumsum=11000),
Row(category='Cell Phone', revenue=6000, cumsum=17000),
Row(category='Cell Phone', revenue=6500, cumsum=23500),
Row(category='Tablet', revenue=6500, cumsum=30000)]
Hope this helps. Using np.cumsum after collecting the data is not efficient, especially if the dataset is large. Another approach you could explore is to use simple RDD transformations such as groupByKey(), then use map to calculate the cumulative sum of each group by some key, and then reduce at the end.
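A rough sketch of that RDD-based idea (my own illustration, not part of the original answer; it assumes the same (category, revenue) pair RDD as above and that each group fits in memory once grouped):
def running_totals(values):
    # Compute a running total over an already-materialized group.
    totals, acc = [], 0
    for v in sorted(values):
        acc += v
        totals.append((v, acc))
    return totals

cumsum_rdd = (rdd
    .groupByKey()                        # (category, iterable of revenues); each group is materialized
    .mapValues(running_totals)           # (category, [(revenue, cumsum), ...])
    .flatMapValues(lambda pairs: pairs)  # (category, (revenue, cumsum))
)
cumsum_rdd.collect()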
Answer 1: (score: 5)
Here is a simple example:
import pyspark
from pyspark.sql import window
import pyspark.sql.functions as sf

sc = pyspark.SparkContext(appName="test")
sqlcontext = pyspark.SQLContext(sc)

data = sqlcontext.createDataFrame([("Bob", "M", "Boston", 1, 20),
                                   ("Cam", "F", "Cambridge", 1, 25),
                                   ("Lin", "F", "Cambridge", 1, 25),
                                   ("Cat", "M", "Boston", 1, 20),
                                   ("Sara", "F", "Cambridge", 1, 15),
                                   ("Jeff", "M", "Cambridge", 1, 25),
                                   ("Bean", "M", "Cambridge", 1, 26),
                                   ("Dave", "M", "Cambridge", 1, 21)],
                                  ["name", "gender", "city", "donation", "age"])
data.show()
gives the output:
+----+------+---------+--------+---+
|name|gender| city|donation|age|
+----+------+---------+--------+---+
| Bob| M| Boston| 1| 20|
| Cam| F|Cambridge| 1| 25|
| Lin| F|Cambridge| 1| 25|
| Cat| M| Boston| 1| 20|
|Sara| F|Cambridge| 1| 15|
|Jeff| M|Cambridge| 1| 25|
|Bean| M|Cambridge| 1| 26|
|Dave| M|Cambridge| 1| 21|
+----+------+---------+--------+---+
Define a window:
win_spec = (window.Window
            .partitionBy(['gender', 'city'])
            .rowsBetween(window.Window.unboundedPreceding, 0))
# window.Window.unboundedPreceding -- the first row of the group
# .rowsBetween(..., 0) -- the 0 refers to the current row; if -2 is specified
# instead, the frame extends to at most 2 rows before the current row
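To illustrate that last point, here is a small variant of my own (not from the original answer) that uses rowsBetween(-2, 0) to sum only the current row and the two rows before it within each partition; the column names reuse the data frame defined above:
# Sliding 3-row window: 2 preceding rows plus the current row, per gender/city.
sliding_spec = (window.Window
                .partitionBy(['gender', 'city'])
                .orderBy('age')
                .rowsBetween(-2, 0))
data.withColumn('last3_donations', sf.sum(data.donation).over(sliding_spec)).show()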
Now, here is the gotcha:
temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))
which gives an error:
TypeErrorTraceback (most recent call last)
<ipython-input-9-b467d24b05cd> in <module>()
----> 1 temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))
/Users/mupadhye/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.pyc in __iter__(self)
238
239 def __iter__(self):
--> 240 raise TypeError("Column is not iterable")
241
242 # string methods
TypeError: Column is not iterable
This is because Python's built-in sum function is used instead of PySpark's. The way to fix this is to use the sum function from pyspark.sql.functions:
temp = data.withColumn('CumSumDonation', sf.sum(data.donation).over(win_spec))
temp.show()
will give:
+----+------+---------+--------+---+--------------+
|name|gender| city|donation|age|CumSumDonation|
+----+------+---------+--------+---+--------------+
|Sara| F|Cambridge| 1| 15| 1|
| Cam| F|Cambridge| 1| 25| 2|
| Lin| F|Cambridge| 1| 25| 3|
| Bob| M| Boston| 1| 20| 1|
| Cat| M| Boston| 1| 20| 2|
|Dave| M|Cambridge| 1| 21| 1|
|Jeff| M|Cambridge| 1| 25| 2|
|Bean| M|Cambridge| 1| 26| 3|
+----+------+---------+--------+---+--------------+
Answer 2: (score: 1)
After landing on this thread while trying to solve a similar problem, I solved my issue using this code. Not sure if I'm missing part of what the OP asked, but this is a way to sum a SQLContext column:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext

sc = SparkContext()
sc.setLogLevel("ERROR")

conf = SparkConf()
conf.setAppName('Sum SQLContext Column')
conf.set("spark.executor.memory", "2g")

sqlContext = SQLContext(sc)

def sum_column(table, column):
    sc_table = sqlContext.table(table)
    return sc_table.agg({column: "sum"})

sum_column("db.tablename", "column").show()
Answer 3: (score: 0)
Window functions are not limited to HiveContext; you can use them even with sqlContext:
from pyspark.sql.window import *
import pyspark.sql.functions as F

myPartition = Window.partitionBy(['col1', 'col2', 'col3'])

# Use PySpark's sum (not Python's built-in sum), as noted in the answer above.
temp = temp.withColumn("#dummy", F.sum(temp.col4).over(myPartition))