I want to loop through a Spark dataframe, check a condition (i.e., whether the running total over several rows is true/false), and then create a dataframe. Please look at the code outline below; could you help me fix it? I'm new to Spark and Python, so any help would be much appreciated.
dfsorted = df.orderBy('Instrument', 'Date')  # .show() only prints and returns None, so don't assign it
sumofquantity = 0
# for each row in dfsorted (how do I loop over the rows?):
sumofquantity = sumofquantity + row['Quantity']
dftemp = dfsorted  # (how to write this?)
if sumofquantity == 0:
    dffinal = dftemp.withColumn('trade#', ...)  # assign a unique trade number
    sumofquantity = 0
trade_sample.csv (the original input file):
Customer ID,Instrument,Action,Date,Price,Quantity
U16,ADM6,BUY,20160516,0.7337,2
U16,ADM6,SELL,20160516,0.7337,-1
U16,ADM6,SELL,20160516,0.9439,-1
U16,CLM6,BUY,20160516,48.09,1
U16,CLM6,SELL,20160517,48.08,-1
U16,ZSM6,BUY,20160517,48.09,1
U16,ZSM6,SELL,20160518,48.08,-1
Expected result (note the last column; that is all I am trying to add):
Customer ID,Instrument,Action,Date,Price,Quantity,trade#
U16,ADM6,BUY,20160516,0.7337,2,10001
U16,ADM6,SELL,20160516,0.7337,-1,10001
U16,ADM6,SELL,20160516,0.9439,-1,10001
U16,CLM6,BUY,20160516,48.09,1,10002
U16,CLM6,SELL,20160517,48.08,-1,10002
U16,ZSM6,BUY,20160517,48.09,1,10003
U16,ZSM6,SELL,20160518,48.08,-1,10003
Answer (score: 1)
Looping this way is not good practice. DataFrames are immutable, so you cannot accumulate a sum into one row by row and then clear it. For your problem, you can use Spark's windowing concept.
As I understand it, you want to calculate a running sum of Quantity for each Customer ID, and reset sumofquantity to zero once the sum for one group is complete. If so, you can partition by Customer ID, ordered by Instrument and Date, and compute the cumulative sum for each Customer ID. Once you have the sum, you can check it against your condition to derive trade#.
Just refer to the code below:
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import row_number,col,sum
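>>> # df below is assumed to be loaded from trade_sample.csv; one possible way
>>> # (the read options are an assumption), renaming "Customer ID" so it matches
>>> # the "Customer_ID" column name used in the window definitions:
>>> df = spark.read.csv("trade_sample.csv", header=True, inferSchema=True).withColumnRenamed("Customer ID", "Customer_ID")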
>>> w = Window.partitionBy("Customer_ID").orderBy("Instrument", "Date")
>>> w1 = Window.partitionBy("Customer_ID").orderBy("Instrument", "Date", "rn")
>>> dftemp = df.withColumn("rn", row_number().over(w)).withColumn("sumofquantity", sum("Quantity").over(w1)).select("Customer_ID", "Instrument", "Action", "Date", "Price", "Quantity", "sumofquantity")
>>> dftemp.show()
+-----------+----------+------+--------+------+--------+-------------+
|Customer_ID|Instrument|Action| Date| Price|Quantity|sumofquantity|
+-----------+----------+------+--------+------+--------+-------------+
| U16| ADM6| BUY|20160516|0.7337| 2| 2|
| U16| ADM6| SELL|20160516|0.7337| -1| 1|
| U16| ADM6| SELL|20160516|0.9439| -1| 0|
| U16| CLM6| BUY|20160516| 48.09| 1| 1|
| U16| CLM6| SELL|20160517| 48.08| -1| 0|
| U16| ZSM6| BUY|20160517| 48.09| 1| 1|
| U16| ZSM6| SELL|20160518| 48.08| -1| 0|
+-----------+----------+------+--------+------+--------+-------------+
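The snippet above stops at the running sum; the remaining step is to turn sumofquantity into the trade# column. One way (a sketch, not the only one): flag a row as the start of a new trade whenever the previous row's running sum reached zero, then take a running count of those flags. This assumes dftemp also keeps the rn column from the select above (add "rn" to the select) so the ordering stays deterministic, and that trade numbers simply start at 10001 as in the expected output; new_trade is a hypothetical helper column:

>>> from pyspark.sql.functions import lag, when, lit
>>> # previous row's running sum hit zero -> this row starts a new trade
>>> dffinal = dftemp.withColumn("new_trade", when(lag("sumofquantity", 1, 1).over(w1) == 0, 1).otherwise(0)) \
...                 .withColumn("trade#", lit(10001) + sum("new_trade").over(w1)) \
...                 .drop("new_trade", "sumofquantity", "rn")
>>> dffinal.show()

On the sample data this reproduces the trade# column shown in the expected result above.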
You can read more about window functions at the links below:
https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html