Looping over a Spark dataframe with Python

Date: 2019-06-18 04:10:26

Tags: apache-spark pyspark apache-spark-sql

I want to loop over a Spark dataframe, check a condition (i.e., whether an aggregate over multiple rows is true/false), and then build a new dataframe. Please see the code outline below; could you help me fix it? I am new to Spark and Python, so I may be going about this the wrong way; any help would be much appreciated.

    # Sort the trades by Instrument and Date (in ascending order)
    dfsorted = df.orderBy('Instrument', 'Date').show()

    # New temporary variable to track the running sum of Quantity
    sumofquantity = 0

    # For each row in dfsorted:
        sumofquantity = sumofquantity + dfsorted['Quantity']

        # Keep appending the looped-over rows to a new dataframe called dftemp
        dftemp = dfsorted  (how to write this?)

        if sumofquantity == 0:
            # Once the sum reaches zero, add a new column with a unique sequential
            # number for all rows in dftemp, and append those rows to the final dataframe
            dffinal = dftemp.withColumn('trade#', assign a unique trade number)

            # Reset the running sum to 0
            sumofquantity = 0

            # Clear dftemp: how do I empty the dataframe so that the next
            # iteration starts from zero rows?

trade_sample.csv (original input file)

Customer ID,Instrument,Action,Date,Price,Quantity 
U16,ADM6,BUY,20160516,0.7337,2
U16,ADM6,SELL,20160516,0.7337,-1
U16,ADM6,SELL,20160516,0.9439,-1
U16,CLM6,BUY,20160516,48.09,1
U16,CLM6,SELL,20160517,48.08,-1
U16,ZSM6,BUY,20160517,48.09,1
U16,ZSM6,SELL,20160518,48.08,-1

Expected result (note the last column; that is all I am trying to add)

Customer ID,Instrument,Action,Date,Price,Quantity,trade#
U16,ADM6,BUY,20160516,0.7337,2,10001
U16,ADM6,SELL,20160516,0.7337,-1,10001 
U16,ADM6,SELL,20160516,0.9439,-1,10001 
U16,CLM6,BUY,20160516,48.09,1,10002 
U16,CLM6,SELL,20160517,48.08,-1,10002 
U16,ZSM6,BUY,20160517,48.09,1,10003 
U16,ZSM6,SELL,20160518,48.08,-1,10003

1 Answer:

Answer 0 (score: 1)

Looping this way is not good practice. You cannot incrementally append to, sum over, or clear a dataframe, because dataframes are immutable. For your problem, you can use Spark's windowing concept. As far as I understand, you want a running sum of Quantity for each Customer ID, and once the sum for one group reaches zero you want to reset sumofquantity to zero. If so, you can partition by Customer ID, order by Instrument and Date, and compute the running sum for each Customer ID. Once you have the running sum, you can check it against your condition and derive trade#.
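
For reference, here is a hypothetical setup that builds the starting dataframe from trade_sample.csv (this part is my assumption, not shown in the original answer; in particular, the rename of "Customer ID" to "Customer_ID" simply matches the column name used in the window code below):

    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.getOrCreate()
    >>> # Renaming "Customer ID" avoids the space in the CSV header and matches
    >>> # the Customer_ID column referenced below (assumed naming)
    >>> df = (spark.read.csv("trade_sample.csv", header=True, inferSchema=True)
    ...            .withColumnRenamed("Customer ID", "Customer_ID"))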

Just refer to the code below:

    >>> from pyspark.sql.window import Window
    >>> from pyspark.sql.functions import row_number, sum
    >>> w = Window.partitionBy("Customer_ID").orderBy("Instrument", "Date")
    >>> w1 = Window.partitionBy("Customer_ID").orderBy("Instrument", "Date", "rn")
    >>> # rn gives every row a unique position, so the running sum over w1
    >>> # advances one row at a time even when Instrument and Date are tied
    >>> dftemp = (df.withColumn("rn", row_number().over(w))
    ...             .withColumn("sumofquantity", sum("Quantity").over(w1))
    ...             .select("Customer_ID", "Instrument", "Action", "Date",
    ...                     "Price", "Quantity", "sumofquantity"))
    >>> dftemp.show()
+-----------+----------+------+--------+------+--------+-------------+
|Customer_ID|Instrument|Action|    Date| Price|Quantity|sumofquantity|
+-----------+----------+------+--------+------+--------+-------------+
|        U16|      ADM6|   BUY|20160516|0.7337|       2|            2|
|        U16|      ADM6|  SELL|20160516|0.7337|      -1|            1|
|        U16|      ADM6|  SELL|20160516|0.9439|      -1|            0|
|        U16|      CLM6|   BUY|20160516| 48.09|       1|            1|
|        U16|      CLM6|  SELL|20160517| 48.08|      -1|            0|
|        U16|      ZSM6|   BUY|20160517| 48.09|       1|            1|
|        U16|      ZSM6|  SELL|20160518| 48.08|      -1|            0|
+-----------+----------+------+--------+------+--------+-------------+
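
The running sum is where the answer above stops. To go from there to the trade# column in the expected output, one option is to flag each row where the running sum returns to zero and count those flags. This is my sketch, continuing the session above; the zero_flag helper column and the 10001 starting number are assumptions taken from the expected result, not from the original answer:

    >>> from pyspark.sql.functions import when, col, lit
    >>> # zero_flag marks the row that closes a trade (running sum back to zero);
    >>> # the number of flags seen before a row (running count minus the row's
    >>> # own flag) tells which trade the row belongs to
    >>> dffinal = (df.withColumn("rn", row_number().over(w))
    ...              .withColumn("sumofquantity", sum("Quantity").over(w1))
    ...              .withColumn("zero_flag",
    ...                          when(col("sumofquantity") == 0, 1).otherwise(0))
    ...              .withColumn("trade#", lit(10001)
    ...                          + sum("zero_flag").over(w1) - col("zero_flag"))
    ...              .drop("rn", "zero_flag", "sumofquantity"))
    >>> dffinal.show()

On the sample data this should produce trade numbers 10001 through 10003, matching the expected result above.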

You can refer to the window function documentation at the links below:

https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html