I want to get the last observation for each stock at the end of every minute. My high-frequency data frame looks like:
+-----+-------+------+--------+--------+-------+
|stock| date  | hour | minute | second | price |
+-----+-------+------+--------+--------+-------+
| VOD | 01-02 |  10  |   13   |   11   | 85.35 |
| VOD | 01-02 |  10  |   13   |   12   | 85.75 |
| VOD | 01-02 |  10  |   14   |   09   | 84.35 |
| VOD | 01-02 |  10  |   14   |   16   | 82.85 |
| VOD | 01-02 |  10  |   14   |   26   | 85.65 |
| VOD | 01-02 |  10  |   15   |   07   | 84.35 |
| ... |  ...  | ...  |  ...   |  ...   |  ...  |
| ABC | 01-02 |  11  |   13   |   11   | 25.35 |
| ABC | 01-02 |  11  |   13   |   15   | 25.39 |
| ABC | 01-02 |  11  |   13   |   19   | 25.26 |
+-----+-------+------+--------+--------+-------+
The desired output should look like:
+-----+-------+------+--------+-------+
|stock| date  | hour | minute | price |
+-----+-------+------+--------+-------+
| VOD | 01-02 |  10  |   13   | 85.75 |
| VOD | 01-02 |  10  |   14   | 85.65 |
| VOD | 01-02 |  10  |   15   | 84.35 |
| VOD | 01-02 |  10  |   16   | 85.75 |
| ... |  ...  | ...  |  ...   |  ...  |
| ABC | 01-02 |  11  |   13   | 25.26 |
+-----+-------+------+--------+-------+
I know I probably have to use the partitionBy and orderBy syntax to get this result, but I'm confused about the two. I'm familiar with the groupby function in SQL, and I'd like to know which of the two is more analogous to it. Can anyone help?
Answer 0 (score: 1)
We can build the new frame with a window function, partitioning by 'stock', 'date', 'hour', 'minute'. Within each partition we order by the second column in descending order, then select only the first row of each window.
Example:
df.show()
#+-----+-----+----+------+------+-----+
#|stock| date|hour|minute|second|price|
#+-----+-----+----+------+------+-----+
#| VOD|01-02| 10| 13| 11|85.35|
#| VOD|01-02| 10| 13| 12|85.75|
#| VOD|01-02| 10| 14| 09|84.35|
#| VOD|01-02| 10| 14| 16|82.85|
#| VOD|01-02| 10| 14| 26|85.65|
#+-----+-----+----+------+------+-----+
from pyspark.sql.window import Window
from pyspark.sql.functions import *
w = Window.partitionBy('stock', 'date', 'hour', 'minute').orderBy(desc('second'))
#add a row number within each (stock, date, hour, minute) window
df.withColumn("rn",row_number().over(w)).show()
#+-----+-----+----+------+------+-----+---+
#|stock| date|hour|minute|second|price| rn|
#+-----+-----+----+------+------+-----+---+
#| VOD|01-02| 10| 13| 12|85.75| 1|
#| VOD|01-02| 10| 13| 11|85.35| 2|
#| VOD|01-02| 10| 14| 26|85.65| 1|
#| VOD|01-02| 10| 14| 16|82.85| 2|
#| VOD|01-02| 10| 14| 09|84.35| 3|
#+-----+-----+----+------+------+-----+---+
#then keep only the first row per window, i.e. the latest second, since we ordered descending
df.withColumn("rn",row_number().over(w)).filter(col("rn") == 1).drop("second","rn").show()
#+-----+-----+----+------+-----+
#|stock| date|hour|minute|price|
#+-----+-----+----+------+-----+
#| VOD|01-02| 10| 13|85.75|
#| VOD|01-02| 10| 14|85.65|
#+-----+-----+----+------+-----+
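As for which operation resembles SQL's GROUP BY: groupBy is the direct analogue, and the same result can also be computed as a plain aggregation instead of a window. A minimal sketch, assuming the same df as above (this snippet and the alias last are illustrative, not part of the original answer); it relies on Spark comparing structs field by field, so max(struct('second', 'price')) returns the struct belonging to the latest second:
from pyspark.sql import functions as F

#per minute, the struct with the largest 'second' carries the last observed price
#assumes 'second' is numeric or zero-padded, so the ordering is correct
(df.groupBy('stock', 'date', 'hour', 'minute')
   .agg(F.max(F.struct('second', 'price')).alias('last'))
   .select('stock', 'date', 'hour', 'minute', F.col('last.price').alias('price'))
   .show())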
Answer 1 (score: 0)
After some trial and error, it seems I found a solution: create a column with the cumulative sum of price within each minute, then keep the row where that cumulative sum is largest. Since prices are positive, the cumulative sum peaks at the last observation of the minute.
import pyspark.sql.functions as psf
from pyspark.sql.window import Window

# create a column named subgroup with the cumulative sum of price per minute
w1 = Window.partitionBy('stock', 'date', 'hour', 'minute').orderBy('second')
df1 = df[['stock', 'date', 'hour', 'minute', 'second', 'price']].withColumn('subgroup', psf.sum('price').over(w1))
df1.orderBy(['stock', 'date', 'hour', 'minute', 'second']).show()

# get the row with the largest cumulative price in each minute
# (the partition must exclude 'second', otherwise every row is its own maximum)
w = Window.partitionBy('stock', 'date', 'hour', 'minute')
df3 = df1.withColumn('max', psf.max('subgroup').over(w)).where(psf.col('subgroup') == psf.col('max')).drop('max')
df3 = df3.orderBy(['stock', 'date', 'hour', 'minute', 'second']).drop('subgroup')
df3 = df3.withColumnRenamed('price', 'lastprice')  # rename
df3.show()
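A side note on the design choice: the cumulative sum marks the last row only because prices are positive. A window frame spanning the whole minute expresses the intent directly; the following is a minimal sketch, not part of the original answer, reusing psf and Window from above (w2 and lastprice are illustrative names):
#the frame must span the whole partition: with orderBy, the default frame ends at the current row
w2 = (Window.partitionBy('stock', 'date', 'hour', 'minute')
            .orderBy('second')
            .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
(df.withColumn('lastprice', psf.last('price').over(w2))
   .select('stock', 'date', 'hour', 'minute', 'lastprice')
   .distinct()
   .show())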