Loop and look up a few rows in another table in pyspark

Asked: 2018-10-09 20:15:48

Tags: python pandas pyspark

I have two dataframes. Table 1: items purchased by a user on day 0. Table 2: item prices over x days (they fluctuate daily).

I want to match each purchase to the item's price at the time the user bought it. Is there a better way to do this than looping over every row and applying a function?

For my final output, I want to know: when John bought an apple on 1/1, what was its 3-day rolling_average price?

First table: John's Table (there could be more users)

@Query("SELECT * FROM Product WHERE Name = :name AND Document.name = :documentName")
List<Product> getProducts(String name, String documentName);

参考表:价格表

Date    Item    Price
1/1/2018    Apple   1
2/14/2018   Grapes  1.99
1/25/2018   Pineapple   1.5
5/25/2018   Apple   0.98

Apple示例:

Date    Item    Price
1/1/2018    Apple   1
1/2/2018    Apple   0.98
1/3/2018    Apple   0.88
1/4/2018    Apple   1.2
1/5/2018    Apple   1.3
1/6/2018    Apple   1.5
1/7/2018    Apple   1.05
1/8/2018    Apple   1.025
2/10/2018   Grapes  3.10
2/11/2018   Grapes  0.10
2/12/2018   Grapes  5.00
2/13/2018   Grapes  0.40
2/14/2018   Grapes  1.00
2/15/2018   Grapes  2.70
2/16/2018   Grapes  0.40
2/17/2018   Grapes  0.40
1/23/2018   Pineapple   0.50
1/24/2018   Pineapple   0.60
1/25/2018   Pineapple   0.70
1/26/2018   Pineapple   0.60
1/27/2018   Pineapple   0.60
1/28/2018   Pineapple   0.50
1/29/2018   Pineapple   0.70
1/30/2018   Pineapple   0.50
5/21/2018   Apple   7.00
5/22/2018   Apple   6.00
5/23/2018   Apple   5.00
5/24/2018   Apple   6.00
5/25/2018   Apple   5.00

2 Answers:

Answer 0 (score: 0)

So, if I understand the question correctly, you want to compute a 3-day average for each item. Then you simply join Table 1 with Table 2 to get each purchased item with its rolling average right next to the actual price. You can do this with a window function. In pyspark it could look like this:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# 3-day rolling average per item: the current row plus the two rows that
# follow it in descending date order, i.e. that day and the two days before.
# (For correct ordering, `date` should be a real date type, not a string.)
df_price = df_price.withColumn(
    'rolling_average',
    F.avg(df_price.price).over(
        Window.partitionBy(df_price.item).orderBy(
            df_price.date.desc()
        ).rowsBetween(0, 2)
    )
)

Then you just join your table onto this result. In SQL it would look like this:

WITH b AS (
SELECT '1/1/2018' as date_p,  'Apple' as item, 1 as price
UNION ALL SELECT '1/2/2018' as date_p,  'Apple' as item, 0.98 as price
UNION ALL SELECT '1/3/2018' as date_p,  'Apple' as item, 0.88 as price
UNION ALL SELECT '1/4/2018' as date_p,  'Apple' as item, 1.2 as price
UNION ALL SELECT '1/5/2018' as date_p,  'Apple' as item, 1.3 as price
UNION ALL SELECT '1/6/2018' as date_p,  'Apple' as item, 1.5 as price
UNION ALL SELECT '1/7/2018' as date_p,  'Apple' as item, 1.05 as price
UNION ALL SELECT '1/8/2018' as date_p,  'Apple' as item, 1.025 as price
UNION ALL SELECT '2/10/2018' as date_p, 'Grapes' as item, 3.10 as price)
SELECT *, AVG(price) OVER (
  PARTITION BY item ORDER BY date_p DESC
  ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
) AS rolling_average FROM b
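
Back in pyspark, the join step described above might look like the following rough sketch; df_purchases (standing in for Table 1) and the lowercase column names are assumptions chosen to match df_price:

# Hypothetical sketch: attach each purchase to that day's rolling average.
df_result = df_purchases.join(
    df_price.select('date', 'item', 'rolling_average'),
    on=['date', 'item'],  # match purchase date and item to the price row
    how='left',
)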

Answer 1 (score: 0)

If you only want to group by a specific item (taking the price table as df2):

import pandas as pd

df2['Date'] = pd.to_datetime(df2['Date'])  # parse the 'M/D/YYYY' strings as dates
df2 = df2.set_index('Date')

# Non-Apple rows come out as NaN: the filtered Series aligns back to df2 on the Date index.
df2['Rolling'] = df2[df2['Item']=='Apple']['Price'].rolling(3).mean()

Printing df2[df2['Item']=='Apple'] will produce:

             Item  Price   Rolling
Date                              
2018-01-01  Apple  1.000       NaN
2018-01-02  Apple  0.980       NaN
2018-01-03  Apple  0.880  0.953333
2018-01-04  Apple  1.200  1.020000
2018-01-05  Apple  1.300  1.126667
2018-01-06  Apple  1.500  1.333333
2018-01-07  Apple  1.050  1.283333
2018-01-08  Apple  1.025  1.191667
2018-05-21  Apple  7.000  3.025000
2018-05-22  Apple  6.000  4.675000
2018-05-23  Apple  5.000  6.000000
2018-05-24  Apple  6.000  5.666667
2018-05-25  Apple  5.000  5.333333
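
The snippet above computes the rolling average only for Apple. As a minimal sketch of the obvious generalization (mine, not the answerer's code), groupby can produce it for every item at once, assuming the same df2 as above:

# Sketch: 3-row rolling average per item, for all items at once.
df2['Rolling'] = (
    df2.groupby('Item')['Price']
       .rolling(3).mean()
       .reset_index(level=0, drop=True)  # drop the Item level so values align on Date
)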

If you want to restrict the grouping to certain dates, the answer is slightly different.
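
If that means a calendar-based window, a minimal sketch (my reading, not the answerer's code) could use pandas' time-offset rolling on the DatetimeIndex set above:

# Sketch: a 3-calendar-day window instead of a 3-row window; rows more than
# 3 days before the current one are excluded automatically.
df2['Rolling3D'] = df2[df2['Item']=='Apple']['Price'].rolling('3D').mean()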