Question

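How can I roll a 12-month window within a groupby and return the unique values for each row (preferably in a list)?

Currently I have a pandas DataFrame like the one below. I would like to group it by itemId and replace sellerId with a list of the unique sellerIds seen over the past 12 months (based on effectiveDate). effectiveDate is in month-end format. Basically, for each itemId in each month, I want to see which unique seller IDs it had over the past 12 months.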

            itemId   sellerId   effectiveDate
    1975245 2585893  31280      2005-12-31
    1975246 2585893  31280      2006-02-28
    1975247 2585893  5407       2006-06-30
    1975248 2585893  5407       2006-08-31
    1975249 2585893  5407       2006-09-30
    1975250 2585893  5407       2006-11-30
    1975254 2585893  5407       2007-05-31
    1975257 2585893  5407       2007-06-30
    1975258 2585893  5407       2007-07-31
    1975259 2585893  5407       2008-03-31
    ...

I would like to turn it into the following:

            itemId  uniqueSellerIds effectiveDate
    1975245 2585893 [31280]         2005-12-31
    1975246 2585893 [31280]         2006-02-28
    1975247 2585893 [5407,31280]    2006-06-30
    1975248 2585893 [5407,31280]    2006-08-31
    ...

I tried using groupby followed by rolling, but it did not work. Thanks for your help.

Answer 1

How about using dt.year?

# Group by calendar year and itemId, collecting each group's sellerIds into a list
new_df = df.groupby([df["effectiveDate"].dt.year, df["itemId"]])["sellerId"].agg(list).to_frame()

print(new_df)
                                    sellerId
effectiveDate     itemId                      
2005              1975245 2585893  [31280]
2006              1975246 2585893  [31280]
                  1975247 2585893   [5407]
                  1975248 2585893   [5407]
                  1975249 2585893   [5407]
                  1975250 2585893   [5407]
2007              1975254 2585893   [5407]
                  1975257 2585893   [5407]
                  1975258 2585893   [5407]
2008              1975259 2585893   [5407]

Answer 2

I modified the original DataFrame to this:

    itemId          sellerId   effectiveDate
    19752572585893  31280      2005-12-31
    19752572585893  31280      2006-02-28
    19752592585894  31280      2008-01-31
    19752592585894  5407       2007-07-31
    19752592585894  5407       2008-03-31
    19752592585894  5407       2008-01-31

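From there, I filter down to the most recent year for each itemId:

import pandas as pd

df['effectiveDate'] = pd.to_datetime(df['effectiveDate'])
# Boolean mask: rows within 365 days of each itemId's latest effectiveDate.
# Taking .values assumes df is already sorted by itemId, as in the sample above.
filtered = df[df.groupby(by=['itemId']).apply(lambda g: 
                                              g['effectiveDate'] >= 
                                              g['effectiveDate'].max() - 
                                              pd.Timedelta(days=365)).values]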
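
The filter above takes the raw `.values` of the grouped result, so it depends on the row order of `df`. Here is a minimal sketch of an alternative that aligns on the index through `transform` instead, so the rows can be in any order. The shuffled row order and the extra row with sellerId 9999 are hypothetical, added only to show a row being dropped.

```python
import pandas as pd

# Rows deliberately out of itemId order; the 9999 row is a made-up old record.
df = pd.DataFrame({
    "itemId": [19752592585894, 19752572585893, 19752592585894,
               19752572585893, 19752592585894],
    "sellerId": [5407, 31280, 31280, 31280, 9999],
    "effectiveDate": pd.to_datetime(
        ["2008-03-31", "2005-12-31", "2008-01-31", "2006-02-28", "2006-01-31"]
    ),
})

# transform("max") returns one value per original row, aligned on df's index.
latest = df.groupby("itemId")["effectiveDate"].transform("max")

# Keep rows within 365 days of the latest effectiveDate of their own itemId.
keep = df["effectiveDate"] >= latest - pd.Timedelta(days=365)
filtered = df[keep]

print(filtered)
```

Because `latest` shares its index with `df`, the comparison lines up row by row and no sorting is needed beforehand.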

Then I combine the sellerIds like this:

compressed = filtered.groupby(by=['itemId'])['sellerId'].agg(lambda x: x.unique().tolist())

What remains is to get the max date and join it back onto the filtered data:

max_dates = filtered.groupby(by=['itemId'])['effectiveDate'].max()
modified_df = pd.concat([compressed,max_dates],axis=1)

Result:

                     sellerId effectiveDate
itemId                                     
19752572585893        [31280]    2006-02-28
19752592585894  [31280, 5407]    2008-03-31
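
The result above has one row per itemId, covering only the latest 12 months. If a list is needed on every row, as in the question, something like the following might work. It is only a sketch under some assumptions: the helper name `trailing_unique_sellers` is made up, the window is taken as the 12 month-ends up to and including the row's own date, and the lists are sorted to match the order shown in the question. The data is a subset of the question's sample.

```python
import pandas as pd

df = pd.DataFrame({
    "itemId": [2585893] * 5,
    "sellerId": [31280, 31280, 5407, 5407, 5407],
    "effectiveDate": pd.to_datetime(
        ["2005-12-31", "2006-02-28", "2006-06-30", "2006-08-31", "2007-05-31"]
    ),
})

def trailing_unique_sellers(group, months=12):
    """For each row of one itemId, list the unique sellerIds in the trailing window."""
    out = []
    for date in group["effectiveDate"]:
        # Assumes month-end dates: step back `months` month-ends, exclusive bound.
        start = date - pd.offsets.MonthEnd(months)
        in_window = (group["effectiveDate"] > start) & (group["effectiveDate"] <= date)
        out.append(sorted(group.loc[in_window, "sellerId"].unique().tolist()))
    return pd.Series(out, index=group.index)

# Process each itemId separately, then align the pieces back on the original index.
parts = [trailing_unique_sellers(g) for _, g in df.groupby("itemId")]
df["uniqueSellerIds"] = pd.concat(parts)

print(df[["itemId", "uniqueSellerIds", "effectiveDate"]])
```

This loops over every row within each group, so the cost grows quadratically with group size. That is probably fine for a few rows per item, but a large table may need something smarter.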
