How to split a pandas DataFrame of ordered records into chunks whose size is not predefined?

Time: 2018-09-06 20:07:36

Tags: python pandas dataframe split

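I have records of user actions, simplified in this example to "purchased" and "other" (TABLE1).

I am trying to add a column "purchase_cycle" whose number identifies a group containing all of a user's actions from just after the previous purchase up to and including the current purchase (for a user's first purchase, from step 1 up to and including that purchase). If a group of actions does not end with "purchased", it does not count as a complete cycle and is assigned NaN.

TABLE1 (blank lines added between users for readability):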

user_id     actions_order   action_category
0043e1a6    1               purchased      
0043e1a6    2               other          
0043e1a6    3               other          

0070f782    1               other          
0070f782    2               other          
0070f782    3               other          
0070f782    4               other          
0070f782    5               other          
0070f782    6               purchased      
0070f782    7               other          
0070f782    8               other          
0070f782    9               other          
0070f782    10              purchased      
0070f782    11              other          
0070f782    12              other          
0070f782    13              other          

008aa58a    1               other          
008aa58a    2               other          
008aa58a    3               other          
008aa58a    4               other          
008aa58a    5               purchased      
008aa58a    6               other          
008aa58a    7               other          
008aa58a    8               other          
008aa58a    9               other          
008aa58a    10              other          
008aa58a    11              other          
008aa58a    12              purchased      
008aa58a    13              other          
008aa58a    14              other          
008aa58a    15              other          

TABLE2 (with purchase cycles):

user_id     actions_order   action_category    purchase_cycle
0043e1a6    1               purchased          1
0043e1a6    2               other              nan
0043e1a6    3               other              nan

0070f782    1               other              1
0070f782    2               other              1
0070f782    3               other              1
0070f782    4               other              1
0070f782    5               other              1
0070f782    6               purchased          1
0070f782    7               other              2
0070f782    8               other              2
0070f782    9               other              2
0070f782    10              purchased          2
0070f782    11              other              nan
0070f782    12              other              nan
0070f782    13              other              nan

008aa58a    1               other              1
008aa58a    2               other              1
008aa58a    3               other              1
008aa58a    4               other              1
008aa58a    5               purchased          1
008aa58a    6               other              2
008aa58a    7               other              2
008aa58a    8               other              2
008aa58a    9               other              2
008aa58a    10              other              2
008aa58a    11              other              2
008aa58a    12              purchased          2
008aa58a    13              other              nan
008aa58a    14              other              nan
008aa58a    15              other              nan

The only thing I could find was James Schinner's answer, but his solution assumes that the chunk size is the same for every group, which is not my case.

def chunk(seq, size):
    # Return a generator of consecutive slices of `seq`, each at most `size` long.
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

for df_chunk in chunk(df, 100):  # 100 is the fixed chunk size
    # your code here
    pass
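
For contrast with the fixed-size slicing above, the variable-sized groups in TABLE2 might be produced by counting purchases per user instead of slicing by position. The following is only a minimal sketch under stated assumptions: the column names are the ones shown in TABLE1, the helper name add_purchase_cycle is hypothetical, and the sample frame contains just the first two users from TABLE1.

```python
import pandas as pd

def add_purchase_cycle(df):
    """Hypothetical helper: return a sorted copy of df with a purchase_cycle column."""
    out = df.sort_values(["user_id", "actions_order"]).copy()
    # 1 on purchase rows, 0 elsewhere.
    is_purchase = out["action_category"].eq("purchased").astype(int)
    # Purchases seen so far for each user, including the current row.
    seen = is_purchase.groupby(out["user_id"]).cumsum()
    # Purchases strictly before the row, plus one: the cycle the row would belong to.
    cycle = seen - is_purchase + 1
    # Total purchases per user; a cycle number above this total was never closed.
    total = is_purchase.groupby(out["user_id"]).transform("sum")
    # Keep the number only for complete cycles, NaN otherwise (dtype becomes float).
    out["purchase_cycle"] = cycle.where(cycle <= total)
    return out

# Sample data: the first two users from TABLE1.
df = pd.DataFrame({
    "user_id": ["0043e1a6"] * 3 + ["0070f782"] * 13,
    "actions_order": list(range(1, 4)) + list(range(1, 14)),
    "action_category": (
        ["purchased", "other", "other"]
        + ["other"] * 5 + ["purchased"]
        + ["other"] * 3 + ["purchased"]
        + ["other"] * 3
    ),
})
result = add_purchase_cycle(df)
print(result)
```

Because the cycle number comes from a per-user cumulative count, the groups can have any length. Rows after a user's last purchase get a number larger than that user's purchase total, so the final where call turns them into NaN.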

0 answers:

No answers