Session duration per user per product

Time: 2016-05-15 07:07:04

Tags: python numpy pandas dataset data-analysis

I have a dataset like this:

users                                  b      kk        timstamp  product
8fa683e59c02c04cb781ac689686db07    start   1.46276E+12 00:00.0 55107008    
335644267c1d5f04eaea7bc6f51b1861    start   1.46276E+12 00:00.0 55107008    
ca3071aad676bc963795a2b09635cdf0    stop    1.46277E+12 00:00.0 55107008    
17412dec7d3d02c9b0b1c3d1c3571c5c    stop    1.46276E+12 00:00.0 10655437    
f81167c854f1a0c86cab6188f9995824    start   1.46276E+12 00:00.1 55107008    
17412dec7d3d02c9b0b1c3d1c3571c5c    start   1.46276E+12 00:00.1 10655437    
a2659df45c8d05f326225fa5b1063ac9    start   1.46276E+12 00:00.1 30900473    
b8bbef76f8dfee2fe190a283cd5a19a7    start   1.46276E+12 00:00.1 18121481    
e8ebfc3f39512eda3aa0702b13ffed63    start   1.46276E+12 00:00.1 18121481    
988e4873861347113519fbee6dd1c3b0    start   1.46276E+12 00:00.2 55107008    
583361d66ad8b0827cd08d3a5d64af89    stop    1.46276E+12 00:00.2 55107008    

users, b, time and product are the columns.

I have to determine each user's session for each product. A session is defined as the difference between the timestamps of stop and start. Keep in mind:

there can be many users buying the same product,
each customer can have bought more than one product.

Here the timestamp includes the date and time, e.g. (5/9/2016 2:00:00 AM).

1 answer:

Answer 0: (score: 3)

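You can use pivot_table. The output for the sample has many NaN values (because a start or stop value is missing), but I think it will work well with the real data: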

# the values shown below (1.46276e+12) come from the 'kk' column
df1 = pd.pivot_table(df, index=['users','product'], columns='b', values='kk').reset_index()
print(df1)
b                             users   product         start          stop
0  17412dec7d3d02c9b0b1c3d1c3571c5c  10655437  1.462760e+12  1.462760e+12
1  335644267c1d5f04eaea7bc6f51b1861  55107008  1.462760e+12           NaN
2  583361d66ad8b0827cd08d3a5d64af89  55107008           NaN  1.462760e+12
3  8fa683e59c02c04cb781ac689686db07  55107008  1.462760e+12           NaN
4  988e4873861347113519fbee6dd1c3b0  55107008  1.462760e+12           NaN
5  a2659df45c8d05f326225fa5b1063ac9  30900473  1.462760e+12           NaN
6  b8bbef76f8dfee2fe190a283cd5a19a7  18121481  1.462760e+12           NaN
7  ca3071aad676bc963795a2b09635cdf0  55107008           NaN  1.462770e+12
8  e8ebfc3f39512eda3aa0702b13ffed63  18121481  1.462760e+12           NaN
9  f81167c854f1a0c86cab6188f9995824  55107008  1.462760e+12           NaN
df1['diff'] = df1['start'] - df1['stop'] 
print(df1)
b                             users   product         start          stop  \
0  17412dec7d3d02c9b0b1c3d1c3571c5c  10655437  1.462760e+12  1.462760e+12   
1  335644267c1d5f04eaea7bc6f51b1861  55107008  1.462760e+12           NaN   
2  583361d66ad8b0827cd08d3a5d64af89  55107008           NaN  1.462760e+12   
3  8fa683e59c02c04cb781ac689686db07  55107008  1.462760e+12           NaN   
4  988e4873861347113519fbee6dd1c3b0  55107008  1.462760e+12           NaN   
5  a2659df45c8d05f326225fa5b1063ac9  30900473  1.462760e+12           NaN   
6  b8bbef76f8dfee2fe190a283cd5a19a7  18121481  1.462760e+12           NaN   
7  ca3071aad676bc963795a2b09635cdf0  55107008           NaN  1.462770e+12   
8  e8ebfc3f39512eda3aa0702b13ffed63  18121481  1.462760e+12           NaN   
9  f81167c854f1a0c86cab6188f9995824  55107008  1.462760e+12           NaN   

b  diff  
0   0.0  
1   NaN  
2   NaN  
3   NaN  
4   NaN  
5   NaN  
6   NaN  
7   NaN  
8   NaN  
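
If the kk column holds epoch milliseconds (an assumption on my part; 1.46276E+12 would then fall on 9 May 2016), the numbers can be turned into datetimes before subtracting, so the result is a Timedelta instead of a raw float. A minimal sketch with a hypothetical two-row sample (the user id `u1`, the values and the `duration` column name are made up for illustration), subtracting as stop minus start to follow the definition in the question:

```python
import pandas as pd

# Hypothetical sample: one start and one stop event for the same user and product.
# 'kk' is ASSUMED to be epoch milliseconds.
df = pd.DataFrame({
    'users': ['u1', 'u1'],
    'product': [10655437, 10655437],
    'b': ['start', 'stop'],
    'kk': [1462760000000, 1462760030000],
})

# One row per (user, product), with 'start' and 'stop' as columns.
wide = df.pivot_table(index=['users', 'product'], columns='b',
                      values='kk', aggfunc='first').reset_index()

# Convert milliseconds to datetimes, then take stop - start.
wide['start'] = pd.to_datetime(wide['start'], unit='ms')
wide['stop'] = pd.to_datetime(wide['stop'], unit='ms')
wide['duration'] = wide['stop'] - wide['start']
print(wide)
```

For this sample the single row gets a duration of 30 seconds. Rows where either event is missing would get NaT.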

Edit:

You first have to convert the column timstamp with to_datetime, using its format parameter, and then add aggfunc='first' to pivot_table so that it aggregates by first:

df['timstamp'] = pd.to_datetime(df['timstamp'], format='%H:%M.%S')

df1 = pd.pivot_table(df, index=['users','product'], columns='b', values='timstamp', aggfunc='first').reset_index()

print(df1)
b                             users   product               start  \
0  17412dec7d3d02c9b0b1c3d1c3571c5c  10655437 1900-01-01 00:00:01   
1  335644267c1d5f04eaea7bc6f51b1861  55107008 1900-01-01 00:00:00   
2  583361d66ad8b0827cd08d3a5d64af89  55107008                 NaT   
3  8fa683e59c02c04cb781ac689686db07  55107008 1900-01-01 00:00:00   
4  988e4873861347113519fbee6dd1c3b0  55107008 1900-01-01 00:00:02   
5  a2659df45c8d05f326225fa5b1063ac9  30900473 1900-01-01 00:00:01   
6  b8bbef76f8dfee2fe190a283cd5a19a7  18121481 1900-01-01 00:00:01   
7  ca3071aad676bc963795a2b09635cdf0  55107008                 NaT   
8  e8ebfc3f39512eda3aa0702b13ffed63  18121481 1900-01-01 00:00:01   
9  f81167c854f1a0c86cab6188f9995824  55107008 1900-01-01 00:00:01   

b                stop  
0 1900-01-01 00:00:00  
1                 NaT  
2 1900-01-01 00:00:02  
3                 NaT  
4                 NaT  
5                 NaT  
6                 NaT  
7 1900-01-01 00:00:00  
8                 NaT  
9                 NaT  
df1['diff'] = df1['start'] - df1['stop'] 
print(df1)
b                             users   product               start  \
0  17412dec7d3d02c9b0b1c3d1c3571c5c  10655437 1900-01-01 00:00:01   
1  335644267c1d5f04eaea7bc6f51b1861  55107008 1900-01-01 00:00:00   
2  583361d66ad8b0827cd08d3a5d64af89  55107008                 NaT   
3  8fa683e59c02c04cb781ac689686db07  55107008 1900-01-01 00:00:00   
4  988e4873861347113519fbee6dd1c3b0  55107008 1900-01-01 00:00:02   
5  a2659df45c8d05f326225fa5b1063ac9  30900473 1900-01-01 00:00:01   
6  b8bbef76f8dfee2fe190a283cd5a19a7  18121481 1900-01-01 00:00:01   
7  ca3071aad676bc963795a2b09635cdf0  55107008                 NaT   
8  e8ebfc3f39512eda3aa0702b13ffed63  18121481 1900-01-01 00:00:01   
9  f81167c854f1a0c86cab6188f9995824  55107008 1900-01-01 00:00:01   

b                stop     diff  
0 1900-01-01 00:00:00 00:00:01  
1                 NaT      NaT  
2 1900-01-01 00:00:02      NaT  
3                 NaT      NaT  
4                 NaT      NaT  
5                 NaT      NaT  
6                 NaT      NaT  
7 1900-01-01 00:00:00      NaT  
8                 NaT      NaT  
9                 NaT      NaT  

EDIT1:

I created a new sample with the new datetime format:

import pandas as pd

df = pd.DataFrame({'kk': {0: 1462760000000.0, 1: 1462760000000.0, 2: 1462770000000.0, 3: 1462760000000.0, 4: 1462760000000.0, 5: 1462760000000.0, 6: 1462760000000.0, 7: 1462760000000.0, 8: 1462760000000.0, 9: 1462760000000.0, 10: 1462760000000.0}, 
'product': {0: 55107008, 1: 55107008, 2: 55107008, 3: 10655437, 4: 55107008, 5: 10655437, 6: 30900473, 7: 18121481, 8: 18121481, 9: 55107008, 10: 55107008}, 
'b': {0: 'start', 1: 'start', 2: 'stop', 3: 'stop', 4: 'start', 5: 'start', 6: 'start', 7: 'start', 8: 'start', 9: 'start', 10: 'stop'}, 
'users': {0: '8fa683e59c02c04cb781ac689686db07', 1: '335644267c1d5f04eaea7bc6f51b1861', 2: 'ca3071aad676bc963795a2b09635cdf0', 3: '17412dec7d3d02c9b0b1c3d1c3571c5c', 4: 'f81167c854f1a0c86cab6188f9995824', 5: '17412dec7d3d02c9b0b1c3d1c3571c5c', 6: 'a2659df45c8d05f326225fa5b1063ac9', 7: 'b8bbef76f8dfee2fe190a283cd5a19a7', 8: 'e8ebfc3f39512eda3aa0702b13ffed63', 9: '988e4873861347113519fbee6dd1c3b0', 10: '583361d66ad8b0827cd08d3a5d64af89'}, 
'timstamp': {0: '5/9/2016 2:00:00', 1: '5/9/2016 2:00:00', 2: '5/9/2016 2:00:00', 3: '5/9/2016 2:00:00', 4: '5/9/2016 2:00:00', 5: '5/9/2016 3:00:00', 6: '5/9/2016 2:00:00', 7: '5/9/2016 2:00:00', 8: '5/9/2016 2:00:00', 9: '5/9/2016 2:00:00', 10: '5/9/2016 2:00:00'}})
print(df)
        b            kk   product          timstamp  \
0   start  1.462760e+12  55107008  5/9/2016 2:00:00   
1   start  1.462760e+12  55107008  5/9/2016 2:00:00   
2    stop  1.462770e+12  55107008  5/9/2016 2:00:00   
3    stop  1.462760e+12  10655437  5/9/2016 2:00:00   
4   start  1.462760e+12  55107008  5/9/2016 2:00:00   
5   start  1.462760e+12  10655437  5/9/2016 3:00:00   
6   start  1.462760e+12  30900473  5/9/2016 2:00:00   
7   start  1.462760e+12  18121481  5/9/2016 2:00:00   
8   start  1.462760e+12  18121481  5/9/2016 2:00:00   
9   start  1.462760e+12  55107008  5/9/2016 2:00:00   
10   stop  1.462760e+12  55107008  5/9/2016 2:00:00   

                               users  
0   8fa683e59c02c04cb781ac689686db07  
1   335644267c1d5f04eaea7bc6f51b1861  
2   ca3071aad676bc963795a2b09635cdf0  
3   17412dec7d3d02c9b0b1c3d1c3571c5c  
4   f81167c854f1a0c86cab6188f9995824  
5   17412dec7d3d02c9b0b1c3d1c3571c5c  
6   a2659df45c8d05f326225fa5b1063ac9  
7   b8bbef76f8dfee2fe190a283cd5a19a7  
8   e8ebfc3f39512eda3aa0702b13ffed63  
9   988e4873861347113519fbee6dd1c3b0  
10  583361d66ad8b0827cd08d3a5d64af89 
df['timstamp'] = pd.to_datetime(df['timstamp'], format='%m/%d/%Y %H:%M:%S')
df1 = pd.pivot_table(df, index=['users','product'], columns='b', values='timstamp', aggfunc='first').reset_index()
df1['diff'] = df1['start'] - df1['stop'] 
print(df1)
b                             users   product               start  \
0  17412dec7d3d02c9b0b1c3d1c3571c5c  10655437 2016-05-09 03:00:00   
1  335644267c1d5f04eaea7bc6f51b1861  55107008 2016-05-09 02:00:00   
2  583361d66ad8b0827cd08d3a5d64af89  55107008                 NaT   
3  8fa683e59c02c04cb781ac689686db07  55107008 2016-05-09 02:00:00   
4  988e4873861347113519fbee6dd1c3b0  55107008 2016-05-09 02:00:00   
5  a2659df45c8d05f326225fa5b1063ac9  30900473 2016-05-09 02:00:00   
6  b8bbef76f8dfee2fe190a283cd5a19a7  18121481 2016-05-09 02:00:00   
7  ca3071aad676bc963795a2b09635cdf0  55107008                 NaT   
8  e8ebfc3f39512eda3aa0702b13ffed63  18121481 2016-05-09 02:00:00   
9  f81167c854f1a0c86cab6188f9995824  55107008 2016-05-09 02:00:00   

b                stop     diff  
0 2016-05-09 02:00:00 01:00:00  
1                 NaT      NaT  
2 2016-05-09 02:00:00      NaT  
3                 NaT      NaT  
4                 NaT      NaT  
5                 NaT      NaT  
6                 NaT      NaT  
7 2016-05-09 02:00:00      NaT  
8                 NaT      NaT  
9                 NaT      NaT
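
Note that aggfunc='first' keeps only one start and one stop per user and product. If the real data can contain several sessions for the same pair (the question does not say either way, so this is an assumption), one possible approach is to number the events with groupby and cumcount, so that the n-th start is matched with the n-th stop. A rough sketch with made-up data (the users `u1`/`u2`, the `session` and `duration` column names are hypothetical), again using stop minus start as in the question's definition:

```python
import pandas as pd

# Hypothetical events: u1 has two sessions on product 1, u2 has a start without a stop.
events = pd.DataFrame({
    'users': ['u1', 'u1', 'u1', 'u1', 'u2'],
    'product': [1, 1, 1, 1, 1],
    'b': ['start', 'stop', 'start', 'stop', 'start'],
    'timstamp': pd.to_datetime([
        '2016-05-09 02:00:00', '2016-05-09 02:05:00',
        '2016-05-09 03:00:00', '2016-05-09 03:01:30',
        '2016-05-09 02:00:00',
    ]),
})

# Order events in time, then give the n-th start and the n-th stop
# of each (user, product) the same session number.
events = events.sort_values('timstamp')
events['session'] = events.groupby(['users', 'product', 'b']).cumcount()

# One row per (user, product, session) with 'start' and 'stop' columns.
sessions = (events
            .pivot_table(index=['users', 'product', 'session'],
                         columns='b', values='timstamp', aggfunc='first')
            .reset_index())
sessions['duration'] = sessions['stop'] - sessions['start']
print(sessions)
```

This pairing is only valid if starts and stops strictly alternate for each user and product. An unmatched start or stop, like the one for u2 here, ends up with NaT in the duration column.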