I have a dataset like this:
users b kk timstamp product
8fa683e59c02c04cb781ac689686db07 start 1.46276E+12 00:00.0 55107008
335644267c1d5f04eaea7bc6f51b1861 start 1.46276E+12 00:00.0 55107008
ca3071aad676bc963795a2b09635cdf0 stop 1.46277E+12 00:00.0 55107008
17412dec7d3d02c9b0b1c3d1c3571c5c stop 1.46276E+12 00:00.0 10655437
f81167c854f1a0c86cab6188f9995824 start 1.46276E+12 00:00.1 55107008
17412dec7d3d02c9b0b1c3d1c3571c5c start 1.46276E+12 00:00.1 10655437
a2659df45c8d05f326225fa5b1063ac9 start 1.46276E+12 00:00.1 30900473
b8bbef76f8dfee2fe190a283cd5a19a7 start 1.46276E+12 00:00.1 18121481
e8ebfc3f39512eda3aa0702b13ffed63 start 1.46276E+12 00:00.1 18121481
988e4873861347113519fbee6dd1c3b0 start 1.46276E+12 00:00.2 55107008
583361d66ad8b0827cd08d3a5d64af89 stop 1.46276E+12 00:00.2 55107008
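users, b, timstamp, and product are columns.
I have to determine the session of each user for each product. A session is defined as the difference between the timestamps of stop and start.
Keep in mind:
there can be many users buying the same product,
each customer can have bought more than one product.
The timestamp here includes both date and time, for example (5/9/2016 2:00:00 AM).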
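Answer 0 (score: 3)
You can use pivot_table. With the sample, the output contains many NaN values (because the matching start or stop value is missing), but I think it will work well with real data: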
# kk is the numeric timestamp column whose values appear in the output below
df1 = pd.pivot_table(df, index=['users','product'], columns='b', values='kk').reset_index()
print(df1)
b users product start stop
0 17412dec7d3d02c9b0b1c3d1c3571c5c 10655437 1.462760e+12 1.462760e+12
1 335644267c1d5f04eaea7bc6f51b1861 55107008 1.462760e+12 NaN
2 583361d66ad8b0827cd08d3a5d64af89 55107008 NaN 1.462760e+12
3 8fa683e59c02c04cb781ac689686db07 55107008 1.462760e+12 NaN
4 988e4873861347113519fbee6dd1c3b0 55107008 1.462760e+12 NaN
5 a2659df45c8d05f326225fa5b1063ac9 30900473 1.462760e+12 NaN
6 b8bbef76f8dfee2fe190a283cd5a19a7 18121481 1.462760e+12 NaN
7 ca3071aad676bc963795a2b09635cdf0 55107008 NaN 1.462770e+12
8 e8ebfc3f39512eda3aa0702b13ffed63 18121481 1.462760e+12 NaN
9 f81167c854f1a0c86cab6188f9995824 55107008 1.462760e+12 NaN
df1['diff'] = df1['start'] - df1['stop']
print(df1)
b users product start stop \
0 17412dec7d3d02c9b0b1c3d1c3571c5c 10655437 1.462760e+12 1.462760e+12
1 335644267c1d5f04eaea7bc6f51b1861 55107008 1.462760e+12 NaN
2 583361d66ad8b0827cd08d3a5d64af89 55107008 NaN 1.462760e+12
3 8fa683e59c02c04cb781ac689686db07 55107008 1.462760e+12 NaN
4 988e4873861347113519fbee6dd1c3b0 55107008 1.462760e+12 NaN
5 a2659df45c8d05f326225fa5b1063ac9 30900473 1.462760e+12 NaN
6 b8bbef76f8dfee2fe190a283cd5a19a7 18121481 1.462760e+12 NaN
7 ca3071aad676bc963795a2b09635cdf0 55107008 NaN 1.462770e+12
8 e8ebfc3f39512eda3aa0702b13ffed63 18121481 1.462760e+12 NaN
9 f81167c854f1a0c86cab6188f9995824 55107008 1.462760e+12 NaN
b diff
0 0.0
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
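If kk holds milliseconds since the Unix epoch (an assumption; values near 1.46276E+12 are consistent with May 2016), it can be converted with pd.to_datetime(..., unit='ms') before pivoting, so that the difference comes out as a Timedelta instead of a float. The sketch below uses a made-up three-row sample with hypothetical user ids u1 and u2, and computes stop - start so that a completed session has a positive duration:

```python
import pandas as pd

# Hypothetical sample: u1 has a complete session, u2 only has a start.
df = pd.DataFrame({
    'users': ['u1', 'u1', 'u2'],
    'product': [10655437, 10655437, 55107008],
    'b': ['start', 'stop', 'start'],
    'kk': [1462759200000, 1462762800000, 1462759200000],
})

# Assumption: kk is milliseconds since the Unix epoch.
df['when'] = pd.to_datetime(df['kk'], unit='ms')

df1 = pd.pivot_table(df, index=['users', 'product'], columns='b',
                     values='when', aggfunc='first').reset_index()

# stop - start is positive for a completed session, NaT when either side is missing.
df1['duration'] = df1['stop'] - df1['start']
print(df1)
```

Rows without a matching stop keep NaT in duration, the same way the float version above keeps NaN.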
EDIT:
You first have to convert the timstamp column with to_datetime, using the format parameter, and then add aggfunc='first' to pivot_table so that the values are aggregated by first:
df['timstamp'] = pd.to_datetime(df['timstamp'], format='%H:%M.%S')
df1 = pd.pivot_table(df, index=['users','product'], columns='b', values='timstamp', aggfunc='first').reset_index()
print(df1)
b users product start \
0 17412dec7d3d02c9b0b1c3d1c3571c5c 10655437 1900-01-01 00:00:01
1 335644267c1d5f04eaea7bc6f51b1861 55107008 1900-01-01 00:00:00
2 583361d66ad8b0827cd08d3a5d64af89 55107008 NaT
3 8fa683e59c02c04cb781ac689686db07 55107008 1900-01-01 00:00:00
4 988e4873861347113519fbee6dd1c3b0 55107008 1900-01-01 00:00:02
5 a2659df45c8d05f326225fa5b1063ac9 30900473 1900-01-01 00:00:01
6 b8bbef76f8dfee2fe190a283cd5a19a7 18121481 1900-01-01 00:00:01
7 ca3071aad676bc963795a2b09635cdf0 55107008 NaT
8 e8ebfc3f39512eda3aa0702b13ffed63 18121481 1900-01-01 00:00:01
9 f81167c854f1a0c86cab6188f9995824 55107008 1900-01-01 00:00:01
b stop
0 1900-01-01 00:00:00
1 NaT
2 1900-01-01 00:00:02
3 NaT
4 NaT
5 NaT
6 NaT
7 1900-01-01 00:00:00
8 NaT
9 NaT
df1['diff'] = df1['start'] - df1['stop']
print(df1)
b users product start \
0 17412dec7d3d02c9b0b1c3d1c3571c5c 10655437 1900-01-01 00:00:01
1 335644267c1d5f04eaea7bc6f51b1861 55107008 1900-01-01 00:00:00
2 583361d66ad8b0827cd08d3a5d64af89 55107008 NaT
3 8fa683e59c02c04cb781ac689686db07 55107008 1900-01-01 00:00:00
4 988e4873861347113519fbee6dd1c3b0 55107008 1900-01-01 00:00:02
5 a2659df45c8d05f326225fa5b1063ac9 30900473 1900-01-01 00:00:01
6 b8bbef76f8dfee2fe190a283cd5a19a7 18121481 1900-01-01 00:00:01
7 ca3071aad676bc963795a2b09635cdf0 55107008 NaT
8 e8ebfc3f39512eda3aa0702b13ffed63 18121481 1900-01-01 00:00:01
9 f81167c854f1a0c86cab6188f9995824 55107008 1900-01-01 00:00:01
b stop diff
0 1900-01-01 00:00:00 00:00:01
1 NaT NaT
2 1900-01-01 00:00:02 NaT
3 NaT NaT
4 NaT NaT
5 NaT NaT
6 NaT NaT
7 1900-01-01 00:00:00 NaT
8 NaT NaT
9 NaT NaT
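Most rows end up with NaT because one side of the session is missing. One possible follow-up, sketched here on a small made-up frame shaped like df1 (the user ids u1 to u3 and the seconds column are illustrative, not from the question), is to keep only complete sessions and express the duration in seconds:

```python
import pandas as pd

# Hypothetical pivoted frame with the same columns as df1 above.
df1 = pd.DataFrame({
    'users': ['u1', 'u2', 'u3'],
    'product': [10655437, 55107008, 55107008],
    'start': pd.to_datetime(['2016-05-09 02:00:00', '2016-05-09 02:00:00', None]),
    'stop': pd.to_datetime(['2016-05-09 02:30:00', None, '2016-05-09 02:00:02']),
})

# Keep only rows where both start and stop are present.
complete = df1.dropna(subset=['start', 'stop']).copy()

# Timedelta -> float seconds.
complete['seconds'] = (complete['stop'] - complete['start']).dt.total_seconds()
print(complete)
```

Whether incomplete sessions should be dropped or kept for inspection depends on the real data.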
EDIT1:
I created a new sample with the new datetime format:
import pandas as pd
df = pd.DataFrame({'kk': {0: 1462760000000.0, 1: 1462760000000.0, 2: 1462770000000.0, 3: 1462760000000.0, 4: 1462760000000.0, 5: 1462760000000.0, 6: 1462760000000.0, 7: 1462760000000.0, 8: 1462760000000.0, 9: 1462760000000.0, 10: 1462760000000.0},
'product': {0: 55107008, 1: 55107008, 2: 55107008, 3: 10655437, 4: 55107008, 5: 10655437, 6: 30900473, 7: 18121481, 8: 18121481, 9: 55107008, 10: 55107008},
'b': {0: 'start', 1: 'start', 2: 'stop', 3: 'stop', 4: 'start', 5: 'start', 6: 'start', 7: 'start', 8: 'start', 9: 'start', 10: 'stop'},
'users': {0: '8fa683e59c02c04cb781ac689686db07', 1: '335644267c1d5f04eaea7bc6f51b1861', 2: 'ca3071aad676bc963795a2b09635cdf0', 3: '17412dec7d3d02c9b0b1c3d1c3571c5c', 4: 'f81167c854f1a0c86cab6188f9995824', 5: '17412dec7d3d02c9b0b1c3d1c3571c5c', 6: 'a2659df45c8d05f326225fa5b1063ac9', 7: 'b8bbef76f8dfee2fe190a283cd5a19a7', 8: 'e8ebfc3f39512eda3aa0702b13ffed63', 9: '988e4873861347113519fbee6dd1c3b0', 10: '583361d66ad8b0827cd08d3a5d64af89'},
'timstamp': {0: '5/9/2016 2:00:00', 1: '5/9/2016 2:00:00', 2: '5/9/2016 2:00:00', 3: '5/9/2016 2:00:00', 4: '5/9/2016 2:00:00', 5: '5/9/2016 3:00:00', 6: '5/9/2016 2:00:00', 7: '5/9/2016 2:00:00', 8: '5/9/2016 2:00:00', 9: '5/9/2016 2:00:00', 10: '5/9/2016 2:00:00'}})
print(df)
b kk product timstamp \
0 start 1.462760e+12 55107008 5/9/2016 2:00:00
1 start 1.462760e+12 55107008 5/9/2016 2:00:00
2 stop 1.462770e+12 55107008 5/9/2016 2:00:00
3 stop 1.462760e+12 10655437 5/9/2016 2:00:00
4 start 1.462760e+12 55107008 5/9/2016 2:00:00
5 start 1.462760e+12 10655437 5/9/2016 3:00:00
6 start 1.462760e+12 30900473 5/9/2016 2:00:00
7 start 1.462760e+12 18121481 5/9/2016 2:00:00
8 start 1.462760e+12 18121481 5/9/2016 2:00:00
9 start 1.462760e+12 55107008 5/9/2016 2:00:00
10 stop 1.462760e+12 55107008 5/9/2016 2:00:00
users
0 8fa683e59c02c04cb781ac689686db07
1 335644267c1d5f04eaea7bc6f51b1861
2 ca3071aad676bc963795a2b09635cdf0
3 17412dec7d3d02c9b0b1c3d1c3571c5c
4 f81167c854f1a0c86cab6188f9995824
5 17412dec7d3d02c9b0b1c3d1c3571c5c
6 a2659df45c8d05f326225fa5b1063ac9
7 b8bbef76f8dfee2fe190a283cd5a19a7
8 e8ebfc3f39512eda3aa0702b13ffed63
9 988e4873861347113519fbee6dd1c3b0
10 583361d66ad8b0827cd08d3a5d64af89
df['timstamp'] = pd.to_datetime(df['timstamp'], format='%m/%d/%Y %H:%M:%S')
df1 = pd.pivot_table(df, index=['users','product'], columns='b', values='timstamp', aggfunc='first').reset_index()
df1['diff'] = df1['start'] - df1['stop']
print(df1)
b users product start \
0 17412dec7d3d02c9b0b1c3d1c3571c5c 10655437 2016-05-09 03:00:00
1 335644267c1d5f04eaea7bc6f51b1861 55107008 2016-05-09 02:00:00
2 583361d66ad8b0827cd08d3a5d64af89 55107008 NaT
3 8fa683e59c02c04cb781ac689686db07 55107008 2016-05-09 02:00:00
4 988e4873861347113519fbee6dd1c3b0 55107008 2016-05-09 02:00:00
5 a2659df45c8d05f326225fa5b1063ac9 30900473 2016-05-09 02:00:00
6 b8bbef76f8dfee2fe190a283cd5a19a7 18121481 2016-05-09 02:00:00
7 ca3071aad676bc963795a2b09635cdf0 55107008 NaT
8 e8ebfc3f39512eda3aa0702b13ffed63 18121481 2016-05-09 02:00:00
9 f81167c854f1a0c86cab6188f9995824 55107008 2016-05-09 02:00:00
b stop diff
0 2016-05-09 02:00:00 01:00:00
1 NaT NaT
2 2016-05-09 02:00:00 NaT
3 NaT NaT
4 NaT NaT
5 NaT NaT
6 NaT NaT
7 2016-05-09 02:00:00 NaT
8 NaT NaT
9 NaT NaT
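Note that aggfunc='first' keeps only the first start and the first stop for each (users, product) pair. If the same user can open several sessions on the same product, one possible extension is to number the events with groupby(...).cumcount() and add that number to the pivot index. This is only a sketch on made-up data: the session column and the ids u1, u2 are hypothetical, and it assumes that the n-th start of a pair belongs with its n-th stop.

```python
import pandas as pd

# Hypothetical sample: u1 has two sessions on product 1, u2 has an open one.
df = pd.DataFrame({
    'users': ['u1', 'u1', 'u1', 'u1', 'u2'],
    'product': [1, 1, 1, 1, 1],
    'b': ['start', 'stop', 'start', 'stop', 'start'],
    'timstamp': pd.to_datetime([
        '2016-05-09 02:00:00', '2016-05-09 02:10:00',
        '2016-05-09 03:00:00', '2016-05-09 03:05:00',
        '2016-05-09 02:00:00',
    ]),
})

# Order events in time, then give the n-th start and the n-th stop
# of each (users, product) pair the same session number.
df = df.sort_values('timstamp', kind='mergesort')
df['session'] = df.groupby(['users', 'product', 'b']).cumcount()

sessions = (df.pivot_table(index=['users', 'product', 'session'], columns='b',
                           values='timstamp', aggfunc='first')
              .reset_index())
sessions['duration'] = sessions['stop'] - sessions['start']
print(sessions)
```

If a stop is missing in the middle of a sequence, this pairing shifts later sessions, so the real data would need to be checked for that case.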