Python - Counting consecutive frequencies by group

Time: 2016-08-24 13:57:15

Tags: python pandas sequence frequency itertools

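I have a series of emails sorted by timestamp and user_id.

I want to investigate how often email i is followed by email j. I will display these frequencies across users in a heatmap to show the most common paths.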

a = """timestamp,email,subject
2016-07-01 10:17:00,a@gmail.com,subject2
2016-07-01 02:01:02,a@gmail.com,welcome
2016-07-01 14:45:04,a@gmail.com,subject3
2016-07-01 08:14:02,a@gmail.com,subject1
2016-07-01 16:26:35,a@gmail.com,subject4
2016-07-01 10:17:00,b@gmail.com,subject1
2016-07-01 02:01:02,b@gmail.com,welcome
2016-07-01 14:45:04,b@gmail.com,subject3
2016-07-01 08:14:02,b@gmail.com,subject2
2016-07-01 16:26:35,b@gmail.com,subject4
2016-07-01 18:00:00,c@gmail.com,welcome
2016-07-01 19:00:02,c@gmail.com,subject1
2016-07-01 20:00:04,c@gmail.com,subject3
2016-07-01 21:14:02,c@gmail.com,subject4
2016-07-01 21:26:35,c@gmail.com,subject2
"""

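from io import StringIO  # pandas.io.parsers no longer exposes StringIO

import pandas as pd

df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
# Sort within each user so that consecutive rows are consecutive emails
df1 = df1.sort_values(['email', 'timestamp'])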

Sorted df1:

        timestamp        email   subject
 1  2016-07-01 02:01:02  a@gmail.com   welcome
 3  2016-07-01 08:14:02  a@gmail.com  subject1
 0  2016-07-01 10:17:00  a@gmail.com  subject2
 2  2016-07-01 14:45:04  a@gmail.com  subject3
 4  2016-07-01 16:26:35  a@gmail.com  subject4
 6  2016-07-01 02:01:02  b@gmail.com   welcome
 8  2016-07-01 08:14:02  b@gmail.com  subject2
 5  2016-07-01 10:17:00  b@gmail.com  subject1
 7  2016-07-01 14:45:04  b@gmail.com  subject3
 9  2016-07-01 16:26:35  b@gmail.com  subject4
 10 2016-07-01 18:00:00  c@gmail.com   welcome
 11 2016-07-01 19:00:02  c@gmail.com  subject1
 12 2016-07-01 20:00:04  c@gmail.com  subject3
 13 2016-07-01 21:14:02  c@gmail.com  subject4
 14 2016-07-01 21:26:35  c@gmail.com  subject2

The output should look like this:

          welcome   subject1    subject2    subject3    subject4
welcome      0              
subject1     2         0                    
subject2     1         1          0     
subject3     0         2          1           0 
subject4     0         0          0           3             0

In other words, subject1 follows the welcome email 2 times, subject2 follows the welcome email 1 time, and so on.

What is the best way to do this?

1 Answer:

Answer 0: (score: 1)

A two-liner (which can be condensed into a one-liner):

df1['next_subject'] = df1.groupby('email')['subject'].shift(-1)
res = pd.crosstab(df1['next_subject'], df1['subject'])
print(res)

# subject       subject1  subject2  subject3  subject4  welcome
# next_subject                                                 
# subject1             0         1         0         0        2
# subject2             1         0         0         1        1
# subject3             2         1         0         0        0
# subject4             0         0         3         0        0
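
For reference, the condensed one-liner form could look roughly like the sketch below. The small `data` sample and the name `one_liner` are made up for illustration; `rownames` and `colnames` are passed explicitly because both input Series would otherwise be named `subject`.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample in the same shape as the question's data.
data = """timestamp,email,subject
2016-07-01 02:01:02,a@gmail.com,welcome
2016-07-01 08:14:02,a@gmail.com,subject1
2016-07-01 10:17:00,a@gmail.com,subject2
2016-07-01 02:01:02,b@gmail.com,welcome
2016-07-01 08:14:02,b@gmail.com,subject2
"""
df = pd.read_csv(StringIO(data), parse_dates=["timestamp"])
df = df.sort_values(["email", "timestamp"])

# shift(-1) inside each email group pairs every row with the subject that
# follows it for the same user; the last row of each group gets NaN, and
# crosstab drops those NaN pairs by default.
one_liner = pd.crosstab(
    df.groupby("email")["subject"].shift(-1),
    df["subject"],
    rownames=["next_subject"],
    colnames=["subject"],
)
print(one_liner)
```

Because the shift happens per group, the last email of one user is never paired with the first email of the next user.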

You can massage this a bit to get the exact form you cited in the OP:

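subjects = ['welcome'] + ['subject{}'.format(i) for i in range(1, 5)]
# reindex adds the missing 'welcome' row as NaN; .loc with a missing label
# raises KeyError in current pandas versions
res = res.reindex(index=subjects, columns=subjects).fillna(0).astype(int)
print(res)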

# subject       welcome  subject1  subject2  subject3  subject4
# next_subject                                                 
# welcome             0         0         0         0         0
# subject1            2         0         1         0         0
# subject2            1         1         0         0         1
# subject3            0         2         1         0         0
# subject4            0         0         0         3         0
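
Since the question is also tagged itertools, here is a minimal sketch of the same counting idea using only the standard library. It is not part of the answer above. It assumes the rows are already sorted by email and then by timestamp, and the `rows` sample and the `transitions` name are made up for illustration.

```python
from collections import Counter
from itertools import groupby

# Hypothetical (email, subject) rows, already sorted by email, then timestamp.
rows = [
    ("a@gmail.com", "welcome"),
    ("a@gmail.com", "subject1"),
    ("a@gmail.com", "subject2"),
    ("b@gmail.com", "welcome"),
    ("b@gmail.com", "subject2"),
    ("b@gmail.com", "subject1"),
]

transitions = Counter()
for _, group in groupby(rows, key=lambda row: row[0]):
    subjects = [subject for _, subject in group]
    # zip pairs each subject with the one that follows it for the same user.
    transitions.update(zip(subjects, subjects[1:]))

for (current, following), count in sorted(transitions.items()):
    print(f"{current} -> {following}: {count}")
```

Each key in `transitions` is a `(subject, next_subject)` pair, so the counts correspond to the cells of the crosstab. Pairs that never occur are simply absent and read as 0 from the `Counter`.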