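I have a series of emails sorted by timestamp and user_id.
I want to find out how often email i is followed by email j. I will display these frequencies across users in a heatmap to show the most common paths.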
a = """timestamp,email,subject
2016-07-01 10:17:00,a@gmail.com,subject2
2016-07-01 02:01:02,a@gmail.com,welcome
2016-07-01 14:45:04,a@gmail.com,subject3
2016-07-01 08:14:02,a@gmail.com,subject1
2016-07-01 16:26:35,a@gmail.com,subject4
2016-07-01 10:17:00,b@gmail.com,subject1
2016-07-01 02:01:02,b@gmail.com,welcome
2016-07-01 14:45:04,b@gmail.com,subject3
2016-07-01 08:14:02,b@gmail.com,subject2
2016-07-01 16:26:35,b@gmail.com,subject4
2016-07-01 18:00:00,c@gmail.com,welcome
2016-07-01 19:00:02,c@gmail.com,subject1
2016-07-01 20:00:04,c@gmail.com,subject3
2016-07-01 21:14:02,c@gmail.com,subject4
2016-07-01 21:26:35,c@gmail.com,subject2
"""
import pandas as pd
from io import StringIO  # standard-library StringIO; importing it from pandas.io.parsers fails on recent pandas

df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
df1 = df1.sort_values(['email', 'timestamp'])
Sorted df1:
              timestamp        email   subject
1   2016-07-01 02:01:02  a@gmail.com   welcome
3   2016-07-01 08:14:02  a@gmail.com  subject1
0   2016-07-01 10:17:00  a@gmail.com  subject2
2   2016-07-01 14:45:04  a@gmail.com  subject3
4   2016-07-01 16:26:35  a@gmail.com  subject4
6   2016-07-01 02:01:02  b@gmail.com   welcome
8   2016-07-01 08:14:02  b@gmail.com  subject2
5   2016-07-01 10:17:00  b@gmail.com  subject1
7   2016-07-01 14:45:04  b@gmail.com  subject3
9   2016-07-01 16:26:35  b@gmail.com  subject4
10  2016-07-01 18:00:00  c@gmail.com   welcome
11  2016-07-01 19:00:02  c@gmail.com  subject1
12  2016-07-01 20:00:04  c@gmail.com  subject3
13  2016-07-01 21:14:02  c@gmail.com  subject4
14  2016-07-01 21:26:35  c@gmail.com  subject2
The output should look like this:
          welcome  subject1  subject2  subject3  subject4
welcome         0
subject1        2         0
subject2        1         1         0
subject3        0         2         1         0
subject4        0         0         0         3         0
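In other words, subject1 follows the welcome email 2 times, subject2 follows the welcome email 1 time, and so on.
What is the best way to do this?
Answer 0 (score: 1)
A two-liner (which can be compressed into a one-liner):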
df1['next_subject'] = df1.groupby('email')['subject'].shift(-1)
res = pd.crosstab(df1['next_subject'], df1['subject'])
print(res)
# subject       subject1  subject2  subject3  subject4  welcome
# next_subject
# subject1             0         1         0         0        2
# subject2             1         0         0         1        1
# subject3             2         1         0         0        0
# subject4             0         0         3         0        0
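To make the two steps above easier to follow, here is a minimal sketch on a made-up toy frame (the `toy` data and its short user labels are illustrative assumptions, not part of the question). It shows that a grouped `shift(-1)` leaves `NaN` for each user's last email instead of leaking into the next user's rows, and that `pd.crosstab` then counts only the rows that have a real follow-up.

```python
import pandas as pd

# Hypothetical toy data: two users, rows already sorted by user and time.
toy = pd.DataFrame({
    "email": ["a", "a", "a", "b", "b"],
    "subject": ["welcome", "subject1", "subject2", "welcome", "subject2"],
})

# shift(-1) inside each group pulls the following row's subject upward,
# so the last email of every user gets NaN as its next_subject.
toy["next_subject"] = toy.groupby("email")["subject"].shift(-1)
print(toy)

# Rows whose next_subject is NaN are left out of the counts,
# so only the three real transitions are tallied.
counts = pd.crosstab(toy["next_subject"], toy["subject"])
print(counts)
```

Because the shift happens per group, the frame has to be sorted by timestamp within each user first, as `df1` is in the question.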
You can massage this a little to get the exact form shown in the question:
subjects = ['welcome'] + ['subject{}'.format(i) for i in range(1, 5)]
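# reindex adds the missing 'welcome' row; .loc raises a KeyError for absent labels on recent pandas
res = res.reindex(index=subjects, columns=subjects, fill_value=0)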
print(res)
# subject       welcome  subject1  subject2  subject3  subject4
# next_subject
# welcome             0         0         0         0         0
# subject1            2         0         1         0         0
# subject2            1         1         0         0         1
# subject3            0         2         1         0         0
# subject4            0         0         0         3         0
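Since the question asks about frequencies for a heatmap, one possible extension (not part of the original answer) is to turn the counts into per-column shares with the `normalize` argument of `pd.crosstab`. The sketch below rebuilds the per-user subject order from the sorted `df1` in a hypothetical `paths` dict so that it runs on its own; the names `paths`, `df` and `share` are assumptions made for illustration.

```python
import pandas as pd

# Subject order per user, copied from the sorted df1 in the question.
paths = {
    "a@gmail.com": ["welcome", "subject1", "subject2", "subject3", "subject4"],
    "b@gmail.com": ["welcome", "subject2", "subject1", "subject3", "subject4"],
    "c@gmail.com": ["welcome", "subject1", "subject3", "subject4", "subject2"],
}
df = pd.DataFrame(
    [(email, s) for email, seq in paths.items() for s in seq],
    columns=["email", "subject"],
)
df["next_subject"] = df.groupby("email")["subject"].shift(-1)

# normalize="columns" divides every column by its total, giving the share
# of each follow-up among all emails that came after a given subject.
share = pd.crosstab(df["next_subject"], df["subject"], normalize="columns")

# Same ordering as in the question; 'welcome' never follows anything,
# so its row is added and filled with zeros.
subjects = ["welcome"] + ["subject{}".format(i) for i in range(1, 5)]
share = share.reindex(index=subjects, columns=subjects, fill_value=0.0)
print(share.round(2))
```

A table like `share` can then be handed to a heatmap plotting function, for example `seaborn.heatmap`, if that library is available.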