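Question 1:
I have aggregated two large (for me) files that go back to January 2014. One has 17 million rows and the other about 3 million rows. I aggregated them on the Date and PersonID fields, summing a column that contains just 1 in each row.
File 1 (I removed duplicates, so a PersonID can have only 1 visit per CustomerID per date):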
Date | PersonID | CustomerID | Sum of Visits
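File 2 (there is no complete record of CustomerID here, so I did not include it; I want the total chats, so that I do not leave out such a large chunk of data):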
Date | PersonID | Sum of Chats
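When I run pd.merge(file1, file2, how='left'), I end up badly inflating the number of chats from File 2. This happens because a PersonID can have several CustomerIDs on the same date, so if they had multiple chats, those chats are added to every one of those rows. This does not work well when I load the data into Tableau and sum it up. (The end result I am looking for is total visits divided by total chats for each PersonID, to create a ratio.) What is the best approach here?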
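Question 2:
Once the aggregated files are done, I would like to merge the two files again at the granular row level. My problem is that File 2 can legitimately have multiple chats for the same PersonID on the same Date. Is there a way to join/merge it with File 1, which has only one record per PersonID + Date + CustomerID, without creating duplicate visits from the first file?
File 2: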
Date | PersonID | CustomerID | Count of Chat
Answer 0 (score: 1)
Assuming my picture of your data is close, here is my approach.
First, reproducible data.
In [1]: import pandas as pd
In [2]: d1 = {'Date': {0: pd.Timestamp('2010-01-01 00:00:00'), 1: pd.Timestamp('2010-01-02 00:00:00'), 2: pd.Timestamp('2010-01-03 00:00:00'), 3: pd.Timestamp('2010-01-03 00:00:00'), 4: pd.Timestamp('2010-01-03 00:00:00'), 5: pd.Timestamp('2010-01-06 00:00:00'), 6: pd.Timestamp('2010-01-06 00:00:00'), 7: pd.Timestamp('2010-01-06 00:00:00'), 8: pd.Timestamp('2010-01-09 00:00:00'), 9: pd.Timestamp('2010-01-10 00:00:00'), 10: pd.Timestamp('2010-01-11 00:00:00'), 11: pd.Timestamp('2010-01-12 00:00:00'), 12: pd.Timestamp('2010-01-12 00:00:00'), 13: pd.Timestamp('2010-01-12 00:00:00'), 14: pd.Timestamp('2010-01-12 00:00:00'), 15: pd.Timestamp('2010-01-12 00:00:00'), 16: pd.Timestamp('2010-01-17 00:00:00'), 17: pd.Timestamp('2010-01-17 00:00:00'), 18: pd.Timestamp('2010-01-17 00:00:00'), 19: pd.Timestamp('2010-01-17 00:00:00')}, 'PersonID': {0: 'Foo', 1: 'Bar', 2: 'Foo', 3: 'Bar', 4: 'Foo', 5: 'Bar', 6: 'Foo', 7: 'Bar', 8: 'Foo', 9: 'Bar', 10: 'Foo', 11: 'Bar', 12: 'Foo', 13: 'Bar', 14: 'Foo', 15: 'Bar', 16: 'Foo', 17: 'Bar', 18: 'Foo', 19: 'Bar'}, 'CustomerID': {0: 'aaa', 1: 'bbb', 2: 'ccc', 3: 'ddd', 4: 'eee', 5: 'fff', 6: 'ggg', 7: 'hhh', 8: 'iii', 9: 'jjj', 10: 'kkk', 11: 'lll', 12: 'mmm', 13: 'nnn', 14: 'ooo', 15: 'ppp', 16: 'qqq', 17: 'rrr', 18: 'sss', 19: 'ttt'}}
...:
...: d2 = {'Date': {0: pd.Timestamp('2010-01-01 00:00:00'), 1: pd.Timestamp('2010-01-02 00:00:00'), 2: pd.Timestamp('2010-01-03 00:00:00'), 3: pd.Timestamp('2010-01-06 00:00:00'), 4: pd.Timestamp('2010-01-09 00:00:00'), 5: pd.Timestamp('2010-01-10 00:00:00'), 6: pd.Timestamp('2010-01-11 00:00:00'), 7: pd.Timestamp('2010-01-12 00:00:00'), 8: pd.Timestamp('2010-01-17 00:00:00'), 9: pd.Timestamp('2010-01-01 00:00:00'), 10: pd.Timestamp('2010-01-02 00:00:00'), 11: pd.Timestamp('2010-01-03 00:00:00'), 12: pd.Timestamp('2010-01-06 00:00:00'), 13: pd.Timestamp('2010-01-09 00:00:00'), 14: pd.Timestamp('2010-01-10 00:00:00'), 15: pd.Timestamp('2010-01-11 00:00:00'), 16: pd.Timestamp('2010-01-12 00:00:00'), 17: pd.Timestamp('2010-01-17 00:00:00')}, 'PersonID': {0: 'Foo', 1: 'Foo', 2: 'Foo', 3: 'Foo', 4: 'Foo', 5: 'Foo', 6: 'Foo', 7: 'Foo', 8: 'Foo', 9: 'Bar', 10: 'Bar', 11: 'Bar', 12: 'Bar', 13: 'Bar', 14: 'Bar', 15: 'Bar', 16: 'Bar', 17: 'Bar'}, 'Sum of Chats': {0: 5.0, 1: 3.0, 2: 24.0, 3: 7.0, 4: 15.0, 5: 9.0, 6: 16.0, 7: 22.0, 8: 14.0, 9: 8.0, 10: 15.0, 11: 14.0, 12: 29.0, 13: 11.0, 14: 6.0, 15: 14.0, 16: 30.0, 17: 12.0}}
In [3]: df1 = pd.DataFrame.from_dict(d1)
...: df2 = pd.DataFrame.from_dict(d2)
The above produces the following dataframes.
# File 1
CustomerID Date PersonID
0 aaa 2010-01-01 Foo
1 bbb 2010-01-02 Bar
2 ccc 2010-01-03 Foo
3 ddd 2010-01-03 Bar
4 eee 2010-01-03 Foo
5 fff 2010-01-06 Bar
6 ggg 2010-01-06 Foo
7 hhh 2010-01-06 Bar
8 iii 2010-01-09 Foo
9 jjj 2010-01-10 Bar
10 kkk 2010-01-11 Foo
11 lll 2010-01-12 Bar
12 mmm 2010-01-12 Foo
13 nnn 2010-01-12 Bar
14 ooo 2010-01-12 Foo
15 ppp 2010-01-12 Bar
16 qqq 2010-01-17 Foo
17 rrr 2010-01-17 Bar
18 sss 2010-01-17 Foo
19 ttt 2010-01-17 Bar
# File 2
Date PersonID Sum of Chats
0 2010-01-01 Foo 5
1 2010-01-02 Foo 3
2 2010-01-03 Foo 24
3 2010-01-06 Foo 7
4 2010-01-09 Foo 15
5 2010-01-10 Foo 9
6 2010-01-11 Foo 16
7 2010-01-12 Foo 22
8 2010-01-17 Foo 14
9 2010-01-01 Bar 8
10 2010-01-02 Bar 15
11 2010-01-03 Bar 14
12 2010-01-06 Bar 29
13 2010-01-09 Bar 11
14 2010-01-10 Bar 6
15 2010-01-11 Bar 14
16 2010-01-12 Bar 30
17 2010-01-17 Bar 12
If you want to count visits using CustomerID, you can use pivot_table to summarize quickly.
In [4]: df1 = df1.pivot_table(index=['Date','PersonID'], values='CustomerID', aggfunc=len)
...: print(df1)
Date PersonID
2010-01-01 Foo 1
2010-01-02 Bar 1
2010-01-03 Bar 1
Foo 2
2010-01-06 Bar 2
Foo 1
2010-01-09 Foo 1
2010-01-10 Bar 1
2010-01-11 Foo 1
2010-01-12 Bar 3
Foo 2
2010-01-17 Bar 2
Foo 2
Name: CustomerID, dtype: int64
When summarizing, I tend to combine this with reset_index rather than use other methods, because pivoting gives me meaningful data like the above.
In [5]: df1 = df1.reset_index(); print(df1)
Date PersonID CustomerID
0 2010-01-01 Foo 1
1 2010-01-02 Bar 1
2 2010-01-03 Bar 1
3 2010-01-03 Foo 2
4 2010-01-06 Bar 2
5 2010-01-06 Foo 1
6 2010-01-09 Foo 1
7 2010-01-10 Bar 1
8 2010-01-11 Foo 1
9 2010-01-12 Bar 3
10 2010-01-12 Foo 2
11 2010-01-17 Bar 2
12 2010-01-17 Foo 2
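For comparison, the same per-day visit count can also be sketched with groupby instead of pivot_table. This is an alternative, not what the answer above uses. The `visits` frame is a hypothetical miniature of File 1, and the "Sum of Visits" column name is borrowed from the question.

```python
import pandas as pd

# Hypothetical miniature of File 1: one row per Date / PersonID / CustomerID visit.
visits = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-03", "2010-01-03", "2010-01-03"]),
    "PersonID": ["Foo", "Bar", "Foo"],
    "CustomerID": ["ccc", "ddd", "eee"],
})

# Count rows per (Date, PersonID), then turn the group keys back into columns
# and name the count column in the same step.
visit_counts = (
    visits.groupby(["Date", "PersonID"])
          .size()
          .reset_index(name="Sum of Visits")
)
print(visit_counts)
```

Naming the count column here avoids ending up with a column called CustomerID that actually holds counts.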
So that is a start. The remaining step is to merge it with the second dataframe to get the chats per person per date.
In [6]: df = pd.merge(df1, df2, how='outer', sort=True)
...: print(df)
Date PersonID CustomerID Sum of Chats
0 2010-01-01 Bar NaN 8
1 2010-01-01 Foo 1 5
2 2010-01-02 Bar 1 15
3 2010-01-02 Foo NaN 3
4 2010-01-03 Bar 1 14
5 2010-01-03 Foo 2 24
6 2010-01-06 Bar 2 29
7 2010-01-06 Foo 1 7
8 2010-01-09 Bar NaN 11
9 2010-01-09 Foo 1 15
10 2010-01-10 Bar 1 6
11 2010-01-10 Foo NaN 9
12 2010-01-11 Bar NaN 14
13 2010-01-11 Foo 1 16
14 2010-01-12 Bar 3 30
15 2010-01-12 Foo 2 22
16 2010-01-17 Bar 2 12
17 2010-01-17 Foo 2 14
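Since the original problem was chats being repeated across rows, it may help to spell out the join keys and let pandas check that each (Date, PersonID) pair appears only once on each side. The sketch below uses small hypothetical frames (`visit_counts`, `chats`) and assumes both files have already been aggregated to one row per Date and PersonID.

```python
import pandas as pd

# Hypothetical aggregated inputs: one row per (Date, PersonID) on each side.
visit_counts = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-01", "2010-01-03", "2010-01-03"]),
    "PersonID": ["Foo", "Bar", "Foo"],
    "Sum of Visits": [1, 1, 2],
})
chats = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-01", "2010-01-03", "2010-01-03"]),
    "PersonID": ["Foo", "Bar", "Foo"],
    "Sum of Chats": [5.0, 14.0, 24.0],
})

merged = pd.merge(
    visit_counts,
    chats,
    on=["Date", "PersonID"],   # explicit keys instead of "all shared columns"
    how="outer",
    sort=True,
    validate="one_to_one",     # raises MergeError if a key repeats on either side
)

# The chat total is unchanged by the merge, so nothing was duplicated.
print(merged["Sum of Chats"].sum() == chats["Sum of Chats"].sum())
```

If File 1 still had one row per CustomerID, the validate check would raise an error instead of silently repeating the chat counts.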
Of course, the NaNs are an artifact of my faulty mock data setup. From here, it is just straightforward calculation.
Let me know if this helps.
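As a rough sketch of that calculation, the ratio the question asks for (total visits divided by total chats per PersonID) could be computed as below. The `merged` frame is a hypothetical stand-in for the merged result, with the count column assumed to be renamed "Sum of Visits". Treating a missing visit count as zero visits is also an assumption.

```python
import pandas as pd

nan = float("nan")

# Hypothetical stand-in for the merged frame, including missing visit counts.
merged = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-01", "2010-01-01", "2010-01-02", "2010-01-02"]),
    "PersonID": ["Bar", "Foo", "Bar", "Foo"],
    "Sum of Visits": [nan, 1.0, 1.0, nan],
    "Sum of Chats": [8.0, 5.0, 15.0, 3.0],
})

# Assumption: a missing visit count means no visits were recorded that day.
merged["Sum of Visits"] = merged["Sum of Visits"].fillna(0)

# Total visits and chats per person, then the visits-to-chats ratio.
totals = merged.groupby("PersonID")[["Sum of Visits", "Sum of Chats"]].sum()
totals["Visits per Chat"] = totals["Sum of Visits"] / totals["Sum of Chats"]
print(totals)
```

Summing first and dividing afterwards gives the ratio of totals. That is different from averaging daily ratios, and it avoids dividing by zero on days with no chats.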