Avoiding double counting in a pandas merge

Posted: 2015-04-29 00:49:57

Tags: python pandas

Question 1:

I have summarized two large (for me) files going back to January 2014. One is about 17 million rows and the other is about 3 million rows. I aggregated them on the Date and PersonID fields and summed everything down to one row per combination.

File 1 (I removed duplicates, so a PersonID can only have one visit per CustomerID per Date):

Date | PersonID | CustomerID | Sum of Visits

File 2 (CustomerID isn't recorded completely here, so I don't include it; I want the total chats, so I'm not leaving out such a large chunk of data):

Date | PersonID | Sum of Chats

When I run pd.merge(file1, file2, how='left'), I end up badly inflating the chat counts from File 2. That's because a PersonID can have multiple CustomerIDs on the same Date, so if they had several chats, those chats get added onto every matching row. That doesn't work well when I load the data into Tableau and sum it up. (The end result I'm after is total visits divided by total chats per PersonID, to create a ratio.) What's the best approach here?
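To make the problem concrete, here is a tiny made-up reproduction; the frames, names, and numbers below are hypothetical and only show the fan-out:

import pandas as pd

# Miniature stand-ins for the two aggregated files.
file1 = pd.DataFrame({'Date': ['2014-01-01', '2014-01-01'],
                      'PersonID': ['Foo', 'Foo'],
                      'CustomerID': ['aaa', 'bbb'],
                      'Sum of Visits': [2, 1]})
file2 = pd.DataFrame({'Date': ['2014-01-01'],
                      'PersonID': ['Foo'],
                      'Sum of Chats': [5]})

# Both File 1 rows match the single File 2 row, so the 5 chats
# appear twice and sum to 10 once Tableau adds them up.
print(pd.merge(file1, file2, how='left', on=['Date', 'PersonID']))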

Question 2:

Once the aggregated files are done, I'd like to merge the two files again at the granular row level. My problem is that File 2 can genuinely have multiple Chats for the same PersonID on the same Date. Is there a way to join/merge this with File 1, where there is only one record per PersonID + Date + CustomerID, without creating duplicate visits from the first file? (A tiny made-up sketch of what I mean follows the schema below.)

File 2:

Date | PersonID | CustomerID | Count of Chat
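Sketch of the row-level shape, with hypothetical values:

import pandas as pd

# File 1: one visit record per Date + PersonID + CustomerID.
file1 = pd.DataFrame({'Date': ['2014-01-01'],
                      'PersonID': ['Foo'],
                      'CustomerID': ['aaa'],
                      'Sum of Visits': [1]})

# File 2: the same Date + PersonID can appear on several chat rows.
file2 = pd.DataFrame({'Date': ['2014-01-01', '2014-01-01'],
                      'PersonID': ['Foo', 'Foo'],
                      'Count of Chat': [1, 1]})

# The single File 1 row comes back once per matching chat row,
# which duplicates the visit I want to keep unique.
print(pd.merge(file1, file2, how='left', on=['Date', 'PersonID']))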

1 answer:

Answer 0 (score: 1):

Assuming my picture of your data is close, here is my approach.

First, reproducible data (with pandas imported as pd).

In [2]: d1 = {'Date': {0: pd.Timestamp('2010-01-01 00:00:00'), 1: pd.Timestamp('2010-01-02 00:00:00'), 2: pd.Timestamp('2010-01-03 00:00:00'), 3: pd.Timestamp('2010-01-03 00:00:00'), 4: pd.Timestamp('2010-01-03 00:00:00'), 5: pd.Timestamp('2010-01-06 00:00:00'), 6: pd.Timestamp('2010-01-06 00:00:00'), 7: pd.Timestamp('2010-01-06 00:00:00'), 8: pd.Timestamp('2010-01-09 00:00:00'), 9: pd.Timestamp('2010-01-10 00:00:00'), 10: pd.Timestamp('2010-01-11 00:00:00'), 11: pd.Timestamp('2010-01-12 00:00:00'), 12: pd.Timestamp('2010-01-12 00:00:00'), 13: pd.Timestamp('2010-01-12 00:00:00'), 14: pd.Timestamp('2010-01-12 00:00:00'), 15: pd.Timestamp('2010-01-12 00:00:00'), 16: pd.Timestamp('2010-01-17 00:00:00'), 17: pd.Timestamp('2010-01-17 00:00:00'), 18: pd.Timestamp('2010-01-17 00:00:00'), 19: pd.Timestamp('2010-01-17 00:00:00')}, 'PersonID': {0: 'Foo', 1: 'Bar', 2: 'Foo', 3: 'Bar', 4: 'Foo', 5: 'Bar', 6: 'Foo', 7: 'Bar', 8: 'Foo', 9: 'Bar', 10: 'Foo', 11: 'Bar', 12: 'Foo', 13: 'Bar', 14: 'Foo', 15: 'Bar', 16: 'Foo', 17: 'Bar', 18: 'Foo', 19: 'Bar'}, 'CustomerID': {0: 'aaa', 1: 'bbb', 2: 'ccc', 3: 'ddd', 4: 'eee', 5: 'fff', 6: 'ggg', 7: 'hhh', 8: 'iii', 9: 'jjj', 10: 'kkk', 11: 'lll', 12: 'mmm', 13: 'nnn', 14: 'ooo', 15: 'ppp', 16: 'qqq', 17: 'rrr', 18: 'sss', 19: 'ttt'}}
   ...: 
   ...: d2 = {'Date': {0: pd.Timestamp('2010-01-01 00:00:00'), 1: pd.Timestamp('2010-01-02 00:00:00'), 2: pd.Timestamp('2010-01-03 00:00:00'), 3: pd.Timestamp('2010-01-06 00:00:00'), 4: pd.Timestamp('2010-01-09 00:00:00'), 5: pd.Timestamp('2010-01-10 00:00:00'), 6: pd.Timestamp('2010-01-11 00:00:00'), 7: pd.Timestamp('2010-01-12 00:00:00'), 8: pd.Timestamp('2010-01-17 00:00:00'), 9: pd.Timestamp('2010-01-01 00:00:00'), 10: pd.Timestamp('2010-01-02 00:00:00'), 11: pd.Timestamp('2010-01-03 00:00:00'), 12: pd.Timestamp('2010-01-06 00:00:00'), 13: pd.Timestamp('2010-01-09 00:00:00'), 14: pd.Timestamp('2010-01-10 00:00:00'), 15: pd.Timestamp('2010-01-11 00:00:00'), 16: pd.Timestamp('2010-01-12 00:00:00'), 17: pd.Timestamp('2010-01-17 00:00:00')}, 'PersonID': {0: 'Foo', 1: 'Foo', 2: 'Foo', 3: 'Foo', 4: 'Foo', 5: 'Foo', 6: 'Foo', 7: 'Foo', 8: 'Foo', 9: 'Bar', 10: 'Bar', 11: 'Bar', 12: 'Bar', 13: 'Bar', 14: 'Bar', 15: 'Bar', 16: 'Bar', 17: 'Bar'}, 'Sum of Chats': {0: 5.0, 1: 3.0, 2: 24.0, 3: 7.0, 4: 15.0, 5: 9.0, 6: 16.0, 7: 22.0, 8: 14.0, 9: 8.0, 10: 15.0, 11: 14.0, 12: 29.0, 13: 11.0, 14: 6.0, 15: 14.0, 16: 30.0, 17: 12.0}}

In [3]: df1 = pd.DataFrame.from_dict(d1)
   ...: df2 = pd.DataFrame.from_dict(d2)

The above produces the following DataFrames.

# File 1

   CustomerID       Date PersonID
0         aaa 2010-01-01      Foo
1         bbb 2010-01-02      Bar
2         ccc 2010-01-03      Foo
3         ddd 2010-01-03      Bar
4         eee 2010-01-03      Foo
5         fff 2010-01-06      Bar
6         ggg 2010-01-06      Foo
7         hhh 2010-01-06      Bar
8         iii 2010-01-09      Foo
9         jjj 2010-01-10      Bar
10        kkk 2010-01-11      Foo
11        lll 2010-01-12      Bar
12        mmm 2010-01-12      Foo
13        nnn 2010-01-12      Bar
14        ooo 2010-01-12      Foo
15        ppp 2010-01-12      Bar
16        qqq 2010-01-17      Foo
17        rrr 2010-01-17      Bar
18        sss 2010-01-17      Foo
19        ttt 2010-01-17      Bar

# File 2

         Date PersonID  Sum of Chats
0  2010-01-01      Foo             5
1  2010-01-02      Foo             3
2  2010-01-03      Foo            24
3  2010-01-06      Foo             7
4  2010-01-09      Foo            15
5  2010-01-10      Foo             9
6  2010-01-11      Foo            16
7  2010-01-12      Foo            22
8  2010-01-17      Foo            14
9  2010-01-01      Bar             8
10 2010-01-02      Bar            15
11 2010-01-03      Bar            14
12 2010-01-06      Bar            29
13 2010-01-09      Bar            11
14 2010-01-10      Bar             6
15 2010-01-11      Bar            14
16 2010-01-12      Bar            30
17 2010-01-17      Bar            12

If you want to count the number of visits using CustomerID, you can use pivot_table for a quick aggregation.

In [4]: df1 = df1.pivot_table(index=['Date', 'PersonID'], values='CustomerID', aggfunc=len)
   ...: print(df1)
Date        PersonID
2010-01-01  Foo         1
2010-01-02  Bar         1
2010-01-03  Bar         1
            Foo         2
2010-01-06  Bar         2
            Foo         1
2010-01-09  Foo         1
2010-01-10  Bar         1
2010-01-11  Foo         1
2010-01-12  Bar         3
            Foo         2
2010-01-17  Bar         2
            Foo         2
Name: CustomerID, dtype: int64

When aggregating, I tend to pair this with reset_index rather than use other methods, because the pivot gives me meaningful data like the above.

In [5]: df1 = df1.reset_index(); print(df1)
         Date PersonID  CustomerID
0  2010-01-01      Foo           1
1  2010-01-02      Bar           1
2  2010-01-03      Bar           1
3  2010-01-03      Foo           2
4  2010-01-06      Bar           2
5  2010-01-06      Foo           1
6  2010-01-09      Foo           1
7  2010-01-10      Bar           1
8  2010-01-11      Foo           1
9  2010-01-12      Bar           3
10 2010-01-12      Foo           2
11 2010-01-17      Bar           2
12 2010-01-17      Foo           2
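For reference, the same per-Date, per-PersonID counts can also be produced with a groupby; here is a sketch that assumes the original File 1 frame (df1 as it was before the pivot above) is still available under a hypothetical name such as file1_raw:

# Count visit rows per Date + PersonID and bring the keys back
# out of the index as ordinary columns.
visits = (file1_raw.groupby(['Date', 'PersonID'])
                   .size()
                   .reset_index(name='CustomerID'))
print(visits)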

So that gets us started. The remaining step is to merge this with the second DataFrame, which has the chats per person per date.

In [6]: df = pd.merge(df1, df2, how='outer', sort=True)
   ...: print(df)
         Date PersonID  CustomerID  Sum of Chats
0  2010-01-01      Bar         NaN             8
1  2010-01-01      Foo           1             5
2  2010-01-02      Bar           1            15
3  2010-01-02      Foo         NaN             3
4  2010-01-03      Bar           1            14
5  2010-01-03      Foo           2            24
6  2010-01-06      Bar           2            29
7  2010-01-06      Foo           1             7
8  2010-01-09      Bar         NaN            11
9  2010-01-09      Foo           1            15
10 2010-01-10      Bar           1             6
11 2010-01-10      Foo         NaN             9
12 2010-01-11      Bar         NaN            14
13 2010-01-11      Foo           1            16
14 2010-01-12      Bar           3            30
15 2010-01-12      Foo           2            22
16 2010-01-17      Bar           2            12
17 2010-01-17      Foo           2            14

The NaNs, of course, are an artifact of my sloppy mock data setup. From here it is just straightforward arithmetic.
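For completeness, a sketch of the ratio the question asks for (total visits divided by total chats per PersonID), computed from the merged frame; the Ratio column name is just a placeholder:

# Sum the visit counts (held in the CustomerID column after the pivot)
# and the chats per person, then take the ratio. sum() skips the NaNs.
per_person = df.groupby('PersonID')[['CustomerID', 'Sum of Chats']].sum()
per_person['Ratio'] = per_person['CustomerID'] / per_person['Sum of Chats']
print(per_person)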

Let me know if this helps.