Python / Pandas: Grouping and counting records by date and ID

Time: 2017-04-06 19:29:23

Tags: python pandas grouping counting data-munging

I have a relatively large data frame in Python (~10^6 records), structured as follows:

Index,Date,City,State,ID,County,Age,A,B,C
0,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,9/1/16,X,AL,360,BB County,1.0,negative,negative,negative
2,9/1/16,X,AL,360,BB County,10.0,negative,negative,negative
3,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,9/1/16,X,AR,728,JJ County,3.0,negative,negative,negative
6,9/1/16,X,AR,728,JJ County,8.0,negative,negative,negative
7,9/1/16,X,AR,728,JJ County,8.0,negative,negative,negative
8,9/1/16,X,AR,728,JJ County,14.0,negative,negative,negative
9,9/1/16,X,AR,728,JJ County,5.0,negative,negative,negative
...

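I am trying to group by date (day) and ID, and then count 1) the total number of records per day and ID, and 2) the total number of "positive" values in column "A" (for example) per day and ID. Ultimately, I want to populate a data frame giving the number of positives and the total number of records for each day and ID, e.g.,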

Date,ID,Positive,Total
9/1/16,360,10,20
9/2/16,360,12,23
9/2/16,718,2,43
...

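I initially used a double for loop that went through every unique day and ID, but this took far too long. I would appreciate help with a better approach. Thanks in advance for any input!

1 answer:

Answer 0 (score: 1)

I took the data you provided and made a small .csv file so you can reproduce this. I also changed a few values to test that it works: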

Index,Date,City,State,ID,County,Age,A,B,C
0,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,9/1/16,X,AL,360,BB County,1.0,positive,negative,negative
2,9/1/16,X,AL,360,BB County,10.0,positive,negative,negative
3,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,9/2/16,X,AR,728,JJ County,3.0,negative,negative,negative
6,9/2/16,X,AR,728,JJ County,8.0,positive,negative,negative
7,9/2/16,X,AR,728,JJ County,8.0,negative,negative,negative
8,9/3/16,X,AR,728,JJ County,14.0,negative,negative,negative
9,9/3/16,X,AR,728,JJ County,5.0,negative,negative,negative

Once you read it in, take a look at the data:

>>> import pandas as pd
>>> X = pd.read_csv('data.csv', header=0, index_col=None).drop('Index', axis=1)
>>> print(X)
     Date  City  State   ID     County   Age         A         B         C
0  9/1/16     X     AL  360  BB County  29.0  negative  positive  positive
1  9/1/16     X     AL  360  BB County   1.0  positive  negative  negative
2  9/1/16     X     AL  360  BB County  10.0  positive  negative  negative
3  9/1/16     X     AL  360  BB County  11.0  negative  negative  negative
4  9/1/16     X     AR  718  LL County  67.0  negative  negative  negative
5  9/2/16     X     AR  728  JJ County   3.0  negative  negative  negative
6  9/2/16     X     AR  728  JJ County   8.0  positive  negative  negative
7  9/2/16     X     AR  728  JJ County   8.0  negative  negative  negative
8  9/3/16     X     AR  728  JJ County  14.0  negative  negative  negative
9  9/3/16     X     AR  728  JJ County   5.0  negative  negative  negative

This is the function that gets applied to each group in the groupby call:

def _ct_id_pos(grp):
    # Return (rows where A is 'positive', total rows) for one (Date, ID) group
    return grp[grp.A == 'positive'].shape[0], grp.shape[0]

This is a two-step process. With pandas, you can group by multiple columns and apply the function above:

# the following will have the tuple in one column
>>> X_prime = X.groupby(['Date', 'ID']).apply(_ct_id_pos).reset_index()
>>> print(X_prime)
     Date   ID       0
0  9/1/16  360  (2, 4)
1  9/1/16  718  (0, 1)
2  9/2/16  728  (1, 3)
3  9/3/16  728  (0, 2)

Note that the result of the groupby gives us a new column of embedded tuples, so the next step is to split those into their own columns and drop the embedded one.
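
The splitting step is not shown here, so what follows is only a sketch of one way it might be done, not necessarily the answerer's own code. It inlines the sample CSV so it runs on its own, and the names `counts`, `result`, and `alt` are made up for illustration. The second half is an assumed alternative that skips the tuple column entirely by aggregating a boolean column, which may scale better on ~10^6 rows than a Python-level `apply`.

```python
import io

import pandas as pd

# Sample data from the answer, inlined so this sketch is self-contained.
CSV = """Index,Date,City,State,ID,County,Age,A,B,C
0,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,9/1/16,X,AL,360,BB County,1.0,positive,negative,negative
2,9/1/16,X,AL,360,BB County,10.0,positive,negative,negative
3,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,9/2/16,X,AR,728,JJ County,3.0,negative,negative,negative
6,9/2/16,X,AR,728,JJ County,8.0,positive,negative,negative
7,9/2/16,X,AR,728,JJ County,8.0,negative,negative,negative
8,9/3/16,X,AR,728,JJ County,14.0,negative,negative,negative
9,9/3/16,X,AR,728,JJ County,5.0,negative,negative,negative
"""

X = pd.read_csv(io.StringIO(CSV)).drop('Index', axis=1)


def _ct_id_pos(grp):
    # (rows where A is 'positive', total rows) for one (Date, ID) group
    return grp[grp.A == 'positive'].shape[0], grp.shape[0]


# Step 1 (from the answer): one column of (positive, total) tuples.
X_prime = X.groupby(['Date', 'ID']).apply(_ct_id_pos).reset_index()

# Step 2 (sketch): expand the tuple column into two named columns.
# The tuple column is the last one (labelled 0 in the printed output).
tuple_col = X_prime.columns[-1]
counts = pd.DataFrame(
    X_prime[tuple_col].tolist(),
    columns=['Positive', 'Total'],
    index=X_prime.index,
)
result = pd.concat([X_prime.drop(columns=tuple_col), counts], axis=1)

# Assumed alternative: flag positives as booleans, then sum and count per group.
alt = (
    X.assign(Positive=X['A'].eq('positive'))
     .groupby(['Date', 'ID'])['Positive']
     .agg(['sum', 'size'])
     .rename(columns={'sum': 'Positive', 'size': 'Total'})
     .reset_index()
)

print(result)
print(alt)
```

Both `result` and `alt` end up with the columns Date, ID, Positive, Total, which is the layout asked for in the question. Recent pandas versions may emit a deprecation warning about grouping columns in `apply`; the alternative does not go through `apply` at all.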