Pandas: How to count occurrences of a categorical feature grouped by ID

Time: 2017-06-02 05:34:12

Tags: python pandas numpy

Suppose I have this DataFrame df:

My_ID   My_CAT
  1       A
  2       B  
  3       C
  1       A  
  1       B 
  2       D 

For each distinct My_ID, I want to know the number of occurrences of each distinct My_CAT value.

I need a dense array like this:
My_ID   A    B    C   D
  1     2    1    0   0
  2     0    1    0   1
  3     0    0    1   0

I tried

df.groupby(['My_ID','My_CAT']).count()

but although the data is grouped the way I need, the occurrences are not counted.

2 answers:

Answer 0 (score: 3)

Use crosstab (less typing, but the slowest):

df = pd.crosstab(df['My_ID'], df['My_CAT'])
print (df)
My_CAT  A  B  C  D
My_ID             
1       2  1  0  0
2       0  1  0  1
3       0  0  1  0

A faster solution uses groupby + size + unstack:

df = df.groupby(['My_ID','My_CAT']).size().unstack(fill_value=0)
print (df)
My_CAT  A  B  C  D
My_ID             
1       2  1  0  0
2       0  1  0  1
3       0  0  1  0

Finally, to turn My_ID back into a regular column and drop the columns name:

df = df.reset_index().rename_axis(None, axis=1)
print (df)
   My_ID  A  B  C  D
0      1  2  1  0  0
1      2  0  1  0  1
2      3  0  0  1  0
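
As a quick check, the two approaches above can be compared on the sample data from the question. This is a minimal sketch; the variable names by_crosstab, by_groupby, same and flat are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'My_ID': [1, 2, 3, 1, 1, 2],
    'My_CAT': ['A', 'B', 'C', 'A', 'B', 'D'],
})

by_crosstab = pd.crosstab(df['My_ID'], df['My_CAT'])
by_groupby = df.groupby(['My_ID', 'My_CAT']).size().unstack(fill_value=0)

# Both results are indexed by My_ID with one column per My_CAT value.
same = bool((by_crosstab.values == by_groupby.values).all())

# Flatten: My_ID becomes a column and the "My_CAT" columns name is removed.
flat = by_groupby.reset_index().rename_axis(None, axis=1)
print(same)
print(flat)
```

Keeping the intermediate results in separate variables, rather than reassigning df, lets both approaches run against the same input.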

Note:

What is the difference between size and count in pandas?
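
The distinction behind that linked question can be sketched as follows: size counts rows per group, while count counts non-null values in the remaining columns. The extra column val is a hypothetical addition for illustration; the question's frame has only the two key columns, which is why count() returned nothing to look at.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an extra column "val" that contains a missing value.
df = pd.DataFrame({
    'My_ID': [1, 1, 2],
    'My_CAT': ['A', 'A', 'B'],
    'val': [1.0, np.nan, 3.0],
})

g = df.groupby(['My_ID', 'My_CAT'])
sizes = g.size()            # rows per group, NaN rows included
counts = g['val'].count()   # non-null "val" entries per group

# With only the two key columns, no column is left for count() to count,
# so the result has the group index but zero columns.
empty = df[['My_ID', 'My_CAT']].groupby(['My_ID', 'My_CAT']).count()

print(sizes)
print(counts)
print(empty.shape)
```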

Timings (larger data):

np.random.seed(123)
N = 100000
L = list('abcdefghijklmno')
df = pd.DataFrame({'My_CAT': np.random.choice(L, N),
                   'My_ID':np.random.randint(1000,size=N)})
print (df)

In [79]: %timeit pd.crosstab(df['My_ID'], df['My_CAT'])
10 loops, best of 3: 96.7 ms per loop

In [80]: %timeit df.groupby(['My_ID','My_CAT']).size().unstack(fill_value=0)
100 loops, best of 3: 14.2 ms per loop

In [81]: %timeit pd.get_dummies(df.My_CAT).groupby(df.My_ID).sum()
10 loops, best of 3: 25.5 ms per loop

In [82]: %timeit df.groupby('My_ID').My_CAT.value_counts().unstack(fill_value=0)
10 loops, best of 3: 25.4 ms per loop

In [136]: %timeit xtab_df(df, 'My_ID', 'My_CAT')
100 loops, best of 3: 4.23 ms per loop

In [137]: %timeit xtab(df, 'My_ID', 'My_CAT')
100 loops, best of 3: 4.61 ms per loop

Answer 1 (score: 2)

pd.get_dummies + groupby

pd.get_dummies(df.My_CAT).groupby(df.My_ID).sum().reset_index()

   My_ID  A  B  C  D
0      1  2  1  0  0
1      2  0  1  0  1
2      3  0  0  1  0
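
To see why this works, it may help to look at the intermediate indicator matrix. The sketch below uses the question's sample data; passing dtype=int is my own choice, so the indicators are integers regardless of the default dtype in the installed pandas version.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'My_ID': [1, 2, 3, 1, 1, 2],
    'My_CAT': ['A', 'B', 'C', 'A', 'B', 'D'],
})

# One 0/1 indicator column per category, one row per original row.
dummies = pd.get_dummies(df['My_CAT'], dtype=int)

# Summing the indicators within each My_ID yields the per-ID counts.
counts = dummies.groupby(df['My_ID']).sum()

print(dummies)
print(counts)
```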

groupby + value_counts

df.groupby('My_ID').My_CAT.value_counts() \
  .unstack(fill_value=0).rename_axis(None, axis=1).reset_index()

   My_ID  A  B  C  D
0      1  2  1  0  0
1      2  0  1  0  1
2      3  0  0  1  0

factorize + numba
This is my experimental proposal.

from numba import njit
import pandas as pd
import numpy as np

@njit
def xtab_array(f1, f2, m, n):
    v = np.zeros((m, n), dtype=np.int64)  # m x n table of counts
    for i in range(f1.size):
        v[f1[i], f2[i]] += 1
    return v

def xtab_df(df, c1, c2):
    f1, u1 = pd.factorize(df[c1].values)
    f2, u2 = pd.factorize(df[c2].values)
    v = xtab_array(f1, f2, u1.size, u2.size)
    return pd.DataFrame(
        np.column_stack([u1, v]), columns=[c1] + u2.tolist()
    )

xtab_df(df, 'My_ID', 'My_CAT')

   My_ID  A  B  C  D
0      1  2  1  0  0
1      2  0  1  0  1
2      3  0  0  1  0
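
For readers without numba, the same idea can be sketched in plain NumPy: pd.factorize maps each value to an integer code, and the codes index directly into a count table. Using np.add.at here is my substitution for the compiled loop, not the answer's method, and the names codes_id, codes_cat and table are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'My_ID': [1, 2, 3, 1, 1, 2],
    'My_CAT': ['A', 'B', 'C', 'A', 'B', 'D'],
})

# Codes are assigned in order of first appearance, not in sorted order.
codes_id, uniq_id = pd.factorize(df['My_ID'].values)
codes_cat, uniq_cat = pd.factorize(df['My_CAT'].values)

table = np.zeros((uniq_id.size, uniq_cat.size), dtype=np.int64)
# np.add.at accumulates unbuffered, so repeated (row, col) pairs are each
# counted; plain table[codes_id, codes_cat] += 1 would count them only once.
np.add.at(table, (codes_id, codes_cat), 1)

print(codes_id, uniq_id)
print(codes_cat, uniq_cat)
print(table)
```

np.add.at is usually much slower than a compiled loop, so this version is meant to show the logic rather than to match the timings.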

numpy

def xtab(df, c1, c2):
    f1, u1 = pd.factorize(df[c1].values)
    f2, u2 = pd.factorize(df[c2].values)
    n, m = u1.size, u2.size
    # minlength pads the counts to n * m entries and keeps them integer
    v = np.bincount(f1 * m + f2, minlength=n * m).reshape(n, m)
    return pd.DataFrame(
        np.column_stack([u1, v]), columns=[c1] + u2.tolist()
    )

xtab(df, 'My_ID', 'My_CAT')

   My_ID  A  B  C  D
0      1  2  1  0  0
1      2  0  1  0  1
2      3  0  0  1  0
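
The key step in this version is the flat index f1 * m + f2, which maps each (row, column) pair to its position in a row-major table with m columns. A small worked example, with the codes written out by hand as pd.factorize would produce them for the sample data:

```python
import numpy as np

# Integer codes for the question's sample data.
f1 = np.array([0, 1, 2, 0, 0, 1])   # My_ID codes
f2 = np.array([0, 1, 2, 0, 1, 3])   # My_CAT codes
n, m = 3, 4                         # number of unique IDs and categories

# Position of cell (row, col) in a row-major n x m table.
flat = f1 * m + f2

# Count how often each flat position occurs, then restore the 2-D shape.
counts = np.bincount(flat, minlength=n * m).reshape(n, m)

print(flat)
print(counts)
```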

Timing
Small data

%timeit pd.crosstab(df['My_ID'], df['My_CAT'])
%timeit df.groupby(['My_ID','My_CAT']).size().unstack(fill_value=0)
%timeit pd.get_dummies(df.My_CAT).groupby(df.My_ID).sum()
%timeit df.groupby('My_ID').My_CAT.value_counts().unstack(fill_value=0)
%timeit xtab_df(df, 'My_ID', 'My_CAT')
%timeit xtab(df, 'My_ID', 'My_CAT')

100 loops, best of 3: 5.21 ms per loop
1000 loops, best of 3: 1.23 ms per loop
1000 loops, best of 3: 1.2 ms per loop
1000 loops, best of 3: 1.23 ms per loop
1000 loops, best of 3: 280 µs per loop
1000 loops, best of 3: 298 µs per loop

@jezrael's larger data

np.random.seed(123)
N = 100000
L = list('abcdefghijklmno')
df = pd.DataFrame({'My_CAT': np.random.choice(L, N),
                   'My_ID':np.random.randint(1000,size=N)})

%timeit pd.crosstab(df['My_ID'], df['My_CAT'])
%timeit df.groupby(['My_ID','My_CAT']).size().unstack(fill_value=0)
%timeit pd.get_dummies(df.My_CAT).groupby(df.My_ID).sum()
%timeit df.groupby('My_ID').My_CAT.value_counts().unstack(fill_value=0)
%timeit xtab_df(df, 'My_ID', 'My_CAT')
%timeit xtab(df, 'My_ID', 'My_CAT')

10 loops, best of 3: 82.6 ms per loop
100 loops, best of 3: 10.7 ms per loop
100 loops, best of 3: 15.6 ms per loop
10 loops, best of 3: 19.9 ms per loop
100 loops, best of 3: 3.01 ms per loop
100 loops, best of 3: 3.22 ms per loop
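
The %timeit magic only works in IPython. To repeat a comparison like this from a plain script, a rough sketch with the standard timeit module could look like the following. The helper best_ms and the dictionary candidates are hypothetical names, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same data generation as the larger benchmark above.
np.random.seed(123)
N = 100000
L = list('abcdefghijklmno')
df = pd.DataFrame({'My_CAT': np.random.choice(L, N),
                   'My_ID': np.random.randint(1000, size=N)})

candidates = {
    'crosstab': lambda: pd.crosstab(df['My_ID'], df['My_CAT']),
    'groupby_size': lambda: df.groupby(['My_ID', 'My_CAT']).size().unstack(fill_value=0),
    'value_counts': lambda: df.groupby('My_ID')['My_CAT'].value_counts().unstack(fill_value=0),
}


def best_ms(fn, repeat=3, number=5):
    """Best average time per call in milliseconds (hypothetical helper)."""
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number * 1000


results = {name: best_ms(fn) for name, fn in candidates.items()}
for name, ms in sorted(results.items(), key=lambda kv: kv[1]):
    print(f'{name}: {ms:.2f} ms')
```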