Pandas: long processing time with groupby and comparisons

Date: 2016-03-21 13:32:16

Tags: python pandas

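I'm new to pandas, but with help from Stack Overflow I've gotten things working. The code below currently works, but it takes about 30 minutes (the dataset is fairly large). Is there a way to speed it up? Basically I'm trying to count, per id, various combinations of the 'Status' column against the 'Current_Status' column. Thanks!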

import numpy as np
import pandas as pd

df_new = df.groupby('id').apply(lambda x: pd.Series(dict(
    # note: `!= np.nan` is True for every row, so this counts all rows
    new_col1=(x['foo'] != np.nan).sum(),
    new_col2=(x['bar'] == 'P').sum(),
    new_col3=(x['bar'] == 'C').sum(),
    new_col4=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, paid')).sum(),
    new_col5=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, not yet paid')).sum(),
    new_col6=((x['Status']=='Approved, paid') & (x['Current_Status']=='Approved, paid')).sum()
)))

Example of the df structure:

In[15]: df.head(6)
Out[15]:
   id   foo  bar  Status                   Current_Status
0  1    23   'C'  'Approved, paid'         'Approved, paid'
1  1    63   'P'  'Approved, not yet paid' 'Approved, paid'
2  1    84   'P'  'Approved, paid'         'Approved, paid'
3  1    125  'P'  'Approved, not yet paid' 'Approved, not yet paid'
4  1    216  'P'  'Approved, not yet paid' 'Approved, paid'
5  1    12   'C'  'Approved, paid'         'Approved, paid'

1 answer:

Answer 0 (score: 1)

You can try notnull and numpy.in1d (newer NumPy versions prefer numpy.isin for the same membership test):

df_new1 = df.groupby('id').apply(lambda x: pd.Series(dict(
 new_col1=(x['foo'].notnull()).sum(),
 new_col2=np.in1d(x['bar'],'P').sum(),
 new_col3=np.in1d(x['bar'],'C').sum(),
 new_col4=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
 new_col5=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, not yet paid'])).sum(),
 new_col6=(np.in1d(x['Status'],['Approved, paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum()
)))
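The switch from `!= np.nan` to notnull is a correctness fix as well as a speed-up. A minimal sketch of the difference, using a made-up three-value column (the Series and the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value, for illustration only.
foo = pd.Series([23.0, np.nan, 84.0])

# NaN compares unequal to everything, including itself, so `!= np.nan`
# is True on every row and the sum simply counts rows.
always_true_count = int((foo != np.nan).sum())

# notnull() flags only the rows that actually hold a value.
non_missing_count = int(foo.notnull().sum())

print(always_true_count, non_missing_count)
```

So if foo can contain missing values, the original new_col1 reports the group size rather than the number of non-missing entries.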

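Another, faster solution is to convert the values to 0 and 1 with factorize, create the inverted columns with abs, and finish with a single groupby and sum. This relies on each of these columns holding exactly two distinct values, because factorize numbers the values in order of first appearance: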

df['new_col1'] = df['foo'].notnull().astype(int)
df['new_col2'] = df['bar'].factorize()[0]
df['new_col3'] = (df['new_col2'] - 1).abs()
df['Status'] =  df['Status'].factorize()[0]
df['invertStatus'] = (df['Status'] - 1).abs()
df['Current_Status'] = df['Current_Status'].factorize()[0]
df['invertCurrent_Status'] = (df['Current_Status'] - 1).abs()

df['new_col4'] = df['Status'] & df['invertCurrent_Status']
df['new_col5'] = df['Status'] & df['Current_Status']
df['new_col6'] = df['invertStatus'] & df['invertCurrent_Status']

cols = ['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']
print(df.groupby('id')[cols].sum())
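To see what the factorize and abs steps do, here is a small sketch on a made-up 'bar' column (the data and variable names are assumptions for illustration). It also shows why the trick depends on row order:

```python
import pandas as pd

# Hypothetical two-valued column, starting with 'C' like the sample data.
bar = pd.Series(['C', 'P', 'P', 'P', 'C'])

# factorize() numbers values in order of first appearance: 'C' -> 0, 'P' -> 1.
codes, uniques = bar.factorize()
is_p = pd.Series(codes)

# With only the codes 0 and 1, (x - 1).abs() swaps them: 0 -> 1 and 1 -> 0.
is_c = (is_p - 1).abs()

print(list(uniques), int(is_p.sum()), int(is_c.sum()))

# Caveat: if the column starts with 'P', the codes come out the other way round.
flipped_codes, flipped_uniques = pd.Series(['P', 'C', 'C']).factorize()
print(list(flipped_uniques), list(flipped_codes))
```

If a column can start with either value, or can hold more than two values, comparing against the literal string is the safer choice.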

Or you can create boolean Series, which is the fastest solution:

df['new_col1'] = df['foo'].notnull()
df['new_col2'] = np.in1d(df['bar'], 'P')
df['new_col3'] = ~df['new_col2']
Status =  np.in1d(df['Status'],'Approved, not yet paid')
invertStatus = ~Status
Current_Status = np.in1d(df['Current_Status'],'Approved, not yet paid')
invertCurrent_Status = ~Current_Status

df['new_col4'] = Status & invertCurrent_Status
df['new_col5'] = Status & Current_Status
df['new_col6'] = invertStatus & invertCurrent_Status
#print df

cols = ['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']
print(df.groupby('id')[cols].sum().astype(int))
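The boolean approach can also be written with plain `==` comparisons and without modifying df. This is a variant sketch, not the exact code timed in this answer. The five-row frame, the PAID / NOT_PAID constants and the flags name are made up for illustration:

```python
import numpy as np
import pandas as pd

PAID = 'Approved, paid'
NOT_PAID = 'Approved, not yet paid'

# Hypothetical small frame with one missing 'foo' value.
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'foo': [23, np.nan, 84, 125, 12],
    'bar': list('CPPPC'),
    'Status': [PAID, NOT_PAID, PAID, NOT_PAID, NOT_PAID],
    'Current_Status': [PAID, PAID, PAID, NOT_PAID, PAID],
})

# Build every flag once over the whole frame (vectorized), then sum per id.
flags = pd.DataFrame({
    'id': df['id'],
    'new_col1': df['foo'].notnull(),
    'new_col2': df['bar'] == 'P',
    'new_col3': df['bar'] == 'C',
    'new_col4': (df['Status'] == NOT_PAID) & (df['Current_Status'] == PAID),
    'new_col5': (df['Status'] == NOT_PAID) & (df['Current_Status'] == NOT_PAID),
    'new_col6': (df['Status'] == PAID) & (df['Current_Status'] == PAID),
})

# Summing booleans counts the True values in each group.
result = flags.groupby('id').sum().astype(int)
print(result)
```

Spelling out each comparison avoids relying on the columns having only two distinct values, which the `~` inversion assumes.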

Timings:

In [25]: len(df)
Out[25]: 110000

In [26]: %timeit a(df)
10 loops, best of 3: 24.7 ms per loop

In [27]: %timeit b(df1)
10 loops, best of 3: 39.3 ms per loop

In [28]: %timeit c(df2)
10 loops, best of 3: 46 ms per loop

In [29]: %timeit d(df3)
10 loops, best of 3: 103 ms per loop

Code:

df = pd.concat([df]*10000).reset_index(drop=True)    
#print df
df1,df2,df3 = df.copy(), df.copy(), df.copy()


def a(df):
    df['new_col1'] = df['foo'].notnull()
    df['new_col2'] = np.in1d(df['bar'], 'P')
    df['new_col3'] = ~df['new_col2']
    Status =  np.in1d(df['Status'],'Approved, not yet paid')
    invertStatus = ~Status
    Current_Status = np.in1d(df['Current_Status'],'Approved, not yet paid')
    invertCurrent_Status = ~Current_Status
    df['new_col4'] = Status & invertCurrent_Status
    df['new_col5'] = Status & Current_Status
    df['new_col6'] = invertStatus & invertCurrent_Status
    #print df
    return df.groupby('id').sum()[['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']].astype(int)

def b(df):
    df['new_col1'] = df['foo'].notnull().astype(int)
    df['new_col2'] = df['bar'].factorize()[0]
    df['new_col3'] = (df['new_col2'] - 1).abs()
    df['Status'] =  df['Status'].factorize()[0]
    df['invertStatus'] = (df['Status'] - 1).abs()
    df['Current_Status'] = df['Current_Status'].factorize()[0]
    df['invertCurrent_Status'] = (df['Current_Status'] - 1).abs()

    df['new_col4'] = df['Status'] & df['invertCurrent_Status']
    df['new_col5'] = df['Status'] & df['Current_Status']
    df['new_col6'] = df['invertStatus'] & df['invertCurrent_Status']

    return df.groupby('id').sum()[['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']]

def c(df):
    return df.groupby('id').apply(lambda x: pd.Series(dict(
        new_col1=(x['foo'].notnull()).sum(),
        new_col2=np.in1d(x['bar'],'P').sum(),
        new_col3=np.in1d(x['bar'],'C').sum(),
        new_col4=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
        new_col5=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, not yet paid'])).sum(),
        new_col6=(np.in1d(x['Status'],['Approved, paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
    )))

def d(df):
    return df.groupby('id').apply(lambda x: pd.Series(dict(
        new_col1=(x['foo'] != np.nan).sum(),
        new_col2=(x['bar'] == 'P').sum(),
        new_col3=(x['bar'] == 'C').sum(),
        new_col4=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, paid')).sum(),
        new_col5=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, not yet paid')).sum(),
        new_col6=((x['Status']=='Approved, paid') & (x['Current_Status']=='Approved, paid')).sum()
    )))
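Outside IPython the %timeit magic is not available. A rough equivalent with the standard timeit module might look like the sketch below. The frame, its size and the function names per_group and vectorized are assumptions for illustration, and the absolute numbers will differ by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical frame: 500 ids with four rows each, two of them 'P'.
frame = pd.DataFrame({
    'id': np.repeat(np.arange(500), 4),
    'bar': np.tile(['C', 'P', 'P', 'C'], 500),
})

def per_group(data):
    # Runs a Python-level lambda once per group.
    return data.groupby('id')['bar'].apply(lambda s: (s == 'P').sum())

def vectorized(data):
    # Compares the whole column once, then sums the booleans per id.
    return (data['bar'] == 'P').groupby(data['id']).sum()

# Total seconds for five calls of each function.
t_per_group = timeit.timeit(lambda: per_group(frame), number=5)
t_vectorized = timeit.timeit(lambda: vectorized(frame), number=5)

print('per_group:  %.4f s' % t_per_group)
print('vectorized: %.4f s' % t_vectorized)
```

Both functions return the same counts, so they can be compared directly before trusting either timing.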

Test DataFrame:

    id  foo bar                  Status          Current_Status
0    1   23   C          Approved, paid          Approved, paid
1    1   63   P  Approved, not yet paid          Approved, paid
2    1   84   P          Approved, paid          Approved, paid
3    1  125   P  Approved, not yet paid  Approved, not yet paid
4    1   12   C          Approved, paid          Approved, paid
5    2   23   C          Approved, paid          Approved, paid
6    2   63   P  Approved, not yet paid          Approved, paid
7    2   84   P          Approved, paid          Approved, paid
8    2  125   P  Approved, not yet paid  Approved, not yet paid
9    2  216   P  Approved, not yet paid          Approved, paid
10   2   12   C          Approved, paid          Approved, paid
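For anyone who wants to reproduce this, the frame printed above can be rebuilt roughly as follows. The PAID / NOT_PAID constants and the name big are only shorthand introduced here. The replication step is the same pd.concat call used in the timing code:

```python
import pandas as pd

PAID = 'Approved, paid'
NOT_PAID = 'Approved, not yet paid'

# The 11-row test frame shown above: five rows for id 1, six for id 2.
df = pd.DataFrame({
    'id': [1] * 5 + [2] * 6,
    'foo': [23, 63, 84, 125, 12, 23, 63, 84, 125, 216, 12],
    'bar': list('CPPPC') + list('CPPPPC'),
    'Status': [PAID, NOT_PAID, PAID, NOT_PAID, PAID,
               PAID, NOT_PAID, PAID, NOT_PAID, NOT_PAID, PAID],
    'Current_Status': [PAID, PAID, PAID, NOT_PAID, PAID,
                       PAID, PAID, PAID, NOT_PAID, PAID, PAID],
})

# Repeat the frame 10000 times to get the size used for the timings.
big = pd.concat([df] * 10000).reset_index(drop=True)

print(len(df), len(big))
```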