在许多小组中优化熊猫组

时间:2014-06-25 15:10:05

标签: python optimization pandas

我有一个包含许多小组的pandas DataFrame:

In [84]: n=10000

In [85]: df=pd.DataFrame({'group':sorted(range(n)*4),'val':np.random.randint(6,size=4*n)}).sort(['group','val']).reset_index(drop=True)

In [86]: df.head(9)
Out[86]: 
   group  val
0      0    0
1      0    0
2      0    1
3      0    2
4      1    1
5      1    2
6      1    2
7      1    4
8      2    0

我想为那些出现val == 1而不是val == 0的组做些特别的事情。例如。只有当val == 0在该组中时,才将组中的1替换为99。

但是对于这个大小的DataFrame来说它很慢:

In [87]: def f(s):
   ....: if (0 not in s) and (1 in s): s[s==1]=99
   ....: return s
   ....: 

In [88]: %timeit df.groupby('group')['val'].transform(f)
1 loops, best of 3: 11.2 s per loop

循环数据框架更加丑陋但速度更快:

In [89]: %paste

def g(df):
    df.sort(['group','val'],inplace=True)
    last_g=-1
    for i in xrange(len(df)):
        if df.group.iloc[i]!=last_g:
            has_zero=False
        if df.val.iloc[i]==0:
            has_zero=True
        elif has_zero and df.val.iloc[i]==1:
            df.val.iloc[i]=99
    return df
## -- End pasted text --

In [90]: %timeit g(df)
1 loops, best of 3: 2.53 s per loop

但如果可能的话,我想进一步优化它。

知道怎么做吗?

由于


根据杰夫的回答,我得到了一个非常快的解决方案。如果其他人认为它有用,我会把它放在这里:

In [122]: def do_fast(df):
   .....: has_zero_mask=df.group.isin(df[df.val==0].group.unique())
   .....: df.val[(df.val==1) & has_zero_mask]=99
   .....: return df
   .....: 

In [123]: %timeit do_fast(df)
100 loops, best of 3: 11.2 ms per loop

1 个答案:

答案 0 :(得分:1)

不是100%确定这是你想要的,但应该很容易有不同的过滤/设置标准

In [37]: pd.set_option('max_rows',10)

In [38]: np.random.seed(1234)

In [39]: def f():

           # create the frame
           df=pd.DataFrame({'group':sorted(range(n)*4),
                                 'val':np.random.randint(6,size=4*n)}).sort(['group','val']).reset_index(drop=True)


           df['result'] = np.nan

           # Create a count per group
           df['counter'] = df.groupby('group').cumcount()

           # select which values you want, returning the indexes of those
           mask = df[df.val==1].groupby('group').grouper.group_info[0]

           # set em
           df.loc[df.index.isin(mask) & df['counter'] == 1,'result'] = 99


In [40]: %timeit f()
10 loops, best of 3: 95 ms per loop

In [41]: df
Out[41]: 
       group  val  result  counter
0          0    3     NaN        0
1          0    4      99        1
2          0    4     NaN        2
3          0    5      99        3
4          1    0     NaN        0
...      ...  ...     ...      ...
39995   9998    4     NaN        3
39996   9999    0     NaN        0
39997   9999    0     NaN        1
39998   9999    2     NaN        2
39999   9999    3     NaN        3

[40000 rows x 4 columns]