I have a pandas DataFrame that contains many small groups:
In [84]: n=10000
In [85]: df=pd.DataFrame({'group':sorted(range(n)*4),'val':np.random.randint(6,size=4*n)}).sort(['group','val']).reset_index(drop=True)
In [86]: df.head(9)
Out[86]:
   group  val
0      0    0
1      0    0
2      0    1
3      0    2
4      1    1
5      1    2
6      1    2
7      1    4
8      2    0
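I want to do something special for those groups where val == 1 occurs but val == 0 does not. For example, replace the 1s in a group with 99 only if val == 0 is in that group.
But the groupby transform is slow for a DataFrame of this size: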
In [87]: def f(s):
   ....:     if (0 not in s) and (1 in s): s[s==1]=99
   ....:     return s
   ....:
In [88]: %timeit df.groupby('group')['val'].transform(f)
1 loops, best of 3: 11.2 s per loop
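A note on reading f above: for a pandas Series, the `in` operator tests index labels, not values, so `0 not in s` and `1 in s` do not inspect the val entries of the group. A minimal sketch with made-up data (the Series `s` and the variable names are illustrative, not from the question):

```python
import pandas as pd

# A Series whose index labels differ from its values (hypothetical data).
s = pd.Series([0, 1, 2], index=[10, 11, 12])

# `in` on a Series looks at the index labels.
label_hit = 0 in s

# To test the values, go through .values or use isin().
value_hit = 0 in s.values
value_hit_isin = bool(s.isin([0]).any())

print(label_hit, value_hit, value_hit_isin)  # False True True
```

Inside a groupby transform each group keeps the row labels of the original frame, so a label test like this only happens to match for rows labelled 0 or 1.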
Looping over the DataFrame is uglier but much faster:
In [89]: %paste
def g(df):
    df.sort(['group','val'],inplace=True)
    last_g=-1
    for i in xrange(len(df)):
        if df.group.iloc[i]!=last_g:
            # a new group starts here: reset the flag and remember the group
            has_zero=False
            last_g=df.group.iloc[i]
        if df.val.iloc[i]==0:
            has_zero=True
        elif has_zero and df.val.iloc[i]==1:
            df.val.iloc[i]=99
    return df
## -- End pasted text --
In [90]: %timeit g(df)
1 loops, best of 3: 2.53 s per loop
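But I would like to optimize it further if possible.
Any idea how to do that?
Thanks.
Based on Jeff's answer, I arrived at a very fast solution. I am putting it here in case others find it useful: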
In [122]: def do_fast(df):
   .....:     has_zero_mask=df.group.isin(df[df.val==0].group.unique())
   .....:     df.val[(df.val==1) & has_zero_mask]=99
   .....:     return df
   .....:
In [123]: %timeit do_fast(df)
100 loops, best of 3: 11.2 ms per loop
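The session above uses Python 2 and an older pandas (`range(n)*4`, `DataFrame.sort`, assignment through `df.val[...]`). A minimal sketch of the same masking idea for current Python 3 and pandas follows. The names `make_frame` and `replace_ones_where_zero` are made up for this illustration, and it returns a copy rather than modifying the frame in place, which is a design choice of the sketch, not of do_fast:

```python
import numpy as np
import pandas as pd

def make_frame(n, seed=0):
    # Same shape as the question's data: n groups of 4 rows, values in 0..5.
    rng = np.random.RandomState(seed)
    df = pd.DataFrame({
        "group": np.repeat(np.arange(n), 4),
        "val": rng.randint(6, size=4 * n),
    })
    return df.sort_values(["group", "val"]).reset_index(drop=True)

def replace_ones_where_zero(df):
    # Work on a copy so the caller's frame is left untouched.
    out = df.copy()
    # Groups that contain at least one 0.
    groups_with_zero = out.loc[out["val"] == 0, "group"].unique()
    has_zero = out["group"].isin(groups_with_zero)
    # A single .loc assignment instead of chained indexing.
    out.loc[(out["val"] == 1) & has_zero, "val"] = 99
    return out

small = pd.DataFrame({"group": [0, 0, 0, 1, 1, 1],
                      "val":   [0, 1, 2, 1, 1, 3]})
result = replace_ones_where_zero(small)
print(result["val"].tolist())  # [0, 99, 2, 1, 1, 3]
```

Group 0 contains a 0, so its 1 becomes 99; group 1 has no 0, so its 1s are left alone. No timing is claimed for this version.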
Answer 0 (score: 1)
Not 100% sure this is what you are after, but it should be easy to swap in different filtering/setting criteria.
In [37]: pd.set_option('max_rows',10)
In [38]: np.random.seed(1234)
In [39]: def f():
    # create the frame
    df=pd.DataFrame({'group':sorted(range(n)*4),
                     'val':np.random.randint(6,size=4*n)}).sort(['group','val']).reset_index(drop=True)
    df['result'] = np.nan
    # Create a count per group
    df['counter'] = df.groupby('group').cumcount()
    # select which values you want, returning the indexes of those
    mask = df[df.val==1].groupby('group').grouper.group_info[0]
    # set em
    df.loc[df.index.isin(mask) & df['counter'] == 1,'result'] = 99
In [40]: %timeit f()
10 loops, best of 3: 95 ms per loop
In [41]: df
Out[41]:
       group  val  result  counter
0          0    3     NaN        0
1          0    4      99        1
2          0    4     NaN        2
3          0    5      99        3
4          1    0     NaN        0
...      ...  ...     ...      ...
39995   9998    4     NaN        3
39996   9999    0     NaN        0
39997   9999    0     NaN        1
39998   9999    2     NaN        2
39999   9999    3     NaN        3

[40000 rows x 4 columns]
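One thing to watch when adapting the last line of f(): in Python, `&` binds more tightly than `==`, so `a & b == 1` is evaluated as `(a & b) == 1`. That may explain why rows with counter 1 and counter 3 are both set to 99 in the output above, although this has not been verified against that session. A minimal sketch with made-up arrays (`in_mask` and `counter` are illustrative stand-ins, not the answer's data):

```python
import numpy as np

in_mask = np.array([True, True, False, False])
counter = np.array([1, 3, 1, 3])

# Without parentheses this is (in_mask & counter) == 1,
# a bitwise AND followed by a comparison.
unparenthesized = in_mask & counter == 1

# With parentheses the comparison happens first,
# then the two boolean masks are combined.
parenthesized = in_mask & (counter == 1)

print(unparenthesized.tolist())  # [True, True, False, False]
print(parenthesized.tolist())    # [True, False, False, False]
```

With the bitwise AND applied first, every odd counter value passes the test wherever the mask is true, so wrapping each comparison in parentheses is the safer habit when combining conditions.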