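I'm new to pandas, but with help from Stack Overflow I've got things working. The code below currently works, but it takes about 30 minutes (the dataset is fairly large). Is there a way to speed it up? Basically, I'm trying to count various combinations of the 'Status' column against the 'Current_Status' column, per id. Thanks!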
df_new = df.groupby('id').apply(lambda x: pd.Series(dict(
    # note: != np.nan is True for every row, so this counts all rows
    new_col1=(x['foo'] != np.nan).sum(),
    new_col2=(x['bar'] == 'P').sum(),
    new_col3=(x['bar'] == 'C').sum(),
    new_col4=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, paid')).sum(),
    new_col5=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, not yet paid')).sum(),
    new_col6=((x['Status']=='Approved, paid') & (x['Current_Status']=='Approved, paid')).sum()
)))
Example of the df structure:
In[15]: df.head(6)
Out[15]:
id foo bar Status Current_Status
0 1 23 'C' 'Approved, paid' 'Approved, paid'
1 1 63 'P' 'Approved, not yet paid' 'Approved, paid'
2 1 84 'P' 'Approved, paid' 'Approved, paid'
3 1 125 'P' 'Approved, not yet paid' 'Approved, not yet paid'
4 1 216 'P' 'Approved, not yet paid' 'Approved, paid'
5 1 12 'C' 'Approved, paid' 'Approved, paid'
Answer 0 (score: 1)
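You can try notnull and numpy.in1d. Note that x['foo'] != np.nan is True for every row, because NaN never compares equal to anything, so notnull is what actually counts the non-missing values: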
df_new1 = df.groupby('id').apply(lambda x: pd.Series(dict(
new_col1=(x['foo'].notnull()).sum(),
new_col2=np.in1d(x['bar'],'P').sum(),
new_col3=np.in1d(x['bar'],'C').sum(),
new_col4=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
new_col5=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, not yet paid'])).sum(),
new_col6=(np.in1d(x['Status'],['Approved, paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum()
)))
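To see why notnull matters here, a minimal sketch on a made-up three-value Series (the data is hypothetical, not taken from the question) compares the two masks:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value.
foo = pd.Series([23.0, np.nan, 84.0])

# NaN compares unequal to everything, including itself,
# so this mask is True on every row and counts the missing value too.
always_true = foo != np.nan

# notnull() is False exactly where the value is missing.
present = foo.notnull()

print(always_true.sum())  # 3
print(present.sum())      # 2
```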
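Another, faster solution is to convert the values to 0 and 1 with factorize, create the inverted columns with abs, and finally aggregate with groupby and sum. This relies on each column holding exactly two distinct values, and factorize numbers them in order of first appearance: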
df['new_col1'] = df['foo'].notnull().astype(int)
df['new_col2'] = df['bar'].factorize()[0]
df['new_col3'] = (df['new_col2'] - 1).abs()
df['Status'] = df['Status'].factorize()[0]
df['invertStatus'] = (df['Status'] - 1).abs()
df['Current_Status'] = df['Current_Status'].factorize()[0]
df['invertCurrent_Status'] = (df['Current_Status'] - 1).abs()
df['new_col4'] = df['Status'] & df['invertCurrent_Status']
df['new_col5'] = df['Status'] & df['Current_Status']
df['new_col6'] = df['invertStatus'] & df['invertCurrent_Status']
print(df.groupby('id')[['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']].sum())
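The factorize approach depends on which value happens to come first in each column. A small sketch (the two Series below are illustrative, with the first one mirroring the bar column of the sample) shows how the codes are assigned and what changes when the order flips:

```python
import pandas as pd

# Same order as the 'bar' column in the sample: 'C' is seen first.
bar = pd.Series(['C', 'P', 'P', 'P', 'P', 'C'])
codes, uniques = bar.factorize()
print(list(codes))    # [0, 1, 1, 1, 1, 0]  -> 'C' is 0, 'P' is 1
print(list(uniques))  # ['C', 'P']

# If 'P' happened to come first, the meaning of 0 and 1 would swap.
flipped = pd.Series(['P', 'C', 'P'])
codes2, uniques2 = flipped.factorize()
print(list(codes2))   # [0, 1, 0]  -> now 'P' is 0

# An explicit comparison does not depend on row order.
is_p = (flipped == 'P').astype(int)
print(list(is_p))     # [1, 0, 1]
```

So the sum of a factorized column counts whichever value was seen second, and a third distinct value would receive code 2 and break the 0/1 assumption.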
Or you can create boolean Series, which is the fastest solution:
df['new_col1'] = df['foo'].notnull()
df['new_col2'] = np.in1d(df['bar'], 'P')
df['new_col3'] = ~df['new_col2']
Status = np.in1d(df['Status'],'Approved, not yet paid')
invertStatus = ~Status
Current_Status = np.in1d(df['Current_Status'],'Approved, not yet paid')
invertCurrent_Status = ~Current_Status
df['new_col4'] = Status & invertCurrent_Status
df['new_col5'] = Status & Current_Status
df['new_col6'] = invertStatus & invertCurrent_Status
#print df
print(df.groupby('id')[['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']].sum().astype(int))
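A possible variant of the boolean approach, sketched here on the six sample rows from the question, builds every mask with an explicit comparison instead of an inversion. This is an adaptation, not the code that was timed: it leaves df unmodified and stays correct if a column ever holds more than two distinct values. The names PAID, UNPAID, masks and counts are chosen for illustration.

```python
import pandas as pd

PAID = 'Approved, paid'
UNPAID = 'Approved, not yet paid'

# The six sample rows shown in the question.
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 1],
    'foo': [23, 63, 84, 125, 216, 12],
    'bar': ['C', 'P', 'P', 'P', 'P', 'C'],
    'Status': [PAID, UNPAID, PAID, UNPAID, UNPAID, PAID],
    'Current_Status': [PAID, PAID, PAID, UNPAID, PAID, PAID],
})

# One boolean column per condition, computed once over the whole frame.
masks = pd.DataFrame({
    'new_col1': df['foo'].notnull(),
    'new_col2': df['bar'] == 'P',
    'new_col3': df['bar'] == 'C',
    'new_col4': (df['Status'] == UNPAID) & (df['Current_Status'] == PAID),
    'new_col5': (df['Status'] == UNPAID) & (df['Current_Status'] == UNPAID),
    'new_col6': (df['Status'] == PAID) & (df['Current_Status'] == PAID),
})

# Summing booleans per id counts the True values in each group.
counts = masks.groupby(df['id']).sum().astype(int)
print(counts)
```

For id 1 this gives 6, 4, 2, 2, 1 and 3 for new_col1 through new_col6. In recent NumPy releases np.isin is the preferred spelling of np.in1d, so plain == comparisons also avoid depending on either function.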
Timings:
In [25]: len(df)
Out[25]: 110000
In [26]: %timeit a(df)
10 loops, best of 3: 24.7 ms per loop
In [27]: %timeit b(df1)
10 loops, best of 3: 39.3 ms per loop
In [28]: %timeit c(df2)
10 loops, best of 3: 46 ms per loop
In [29]: %timeit d(df3)
10 loops, best of 3: 103 ms per loop
Code:
df = pd.concat([df]*10000).reset_index(drop=True)
#print df
df1,df2,df3 = df.copy(), df.copy(), df.copy()
def a(df):
    df['new_col1'] = df['foo'].notnull()
    df['new_col2'] = np.in1d(df['bar'], 'P')
    df['new_col3'] = ~df['new_col2']
    Status = np.in1d(df['Status'],'Approved, not yet paid')
    invertStatus = ~Status
    Current_Status = np.in1d(df['Current_Status'],'Approved, not yet paid')
    invertCurrent_Status = ~Current_Status
    df['new_col4'] = Status & invertCurrent_Status
    df['new_col5'] = Status & Current_Status
    df['new_col6'] = invertStatus & invertCurrent_Status
    #print df
    return df.groupby('id').sum()[['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']].astype(int)
def b(df):
    df['new_col1'] = df['foo'].notnull().astype(int)
    df['new_col2'] = df['bar'].factorize()[0]
    df['new_col3'] = (df['new_col2'] - 1).abs()
    df['Status'] = df['Status'].factorize()[0]
    df['invertStatus'] = (df['Status'] - 1).abs()
    df['Current_Status'] = df['Current_Status'].factorize()[0]
    df['invertCurrent_Status'] = (df['Current_Status'] - 1).abs()
    df['new_col4'] = df['Status'] & df['invertCurrent_Status']
    df['new_col5'] = df['Status'] & df['Current_Status']
    df['new_col6'] = df['invertStatus'] & df['invertCurrent_Status']
    return df.groupby('id').sum()[['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']]
def c(df):
    return df.groupby('id').apply(lambda x: pd.Series(dict(
        new_col1=(x['foo'].notnull()).sum(),
        new_col2=np.in1d(x['bar'],'P').sum(),
        new_col3=np.in1d(x['bar'],'C').sum(),
        new_col4=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
        new_col5=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, not yet paid'])).sum(),
        new_col6=(np.in1d(x['Status'],['Approved, paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
    )))
def d(df):
    return df.groupby('id').apply(lambda x: pd.Series(dict(
        new_col1=(x['foo'] != np.nan).sum(),
        new_col2=(x['bar'] == 'P').sum(),
        new_col3=(x['bar'] == 'C').sum(),
        new_col4=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, paid')).sum(),
        new_col5=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, not yet paid')).sum(),
        new_col6=((x['Status']=='Approved, paid') & (x['Current_Status']=='Approved, paid')).sum()
    )))
Test DataFrame:
id foo bar Status Current_Status
0 1 23 C Approved, paid Approved, paid
1 1 63 P Approved, not yet paid Approved, paid
2 1 84 P Approved, paid Approved, paid
3 1 125 P Approved, not yet paid Approved, not yet paid
4 1 12 C Approved, paid Approved, paid
5 2 23 C Approved, paid Approved, paid
6 2 63 P Approved, not yet paid Approved, paid
7 2 84 P Approved, paid Approved, paid
8 2 125 P Approved, not yet paid Approved, not yet paid
9 2 216 P Approved, not yet paid Approved, paid
10 2 12 C Approved, paid Approved, paid
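To reproduce the comparison outside IPython, the test DataFrame can be rebuilt and timed with the standard timeit module. The sketch below is deliberately reduced to a single count (the new_col4 condition) to stay short. The function names vectorized and per_group, the replication factor of 200 and the repeat count are all illustrative choices, not the settings behind the timings above, and absolute numbers will vary by machine and library version.

```python
import timeit
import pandas as pd

PAID = 'Approved, paid'
UNPAID = 'Approved, not yet paid'

# The 11-row test DataFrame.
small = pd.DataFrame({
    'id': [1] * 5 + [2] * 6,
    'foo': [23, 63, 84, 125, 12, 23, 63, 84, 125, 216, 12],
    'bar': list('CPPPC') + list('CPPPPC'),
    'Status': [PAID, UNPAID, PAID, UNPAID, PAID,
               PAID, UNPAID, PAID, UNPAID, UNPAID, PAID],
    'Current_Status': [PAID, PAID, PAID, UNPAID, PAID,
                       PAID, PAID, PAID, UNPAID, PAID, PAID],
})

def vectorized(df):
    # Build the mask once for the whole frame, then sum it per id.
    mask = (df['Status'] == UNPAID) & (df['Current_Status'] == PAID)
    return mask.groupby(df['id']).sum().astype(int)

def per_group(df):
    # Evaluate the same condition separately inside every group.
    cols = df.groupby('id')[['Status', 'Current_Status']]
    return cols.apply(
        lambda x: ((x['Status'] == UNPAID) & (x['Current_Status'] == PAID)).sum()
    ).astype(int)

# Replicate the rows to get a frame large enough to time.
big = pd.concat([small] * 200, ignore_index=True)

t_vec = timeit.timeit(lambda: vectorized(big), number=5)
t_grp = timeit.timeit(lambda: per_group(big), number=5)
print('vectorized: %.4f s, per_group: %.4f s' % (t_vec, t_grp))
```

With only two ids the gap between the two functions is modest. The per-group version pays a Python-level call for every group, so the difference should grow with the number of distinct ids, which is likely what dominates on a large real dataset.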