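I have several simple functions that I need to apply to every row of certain columns in a dataframe. The dataframe is very large, around ten million rows. My dataframe looks like this: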
Date location city number value
12/3/2018 NY New York 2 500
12/1/2018 MN Minneapolis 3 600
12/2/2018 NY Rochester 1 800
12/3/2018 WA Seattle 2 400
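I have the following function:
def normalized_location(row):
    # Map a row's city to its location code; anything else is "Other".
    if row['city'] == "Minneapolis":
        return "FCM"
    elif row['city'] == "Seattle":
        return "FCS"
    else:
        return "Other"
Then I use:
df['Normalized Location'] = df.apply(lambda row: normalized_location(row), axis=1)
This is too slow. How can I make it more efficient?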
Answer 0 (score: 6)
We can use map together with a defaultdict to do this quickly.
from collections import defaultdict
d = defaultdict(lambda: 'Other')
d.update({"Minneapolis": "FCM", "Seattle": "FCS"})
df['normalized_location'] = df['city'].map(d)
print(df)
Date location city number value normalized_location
0 12/3/2018 NY New York 2 500 Other
1 12/1/2018 MN Minneapolis 3 600 FCM
2 12/2/2018 NY Rochester 1 800 Other
3 12/3/2018 WA Seattle 2 400 FCS
...this avoids a fillna call, for performance reasons. The approach generalizes easily to multiple replacements.
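As a minimal sketch of why the fillna call can be skipped: when map is given a dict subclass that defines `__missing__` (such as a defaultdict), pandas uses that default for unmatched keys instead of producing NaN. The Series and the names `codes`, `lookup`, `with_fillna` and `with_default` below are made up for illustration.

```python
from collections import defaultdict

import pandas as pd

# Small stand-in for the city column from the question.
cities = pd.Series(["New York", "Minneapolis", "Rochester", "Seattle"])

codes = {"Minneapolis": "FCM", "Seattle": "FCS"}

# Plain dict: unmatched cities become NaN, so a second pass (fillna) is needed.
with_fillna = cities.map(codes).fillna("Other")

# defaultdict: the "Other" default is used for unmatched cities in one pass.
lookup = defaultdict(lambda: "Other", codes)
with_default = cities.map(lookup)

print(with_default.tolist())  # ['Other', 'FCM', 'Other', 'FCS']
```

Both variants give the same column; the defaultdict version just saves the extra pass over the data.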
Answer 1 (score: 5)
You may want to use np.select:
import numpy as np
conds = [df.city == 'Minneapolis', df.city == 'Seattle']
choices = ['FCM', 'FCS']
df['normalized_location'] = np.select(conds, choices, default='other')
>>> df
Date location city number value normalized_location
0 12/3/2018 NY New York 2 500 other
1 12/1/2018 MN Minneapolis 3 600 FCM
2 12/2/2018 NY Rochester 1 800 other
3 12/3/2018 WA Seattle 2 400 FCS
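If there are more than a couple of cities, the condition and choice lists for np.select can be built from a single mapping rather than written out by hand. This is only a sketch; the `codes` dict and the small frame are assumptions for illustration, not part of the answer.

```python
import numpy as np
import pandas as pd

# Small stand-in for the dataframe from the question.
df = pd.DataFrame({"city": ["New York", "Minneapolis", "Rochester", "Seattle"]})

# Hypothetical mapping; extend it with as many cities as needed.
codes = {"Minneapolis": "FCM", "Seattle": "FCS"}

# One boolean array per city, in the same order as the choices.
city = df["city"].to_numpy()
conds = [city == name for name in codes]
choices = list(codes.values())

# Rows matching no condition receive the default.
df["normalized_location"] = np.select(conds, choices, default="other")

print(df["normalized_location"].tolist())  # ['other', 'FCM', 'other', 'FCS']
```

np.select is most useful when the conditions involve several columns; for a plain one-column lookup like this one, map is usually the simpler tool.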
Answer 2 (score: 2)
You can use nested np.where() calls:
import numpy as np
df['city'] = np.where(df['city']=='Minneapolis', 'FCM', np.where(df['city']=='Seattle', 'FCS', 'Other'))
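Note that the line above overwrites the city column. If the original values should be kept, as in the question, the same nested np.where can write to a new column instead. A minimal sketch, with a made-up four-row frame:

```python
import numpy as np
import pandas as pd

# Small stand-in for the dataframe from the question.
df = pd.DataFrame({"city": ["New York", "Minneapolis", "Rochester", "Seattle"]})

city = df["city"]

# The inner np.where handles the second condition and the fallback value.
df["normalized_location"] = np.where(
    city == "Minneapolis",
    "FCM",
    np.where(city == "Seattle", "FCS", "Other"),
)

print(df)
```

Nesting gets hard to read beyond two or three conditions, at which point np.select or map is likely the better fit.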
Answer 3 (score: 1)
Try the following:
map_ = {'Minneapolis':'FCM', 'Seattle':'FCS'}
df.loc[:,'city'] = df.loc[:,'city'].map(map_).fillna('Other')
print(df)
Date location city number value
0 12/3/2018 NY Other 2 500
1 12/1/2018 MN FCM 3 600
2 12/2/2018 NY Other 1 800
3 12/3/2018 WA FCS 2 400
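To see how much a vectorized lookup helps on your own machine, a rough timing harness like the one below can compare the row-wise apply from the question with the map approach. The synthetic data, the row count and the helper names `with_apply` and `with_map` are all assumptions for illustration; actual timings depend on your data and pandas version.

```python
import timeit
from collections import defaultdict

import numpy as np
import pandas as pd

# Synthetic data: the row count and city list are made up for illustration.
rng = np.random.default_rng(0)
cities = rng.choice(["New York", "Minneapolis", "Rochester", "Seattle"], size=10_000)
df = pd.DataFrame({"city": cities})

codes = {"Minneapolis": "FCM", "Seattle": "FCS"}


def with_apply():
    # Row-wise: calls a Python function once per row.
    return df.apply(lambda row: codes.get(row["city"], "Other"), axis=1)


def with_map():
    # Column-wise: a single lookup pass over the city column.
    return df["city"].map(defaultdict(lambda: "Other", codes))


for fn in (with_apply, with_map):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {seconds:.3f}s for 3 runs")
```

Increase `size` toward the real row count once the small run confirms both functions return the same values.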