How to modify column values in a pd.DataFrame

Time: 2018-08-09 06:58:23

Tags: python pandas

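Background: I want to modify the values in a DataFrame so that only the top 20 sports are kept and every other sport is shown as "Others". The new column starts as a copy of an existing column: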

athlete_events['Sport_modified'] = athlete_events['Sport']

I then build a filter containing the names of the top 20 sports:

top20_sport = athlete_events['Sport'].value_counts().head(20).index

I tried the modification in the following ways.

Method 1:

def classify_sports(cols, filters):
    for i in cols:
        if i in filters:
            pass
        else:
            i = 'Others'

classify_sports(athlete_events.Sport_modified, top20_sport)

Method 2:

athlete_events.Sport_modified.apply(lambda x : x if x in top20_sport else 'Others')

However, neither of the two methods above works. The only approach I have gotten to work is the code below:

athlete_events.loc[
(athlete_events['Sport'] !='Athletics')&
(athlete_events['Sport'] !='Gymnastics')&
(athlete_events['Sport'] !='Swimming')&
(athlete_events['Sport'] !='Shooting')&
(athlete_events['Sport'] !='Cycling')&
(athlete_events['Sport'] !='Fencing')&
(athlete_events['Sport'] !='Rowing')&
(athlete_events['Sport'] !='Cross Country Skiing')&
(athlete_events['Sport'] !='Alpine Skiing')&
(athlete_events['Sport'] !='Wrestling')&
(athlete_events['Sport'] !='Football')&
(athlete_events['Sport'] !='Sailing')&
(athlete_events['Sport'] !='Equestrianism')&
(athlete_events['Sport'] !='Canoeing')&
(athlete_events['Sport'] !='Boxing')&
(athlete_events['Sport'] !='Speed Skating')&
(athlete_events['Sport'] !='Ice Hockey')&
(athlete_events['Sport'] !='Hockey')&
(athlete_events['Sport'] !='Biathlon')&
(athlete_events['Sport'] !='Basketball')
,'Sport_modified'] = 'Others'

What is wrong with the two methods above? Thanks for your help.

1 Answer:

Answer 0: (score: 2)

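Your first method can never work, because the function neither returns a Series nor modifies one in place. Assigning to the loop variable i only rebinds a local name; the underlying column is left unchanged.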
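
To make the point about the loop variable concrete, here is a minimal sketch. The toy Series and the function names classify_in_loop and classify_returning are hypothetical stand-ins, not part of the original question; the second function is just one possible way to return a new Series.

```python
import pandas as pd

# Hypothetical toy data standing in for athlete_events['Sport']
sports = pd.Series(["Swimming", "Curling", "Swimming", "Luge"])
keep = ["Swimming"]


def classify_in_loop(col, filters):
    # Same shape as Method 1: reassigning the loop variable
    for value in col:
        if value not in filters:
            value = "Others"  # rebinds the local name only; col is untouched
    # no return statement, so the call evaluates to None


result = classify_in_loop(sports, keep)


def classify_returning(col, filters):
    # Build and return a new Series instead of reassigning the loop variable
    return pd.Series(
        [value if value in filters else "Others" for value in col],
        index=col.index,
    )


fixed = classify_returning(sports, keep)
```

After running this, sports still holds its original values and result is None, while fixed holds the relabelled values and would still need to be assigned to a column.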

Your second method does not operate in place; you need to assign the result back to a Series. For example:

df['sport_modified'] = df['sport'].apply(lambda x : x if x in top20_sport else 'Others')

You can express your final solution more efficiently with pd.Series.isin, negating the mask with ~:

L = ['Athletics', 'Gymnastics', ...]

df.loc[~df['sport'].isin(L), 'sport_modified'] = 'Others'
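
As a rough end-to-end sketch of the same idea, the list of names can be taken from value_counts instead of being typed out by hand. The toy frame below is hypothetical (the real athlete_events data is not shown), it keeps the top 2 sports rather than 20, and it uses Series.where as an alternative to the .loc assignment above.

```python
import pandas as pd

# Hypothetical toy frame; the real athlete_events data is not shown in the question
df = pd.DataFrame(
    {
        "Sport": [
            "Swimming", "Swimming", "Swimming",
            "Fencing", "Fencing",
            "Curling", "Luge",
        ]
    }
)

# Keep the 2 most frequent sports here (the question uses 20)
top_sports = df["Sport"].value_counts().head(2).index

# Series.where keeps values where the mask is True and substitutes "Others" elsewhere
df["Sport_modified"] = df["Sport"].where(df["Sport"].isin(top_sports), "Others")
```

Because where returns a new Series, the original Sport column is left untouched and the result is assigned to Sport_modified in a single step.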