I have a DataFrame df:
Transportation_Mode time_delta trip_id segmentid Vincenty_distance velocity acceleration jerk
walk 1 1 1 1.551676553 1.551676553 0.550163852 -1.017629555
walk 1 1 1 1.70920675 1.70920675 0.16257622 -0.39166534
walk 1 1 1 1.871782971 1.871782971 -0.22908912 -0.734438511
walk 12 1 1 23.16466284 1.93038857 0.324972586 -0.331839143
walk 1 1 1 5.830059603 5.830059603 -3.657097132 2.614438854
bus 1 16 5 8.418372046 8.418372046 -7.259019484 7.40735053
bus 23 16 5 26.66510892 1.159352562 0.148331046 -0.036318522
bus 1 16 5 4.570966614 4.570966614 -0.68699497 -0.889126918
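I want to remove the outliers within each Transportation_Mode group, based on the quantiles [0.05, 0.95].
My question is similar to the discussion "Remove outliers in Pandas dataframe with groupby".
The code I wrote is: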
res = df.groupby("Transportation_Mode")["Vincenty_distance"].quantile([0.05, 0.95]).unstack(level=1)
df.loc[ (res.loc[ df.Transportation_Mode, 0.05] < df.Vincenty_distance.values) & (df.Vincenty_distance.values < res.loc[df.Transportation_Mode, 0.95]) ]
But I get the error ValueError: cannot reindex from a duplicate axis. I don't know where I went wrong.
The full input data is available at https://drive.google.com/file/d/1JjvS7igTmrtLA4E5Rs5D6tsdAXqzpYqX/view?usp=sharing
Answer 0 (score: 2)
Use map to build a Series that has the same size as the original DataFrame, so it can be used for filtering:
m1 = (df.Transportation_Mode.map(res[0.05]) < df.Vincenty_distance)
m2 = (df.Vincenty_distance.values < df.Transportation_Mode.map(res[0.95]))
df = df[m1 & m2]
print (df)
Transportation_Mode time_delta trip_id segmentid Vincenty_distance \
1 walk 1 1 1 1.709207
2 walk 1 1 1 1.871783
4 walk 1 1 1 5.830060
5 bus 1 16 5 8.418372
velocity acceleration jerk
1 1.709207 0.162576 -0.391665
2 1.871783 -0.229089 -0.734439
4 5.830060 -3.657097 2.614439
5 8.418372 -7.259019 7.407351
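For anyone who wants to reproduce this without downloading the full file, here is a minimal self-contained sketch of the same map approach. It assumes only the eight sample rows from the question and keeps just the two columns the filter needs; the names lower, upper and filtered are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question (only the columns the filter uses).
df = pd.DataFrame({
    "Transportation_Mode": ["walk"] * 5 + ["bus"] * 3,
    "Vincenty_distance": [
        1.551676553, 1.70920675, 1.871782971, 23.16466284, 5.830059603,
        8.418372046, 26.66510892, 4.570966614,
    ],
})

# One row per mode, one column per quantile (column labels 0.05 and 0.95).
res = (
    df.groupby("Transportation_Mode")["Vincenty_distance"]
    .quantile([0.05, 0.95])
    .unstack(level=1)
)

# map() looks up each row's mode in res, so both bounds keep df's own index.
lower = df["Transportation_Mode"].map(res[0.05])
upper = df["Transportation_Mode"].map(res[0.95])

# Keep rows strictly between their group's 5% and 95% quantiles.
filtered = df[(lower < df["Vincenty_distance"]) & (df["Vincenty_distance"] < upper)]
print(filtered.index.tolist())  # [1, 2, 4, 5]
```

Because lower and upper are aligned with df row by row, the boolean mask can be passed straight to df[...] without any reindexing.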
Answer 1 (score: 2)
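Actually, if we look at
(res.loc[ df.Transportation_Mode, 0.05] < df.Vincenty_distance.values) & (df.Vincenty_distance.values < res.loc[df.Transportation_Mode, 0.95])
it returns a Series of type bool that marks which rows of the original df to keep. The catch is that this Series is indexed by the Transportation_Mode values (which repeat) rather than by the row labels of df, so df.loc[] cannot align it. We only need to pass the underlying values of that Series, by appending .values, when handing it to df.loc[]. The following should work: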
df.loc[ ((res.loc[ df.Transportation_Mode, 0.05] < df.Vincenty_distance.values) & (df.Vincenty_distance.values < res.loc[df.Transportation_Mode, 0.95])).values]
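To see why stripping the index helps, here is a small sketch that again assumes only the sample rows from the question. The names dist, lower, upper, mask and filtered are illustrative, and to_numpy() is used in place of .values, which should be equivalent here.

```python
import pandas as pd

# Sample rows copied from the question (only the columns the filter uses).
df = pd.DataFrame({
    "Transportation_Mode": ["walk"] * 5 + ["bus"] * 3,
    "Vincenty_distance": [
        1.551676553, 1.70920675, 1.871782971, 23.16466284, 5.830059603,
        8.418372046, 26.66510892, 4.570966614,
    ],
})

res = (
    df.groupby("Transportation_Mode")["Vincenty_distance"]
    .quantile([0.05, 0.95])
    .unstack(level=1)
)

dist = df["Vincenty_distance"].to_numpy()
# res.loc[...] returns one bound per row of df, but labelled by mode name.
lower = res.loc[df["Transportation_Mode"], 0.05]
upper = res.loc[df["Transportation_Mode"], 0.95]
mask = (lower < dist) & (dist < upper)

# The mask's labels are mode names, not df's row labels 0..7.
print(mask.index.tolist())

# A plain boolean array has no labels, so df.loc applies it by position.
filtered = df.loc[mask.to_numpy()]
print(filtered.index.tolist())  # [1, 2, 4, 5]
```

The mask already has the right length and order; only its labels stand in the way, which is why dropping them is enough.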