I have the following DataFrame df with a column named "Class":
Class
0 Individual
1 Group
2 A
3 B
4 C
5 D
6 Group
I want to replace everything other than "Group" and "Individual" with "Other", so the final DataFrame is:
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
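The DataFrame is huge, with more than 600,000 rows. What is the best way to find the values other than "Group" and "Individual" and replace them with "Other"?
I have seen examples that use replace, such as: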
df['Class'] = df['Class'].replace({'A':'Other', 'B':'Other'})
But I have far too many unique values to list each one individually. I would rather just specify "Group" and "Individual" as the subset to exclude.
Answer 0 (score: 5)
I think you need:
import numpy as np
# Keep 'Individual' and 'Group'; every other value becomes 'Other'
df['Class'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
print (df)
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
Another solution (slower):
m = (df['Class'] == 'Individual') | (df['Class'] == 'Group')
df['Class'] = np.where(m, df['Class'], 'Other')
Another solution:
df['Class'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
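The three variants above can be checked against each other on the sample data from the question. This is only a minimal sketch: the names `keep`, `via_isin`, `via_eq`, and `via_map` are illustrative and do not come from the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Class': ['Individual', 'Group', 'A', 'B', 'C', 'D', 'Group']})
keep = ['Individual', 'Group']  # values to leave untouched (illustrative name)

# Variant 1: isin builds the boolean mask in a single call
via_isin = pd.Series(np.where(df['Class'].isin(keep), df['Class'], 'Other'),
                     index=df.index)

# Variant 2: explicit comparisons combined with |
m = (df['Class'] == 'Individual') | (df['Class'] == 'Group')
via_eq = pd.Series(np.where(m, df['Class'], 'Other'), index=df.index)

# Variant 3: map turns unmatched values into NaN, which fillna then replaces
via_map = df['Class'].map({'Individual': 'Individual', 'Group': 'Group'}).fillna('Other')

print(via_isin.tolist())
print(via_eq.tolist() == via_isin.tolist(), via_map.tolist() == via_isin.tolist())
```

Note that the map variant relies on the column containing no genuine missing values, since fillna would turn those into 'Other' as well.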
Performance (on real data it depends on the number of replacements):
#[700000 rows x 1 columns]
df = pd.concat([df] * 100000, ignore_index=True)
#print (df)
In [208]: %timeit df['Class1'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
25.9 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [209]: %timeit df['Class2'] = np.where((df['Class'] == 'Individual') | (df['Class'] == 'Group'), df['Class'], 'Other')
120 ms ± 6.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [210]: %timeit df['Class3'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
95.7 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [211]: %timeit df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
97.8 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
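The `%timeit` lines above only work inside IPython. Outside it, a comparable measurement could be sketched with the standard `timeit` module, as below. The helper names `with_isin` and `with_map` and the repeat counts are assumptions made for illustration, and the timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'Class': ['Individual', 'Group', 'A', 'B', 'C', 'D', 'Group']})
big = pd.concat([base] * 100000, ignore_index=True)  # 700000 rows
keep = ['Individual', 'Group']

def with_isin():
    # np.where over an isin mask
    return np.where(big['Class'].isin(keep), big['Class'], 'Other')

def with_map():
    # map to NaN for unmatched values, then fill them
    return big['Class'].map({'Individual': 'Individual', 'Group': 'Group'}).fillna('Other')

for fn in (with_isin, with_map):
    # Best of 3 repeats, 3 calls per repeat; report time per call
    best = min(timeit.repeat(fn, number=3, repeat=3)) / 3
    print(f'{fn.__name__}: {best * 1000:.1f} ms per call')
```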
Answer 1 (score: 2)
Another approach could be:
df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
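To make the one-liner above easier to follow, here it is split into its two steps on the sample data from the question. This is a sketch only; the name `to_replace` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Class': ['Individual', 'Group', 'A', 'B', 'C', 'D', 'Group']})

# True for rows that are neither 'Individual' nor 'Group'
to_replace = ~df['Class'].isin(['Individual', 'Group'])
print(to_replace.sum())  # number of rows that will be overwritten

# Assign through .loc so the change is written into df itself
df.loc[to_replace, 'Class'] = 'Other'
print(df)
```

Unlike the np.where variants, this modifies the existing column in place instead of building a new one.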
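Answer 2 (score: 1)
You can do it this way:
# Collect the unique values and drop the ones to keep
# (unique() returns a NumPy array, so convert it to a set first)
others = set(df['Class'].unique()) - {'Individual', 'Group'}
# Overwrite every row whose value is in the remaining set
df.loc[df['Class'].isin(others), 'Class'] = 'Other'
Sorry, this is only a rough sketch, but the principle is the same.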
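Answer 3 (score: 1)
You can use pd.Series.where:
# Assign the result back; calling where(..., inplace=True) on df['Class']
# is a chained operation and may not update df
df['Class'] = df['Class'].where(df['Class'].isin(['Individual', 'Group']), 'Other')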
print(df)
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
This should be efficient compared with map + fillna:
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df['Class'].where(df['Class'].isin(['Individual', 'Group']), 'Other')
# 60.3 ms per loop
%timeit df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
# 133 ms per loop
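As a side note on semantics, `Series.where` keeps values where the condition is True and replaces the rest, while `Series.mask` does the inverse. The sketch below shows both producing the same result on the sample data; the names `s`, `keep`, `kept_where`, and `kept_mask` are illustrative.

```python
import pandas as pd

s = pd.Series(['Individual', 'Group', 'A', 'B', 'C', 'D', 'Group'], name='Class')
keep = s.isin(['Individual', 'Group'])

kept_where = s.where(keep, 'Other')  # replace where the condition is False
kept_mask = s.mask(~keep, 'Other')   # replace where the condition is True

print(kept_where.equals(kept_mask))
print(kept_where.tolist())
```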