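I have a large DataFrame, and I want to change the values in some rows based on the values in other columns. The problem with my for loop is that it takes a very long time on large datasets.
The DataFrame's columns look like this: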
label | prediction | proba_label1 | proba_label2 | proba_label3 |
---|---|---|---|---|
label1 | label2 | 0.3 | 0.6 | 0.1 |
In this case, since proba_label2 (0.6) is below 0.9, the value in the "prediction" column should be changed to "label1".
for i, row in df.iterrows():
    pred_label = row['prediction']
    proba_label = 'proba_' + pred_label
    probability = row[proba_label]
    if probability <= 0.9:
        df.at[i, 'prediction'] = row['label']
Sample DataFrame:
import pandas as pd

data = {'host': ['A','B','A'],
        'label': ['label1', 'label2', 'label1'],
        'prediction': ['label1', 'label3', 'label3'],
        'proba_label1': [0.9, 0.03, 0.2],
        'proba_label3': [0.1, 0.95, 0.75],
        'proba_label2': [0, 0.02, 0.05]
        }
df = pd.DataFrame(data)
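Answer 0 (score: 3)
From the sample data and the likely context (a machine learning model that uses a softmax function for classification), it appears that the initial prediction is always the label with the highest probability.
You can use this fact to avoid any loops or lookups: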
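import numpy as np

# Row-wise maximum over the three probability columns
proba_max = np.max([df.proba_label1, df.proba_label2, df.proba_label3], axis=0)
# Fall back to the true label wherever the top probability is not above 0.9
df['prediction'] = np.where(proba_max <= 0.9, df['label'], df['prediction'])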
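The shortcut relies on the prediction always being the highest-probability label. Before dropping the per-row lookup, it may be worth checking that this holds for your data. A minimal sketch on the question's sample frame (the names proba, top_label and assumption_holds are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'host': ['A', 'B', 'A'],
    'label': ['label1', 'label2', 'label1'],
    'prediction': ['label1', 'label3', 'label3'],
    'proba_label1': [0.9, 0.03, 0.2],
    'proba_label3': [0.1, 0.95, 0.75],
    'proba_label2': [0, 0.02, 0.05],
})

# Only the probability columns
proba = df.filter(like='proba_')
# Name of the column holding each row's largest probability, with the prefix cut off
top_label = proba.idxmax(axis=1).str[len('proba_'):]
# True only if every prediction equals its row's top-probability label
assumption_holds = (top_label == df['prediction']).all()
print(assumption_holds)
```

If the check fails for some rows, the max-based shortcut would compare the wrong probability for those rows, and a lookup by column name is the safer choice.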
Answer 1 (score: 2)
If the relevant probability is always the largest one, use only the max of the proba_ columns:
df['prediction'] = np.where(df.filter(like='proba_').max(axis=1) <= 0.9,
                            df['label'],
                            df['prediction'])
In the general case, use melt to select values by column name (as a replacement for lookup), then set the new values with numpy.where:
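# Long format: one row per (original row, column) pair, original index kept
melt = df.melt(['label','prediction'], ignore_index=False)
# Keep the value whose column name matches 'proba_' + prediction; assignment aligns on the index
df['val'] = melt.loc['proba_' + melt['prediction'] == melt['variable'], 'value']
df['prediction'] = np.where(df['val'] <= 0.9, df['label'], df['prediction'])
print(df)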
  host   label prediction  proba_label1  proba_label3  proba_label2   val
0    A  label1     label1          0.90          0.10          0.00   0.9
1    B  label2     label3          0.03          0.95          0.02  0.95
2    A  label1     label1          0.20          0.75          0.05  0.75
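The same per-row lookup can also be written with NumPy integer indexing instead of melt. This is a different technique from the one in the answer, sketched under the assumption that every value in prediction has a matching proba_ column; the names proba_cols, col_idx and picked are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'host': ['A', 'B', 'A'],
    'label': ['label1', 'label2', 'label1'],
    'prediction': ['label1', 'label3', 'label3'],
    'proba_label1': [0.9, 0.03, 0.2],
    'proba_label3': [0.1, 0.95, 0.75],
    'proba_label2': [0, 0.02, 0.05],
})

proba_cols = [c for c in df.columns if c.startswith('proba_')]
# Position of each row's predicted-label column within proba_cols (-1 if missing)
wanted = ('proba_' + df['prediction']).to_numpy()
col_idx = pd.Index(proba_cols).get_indexer(wanted)
# Assumption: every prediction has a matching column; -1 would silently pick the last one
assert (col_idx >= 0).all()
# One value per row: row i, column col_idx[i]
picked = df[proba_cols].to_numpy()[np.arange(len(df)), col_idx]
df['prediction'] = np.where(picked <= 0.9, df['label'], df['prediction'])
print(df['prediction'].tolist())
```

Because the values are picked by position in row order, no index alignment is involved, which keeps the np.where call straightforward.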
Solution without a helper column:
melt = df.melt(['label','prediction'], ignore_index=False)
s = melt.loc['proba_' + melt['prediction'] == melt['variable'], 'value']
# np.where is positional, so put s back into df's row order first
s = s.reindex(df.index)
df['prediction'] = np.where(s <= 0.9, df['label'], df['prediction'])
# Alternative to np.where, suggested as safer if some labels do not match:
# df.loc[s <= 0.9, 'prediction'] = df['label']
print(df)
  host   label prediction  proba_label1  proba_label3  proba_label2
0    A  label1     label1          0.90          0.10          0.00
1    B  label2     label3          0.03          0.95          0.02
2    A  label1     label1          0.20          0.75          0.05
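One caveat when a Series taken from the melted frame is passed straight to np.where: np.where compares by position, while the melted rows are ordered by column, not by original row. A small sketch with a hypothetical two-row frame (small, s and aligned are illustrative names) shows how the order can differ and how reindex restores it:

```python
import pandas as pd

# Hypothetical frame where row 0 predicts label2 and row 1 predicts label1
small = pd.DataFrame({
    'label': ['label1', 'label2'],
    'prediction': ['label2', 'label1'],
    'proba_label1': [0.2, 0.95],
    'proba_label2': [0.8, 0.05],
})

melt = small.melt(['label', 'prediction'], ignore_index=False)
s = melt.loc['proba_' + melt['prediction'] == melt['variable'], 'value']
# The proba_label1 block comes first in the melted frame, so row 1 precedes row 0
print(s.index.tolist())

# Reindexing puts the values back into the frame's own row order
aligned = s.reindex(small.index)
print(aligned.tolist())
```

Assigning the Series to a column, as in the helper-column variant, aligns on the index automatically, so this only matters when the Series is used positionally.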
Performance:
data = {'host': ['A','B','A'],
'label': ['label1', 'label2', 'label1'],
'prediction': ['label1', 'label3', 'label3'],
'proba_label1': [0.9, 0.03, 0.2],
'proba_label3': [0.1, 0.95, 0.75],
'proba_label2': [0, 0.02, 0.05]
}
df = pd.DataFrame(data)
# 30000 rows
df = pd.concat([df] * 10000, ignore_index=True)
# deleted answer by @Nk03
In [85]: %timeit df.apply( lambda x: x['label'] if x[f"proba_{x['prediction']}"] <= 0.9 else x['prediction'], 1)
455 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [86]: %timeit df.apply(fun, axis=1)
482 ms ± 58.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [87]: %%timeit
...: melt = df.melt(['label','prediction'], ignore_index=False)
...: df['val'] = melt.loc['proba_' + melt['prediction'] == melt['variable'], 'value']
...:
...: df['prediction'] = np.where(df['val'] <= 0.9, df['label'], df['prediction'])
...:
72.2 ms ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
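The figures above come from IPython's %timeit. Outside IPython, a comparable measurement can be sketched with the standard timeit module. This version also times the max-based approach, which was not part of the runs above. Timings depend on hardware and library versions, so no numbers are shown here, and the function names with_apply and with_max are illustrative:

```python
import timeit

import numpy as np
import pandas as pd

data = {'host': ['A', 'B', 'A'],
        'label': ['label1', 'label2', 'label1'],
        'prediction': ['label1', 'label3', 'label3'],
        'proba_label1': [0.9, 0.03, 0.2],
        'proba_label3': [0.1, 0.95, 0.75],
        'proba_label2': [0, 0.02, 0.05]}
# Same 30000-row frame as in the benchmark setup
df = pd.concat([pd.DataFrame(data)] * 10000, ignore_index=True)


def with_apply(frame):
    # Row-wise lookup of the predicted label's probability
    return frame.apply(
        lambda x: x['label'] if x[f"proba_{x['prediction']}"] <= 0.9 else x['prediction'],
        axis=1)


def with_max(frame):
    # Vectorized; valid only if the prediction is always the top-probability label
    top = frame.filter(like='proba_').max(axis=1)
    return np.where(top <= 0.9, frame['label'], frame['prediction'])


for fn in (with_apply, with_max):
    # Best of three single runs, reported in milliseconds
    best = min(timeit.repeat(lambda: fn(df), number=1, repeat=3))
    print(f"{fn.__name__}: {best * 1000:.1f} ms (best of 3)")
```

Neither function modifies df, so each run starts from the same data and the results can be compared directly.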