I'm converting some numpy code to use pandas DataFrames.
The data can contain NaN values, so the original code used numpy's nan functions, e.g. nanstd. My impression was that pandas skips NaN values by default, so I switched to the regular versions of the same functions, e.g. np.std(). I want to group the data and calculate some statistics, but when I use agg() I get different results from the original code, even though the data doesn't contain any NaNs. Here's a small example demonstrating the problem:
>>> arr = np.array([[1.17136, 1.11816],
                    [1.13096, 1.04134],
                    [1.13865, 1.03414],
                    [1.09053, 0.96330],
                    [1.02455, 0.94728],
                    [1.18182, 1.04950],
                    [1.09620, 1.06686]])
>>> df = pd.DataFrame(arr,
                      index=['foo']*3 + ['bar']*4,
                      columns=['A', 'B'])
>>> df
A B
foo 1.17136 1.11816
foo 1.13096 1.04134
foo 1.13865 1.03414
bar 1.09053 0.96330
bar 1.02455 0.94728
bar 1.18182 1.04950
bar 1.09620 1.06686
>>> g = df.groupby(df.index)
>>> g['A'].agg([np.mean, np.median, np.std])
mean median std
bar 1.098275 1.093365 0.064497
foo 1.146990 1.138650 0.021452
>>> g['A'].agg([np.mean, np.median, np.nanstd])
mean median nanstd
bar 1.098275 1.093365 0.055856
foo 1.146990 1.138650 0.017516
If I calculate the std values using the numpy functions directly, I get the expected result in both cases:
>>> np.std(df.loc['foo', 'A'])
0.01751583474079002
>>> np.nanstd(df.loc['foo', 'A'])
0.017515834740790021
What is going on inside the agg() function?
EDIT:
As mentioned in Vivek Harikrishnan's answer, pandas uses a different method to calculate the std. This seems to match my results:
>>> g['A'].agg(['mean', 'median', 'std'])
mean median std
bar 1.098275 1.093365 0.064497
foo 1.146990 1.138650 0.021452
If I specify a lambda that calls np.std(), I get the expected result:
>>> g['A'].agg([np.mean, np.median, lambda x: np.std(x)])
mean median <lambda>
bar 1.098275 1.093365 0.055856
foo 1.146990 1.138650 0.017516
This suggests that the pandas function gets called when I write np.std inside agg(). The question is: why does this happen when I explicitly tell it to use the numpy function?
Answer 0 (score: 2)
Pandas seems to replace np.std in the .agg([np.mean, np.median, np.std]) call with the built-in Pandas Series.std() method (or a call to np.std(series, ddof=1)):
In [337]: g['A'].agg([np.mean, np.median, np.std, lambda x: np.std(x)])
Out[337]:
mean median std <lambda>
bar 1.098275 1.093365 0.064497 0.055856
foo 1.146990 1.138650 0.021452 0.017516
Note that np.std and lambda x: np.std(x) produce different results.
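One way to check which implementation actually runs, using only public calls, is the following quick sketch with the example data from the question (behaviour as in the pandas version discussed here; newer releases may treat the callable differently):
import numpy as np
import pandas as pd

# same data as in the question
arr = np.array([[1.17136, 1.11816], [1.13096, 1.04134], [1.13865, 1.03414],
                [1.09053, 0.96330], [1.02455, 0.94728], [1.18182, 1.04950],
                [1.09620, 1.06686]])
df = pd.DataFrame(arr, index=['foo'] * 3 + ['bar'] * 4, columns=['A', 'B'])
g = df.groupby(df.index)

# agg(np.std) returns the same numbers as pandas' own Series.std() (ddof=1),
# ~0.021452 for foo, which suggests the pandas method is what actually ran
print(g['A'].agg(np.std))
print(df.loc['foo', 'A'].std())

# calling numpy on the raw array bypasses any substitution (ddof=0), ~0.017516 for foo
print(np.std(df.loc['foo', 'A'].values))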
If we explicitly specify ddof=1 (the Pandas default), we get the same results:
In [338]: g['A'].agg([np.mean, np.median, np.std, lambda x: np.std(x, ddof=1)])
Out[338]:
mean median std <lambda>
bar 1.098275 1.093365 0.064497 0.064497
foo 1.146990 1.138650 0.021452 0.021452
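For reference, the only difference between the two conventions is the divisor: ddof=0 divides the summed squared deviations by n, while ddof=1 divides by n - 1. A small sketch reproducing the foo numbers by hand (the three values are copied from column A of the example):
import numpy as np

x = np.array([1.17136, 1.13096, 1.13865])    # the 'foo' rows of column A
ss = ((x - x.mean()) ** 2).sum()             # sum of squared deviations
n = len(x)

print(np.sqrt(ss / n))        # ddof=0: what np.std / np.nanstd compute, ~0.017516
print(np.sqrt(ss / (n - 1)))  # ddof=1: what pandas' std computes,       ~0.021452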
Using the built-in 'std' gives the same results:
In [341]: g['A'].agg([np.mean, np.median, 'std', lambda x: np.std(x, ddof=1)])
Out[341]:
mean median std <lambda>
bar 1.098275 1.093365 0.064497 0.064497
foo 1.146990 1.138650 0.021452 0.021452
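Conversely, if the ddof=0 (population) figure is what the original numpy code actually needs, the explicit route is to pass ddof yourself. A sketch (not from the original answer, continuing from the g defined above, using documented pandas parameters):
# g is the groupby object from the question's example
print(g['A'].std(ddof=0))                      # ~0.055856 for bar, ~0.017516 for foo
print(g['A'].agg(lambda s: s.std(ddof=0)))     # the same thing spelled out inside agg()
Unlike np.std, Series.std(ddof=0) still skips NaN values by default, which matches the np.nanstd behaviour the original code relied on.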
The second rule of the Zen of Python says it all:
In [340]: import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit. # <----------- NOTE !!!
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!