Pandas agg function gives different results for numpy std vs nanstd

Time: 2017-12-06 14:07:22

Tags: python pandas

I am converting some numpy code to use a pandas DataFrame. The data may contain NaN values, so the original code uses numpy's nan functions, such as nanstd. I was under the impression that pandas skips NaN values by default, so I switched to the regular versions of the same functions.
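The impression about NaN handling can be checked with a minimal sketch. The array below is illustrative data that does not come from the question, and the mean is used only because it has no extra parameters that could blur the comparison.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not taken from the question).
values = np.array([1.0, 2.0, np.nan, 4.0])
s = pd.Series(values)

plain_mean = np.mean(values)    # the plain numpy function propagates NaN
nan_mean = np.nanmean(values)   # the nan-aware numpy function ignores NaN
pandas_mean = s.mean()          # the pandas reduction skips NaN (skipna=True by default)

print(plain_mean, nan_mean, pandas_mean)
```

So for NaN handling alone, the pandas reduction lines up with the numpy nan function rather than with the plain one.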

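I want to group the data and compute some statistics using agg(), but when I use np.std() I get results that differ from the original code, even when the data contains no NaNs.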

Here is a small example that demonstrates the problem:

>>> arr = np.array([[1.17136, 1.11816],
...                 [1.13096, 1.04134],
...                 [1.13865, 1.03414],
...                 [1.09053, 0.96330],
...                 [1.02455, 0.94728],
...                 [1.18182, 1.04950],
...                 [1.09620, 1.06686]])
>>> df = pd.DataFrame(arr, index=['foo']*3 + ['bar']*4, columns=['A', 'B'])
>>> df
           A        B
foo  1.17136  1.11816
foo  1.13096  1.04134
foo  1.13865  1.03414
bar  1.09053  0.96330
bar  1.02455  0.94728
bar  1.18182  1.04950
bar  1.09620  1.06686
>>> g = df.groupby(df.index)
>>> g['A'].agg([np.mean, np.median, np.std])
         mean    median       std
bar  1.098275  1.093365  0.064497
foo  1.146990  1.138650  0.021452
>>> g['A'].agg([np.mean, np.median, np.nanstd])
         mean    median    nanstd
bar  1.098275  1.093365  0.055856
foo  1.146990  1.138650  0.017516

If I compute the std values with the numpy functions directly, I get the expected result in both cases. What is happening inside agg()?

>>> np.std(df.loc['foo', 'A'])
0.01751583474079002
>>> np.nanstd(df.loc['foo', 'A'])
0.017515834740790021

Edit:

As mentioned in the answer linked by Vivek Harikrishnan, pandas uses a different method to calculate std. This seems to match my results:

>>> g['A'].agg(['mean', 'median', 'std'])
         mean    median       std
bar  1.098275  1.093365  0.064497
foo  1.146990  1.138650  0.021452

If I specify a lambda that calls np.std(), I get the expected result:

>>> g['A'].agg([np.mean, np.median, lambda x: np.std(x)])
         mean    median  <lambda>
bar  1.098275  1.093365  0.055856
foo  1.146990  1.138650  0.017516

This suggests that the pandas functions are called when I write g['A'].agg([np.mean, np.median, np.std]). The question is: why does this happen when I explicitly tell it to use the numpy functions?

1 answer:

Answer 0 (score: 2)

It seems that Pandas either replaces np.std in the .agg([np.mean, np.median, np.std]) call with the built-in Series.std() method, or calls np.std(series, ddof=1):

In [337]: g['A'].agg([np.mean, np.median, np.std, lambda x: np.std(x)])
Out[337]:
         mean    median       std  <lambda>
bar  1.098275  1.093365  0.064497  0.055856
foo  1.146990  1.138650  0.021452  0.017516
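Whether this substitution happens may depend on the installed pandas version, so it can be worth checking locally. The following is a rough sketch that rebuilds the frame from the question and reports which ddof the result of agg(np.std) agrees with. The names via_agg and matches are illustrative, and the warning filter is there only on the assumption that some versions warn about the substitution.

```python
import warnings

import numpy as np
import pandas as pd

arr = np.array([[1.17136, 1.11816],
                [1.13096, 1.04134],
                [1.13865, 1.03414],
                [1.09053, 0.96330],
                [1.02455, 0.94728],
                [1.18182, 1.04950],
                [1.09620, 1.06686]])
df = pd.DataFrame(arr, index=['foo'] * 3 + ['bar'] * 4, columns=['A', 'B'])
g = df.groupby(df.index)

with warnings.catch_warnings():
    # Assumption: some pandas versions emit a warning for this call.
    warnings.simplefilter("ignore")
    via_agg = g['A'].agg(np.std)

# Compare against numpy computed on the raw values with each ddof.
foo_values = df.loc['foo', 'A'].to_numpy()
matches = {
    ddof: bool(np.isclose(via_agg['foo'], np.std(foo_values, ddof=ddof)))
    for ddof in (0, 1)
}
print(pd.__version__, matches)
```

If the entry for ddof=1 is True, the installed version behaves as described in this answer. If the entry for ddof=0 is True, np.std is being applied as written.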

Note: np.std and lambda x: np.std(x) produce different results.

If we explicitly specify ddof=1 (the Pandas default), we get the same result:

In [338]: g['A'].agg([np.mean, np.median, np.std, lambda x: np.std(x, ddof=1)])
Out[338]:
         mean    median       std  <lambda>
bar  1.098275  1.093365  0.064497  0.064497
foo  1.146990  1.138650  0.021452  0.021452
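The effect of ddof can be reproduced by hand. The sketch below uses only the standard library and the three 'foo' values of column A from the question. The helper std_with_ddof is a hypothetical name written for illustration and is not part of numpy or pandas.

```python
import math

# Column A of the 'foo' group from the question.
foo_a = [1.17136, 1.13096, 1.13865]


def std_with_ddof(values, ddof):
    """Standard deviation using the divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - ddof))


population_std = std_with_ddof(foo_a, ddof=0)  # divisor N, the numpy default
sample_std = std_with_ddof(foo_a, ddof=1)      # divisor N - 1, the pandas default

print(round(population_std, 6), round(sample_std, 6))
```

The two values correspond to the 0.017516 and 0.021452 entries in the tables above, and they differ by a factor of sqrt(N / (N - 1)), which is sqrt(3 / 2) for this group.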

Using the built-in 'std' gives the same result:

In [341]: g['A'].agg([np.mean, np.median, 'std', lambda x: np.std(x, ddof=1)])
Out[341]:
         mean    median       std  <lambda>
bar  1.098275  1.093365  0.064497  0.064497
foo  1.146990  1.138650  0.021452  0.021452
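One way to avoid relying on how agg() treats np.std is to pass small named functions that state ddof explicitly. This is a sketch, not the only option. pop_std and sample_std are hypothetical helper names, and np.nanstd is used on the assumption that NaN values should be ignored, as in the original numpy code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': [1.17136, 1.13096, 1.13865, 1.09053, 1.02455, 1.18182, 1.09620]},
    index=['foo'] * 3 + ['bar'] * 4,
)


def pop_std(x):
    """Population standard deviation (ddof=0) of the raw values, ignoring NaN."""
    return np.nanstd(x.to_numpy(), ddof=0)


def sample_std(x):
    """Sample standard deviation (ddof=1) of the raw values, ignoring NaN."""
    return np.nanstd(x.to_numpy(), ddof=1)


# Strings select the pandas reductions; the named functions are applied as written.
result = df.groupby(df.index)['A'].agg(['mean', 'median', pop_std, sample_std])
print(result)
```

Because each function has a name, the output columns are labelled pop_std and sample_std instead of <lambda>, which makes the choice of divisor visible in the result.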

The second rule of the Zen of Python says it all:

In [340]: import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.  # <----------- NOTE !!!
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!