Question

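I've run into a strange issue (or is it intended?) where combine_first or update causes values stored as bool to be upcast to float64 when the argument supplied does not include the boolean columns.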

Example workflow in ipython:

In [144]: test = pd.DataFrame([[1,2,False,True],[4,5,True,False]], columns=['a','b','isBool', 'isBool2'])

In [145]: test
Out[145]:
   a  b isBool isBool2
0  1  2  False    True
1  4  5   True   False


In [147]: b = pd.DataFrame([[45,45]], index=[0], columns=['a','b'])

In [148]: b
Out[148]:
    a   b
0  45  45

In [149]: test.update(b)

In [150]: test
Out[150]:
    a   b  isBool  isBool2
0  45  45       0        1
1   4   5       1        0

Is this the expected behavior of the update function? I would have thought that update leaves columns alone when they are not specified.

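EDIT: I started tinkering a bit more, and the plot thickens. If I insert one more command, test.update([]), before running test.update(b), the boolean columns behave as expected, at the cost of the numeric columns being upcast to object. This also applies to DSM's simplified example.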

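Based on the pandas source code, it looks like reindex_like on the empty argument creates a DataFrame of dtype object, while reindex_like on b creates a DataFrame of dtype float64. Since object is more general, the subsequent operations work with bools. Unfortunately, running np.log on the numeric columns then fails with an AttributeError.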

Answer 1

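Before the update is applied, the data frame b is filled out by reindex_like (here a presumably stands for the question's test frame), so b becomes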

In [5]: b.reindex_like(a)
Out[5]: 
    a   b  isBool  isBool2
0  45  45     NaN      NaN
1 NaN NaN     NaN      NaN
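The reindexing step can be reproduced on its own. This is a minimal sketch that assumes the frames from the question (test and b); the name filled is only illustrative. It prints the dtypes rather than asserting them, since the exact result may depend on the pandas version.

```python
import pandas as pd

# Frames from the question.
test = pd.DataFrame([[1, 2, False, True], [4, 5, True, False]],
                    columns=['a', 'b', 'isBool', 'isBool2'])
b = pd.DataFrame([[45, 45]], index=[0], columns=['a', 'b'])

# Align b to the index and columns of test; cells that b does not
# provide are filled with NaN.
filled = b.reindex_like(test)

print(filled)
print(filled.dtypes)  # inspect which dtypes the NaN-filled columns received
```

The columns that b never had contain nothing but NaN, so they can no longer be boolean.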

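numpy.where is then used to update the data frame.

The catch is that when its two inputs have different types, numpy.where returns the more general one. For example: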

In [20]: np.where(True, [True], [0])
Out[20]: array([1])

In [21]: np.where(True, [True], [1.0])
Out[21]: array([ 1.])

Since NaN is a float in numpy, the result is a float here as well.

In [22]: np.where(True, [True], [np.nan])
Out[22]: array([ 1.])

So after the update, your 'isBool' and 'isBool2' columns become float columns.
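The promotion can be seen column by column. The following is a rough sketch of the idea rather than the actual pandas internals: this, that and mask are illustrative names, standing for an existing bool column, the all-NaN column produced by reindexing, and the positions where there is nothing to update.

```python
import numpy as np

this = np.array([False, True])       # existing bool column
that = np.array([np.nan, np.nan])    # reindexed "other" column: all NaN
mask = np.isnan(that)                # True where other has no value

# Keep `this` wherever `that` is NaN. The values come through
# unchanged, but the result dtype is the common type of both inputs.
merged = np.where(mask, this, that)

# np.result_type reports the common dtype numpy promotes to.
promoted = np.result_type(this.dtype, that.dtype)

print(merged, merged.dtype)
print(promoted)
```

Even though every value is taken from the bool array, the NaN array alone is enough to force a float result.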

I have filed this on the issue tracker for pandas.

Answer 2

This is a bug: update should not touch columns that are not specified. It is fixed here: https://github.com/pydata/pandas/pull/3021
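On a pandas version without that fix, one possible workaround (a sketch, not something proposed in the thread) is to record which columns are boolean before calling update and cast them back afterwards. The variable bool_cols is an illustrative name.

```python
import pandas as pd

# Frames from the question.
test = pd.DataFrame([[1, 2, False, True], [4, 5, True, False]],
                    columns=['a', 'b', 'isBool', 'isBool2'])
b = pd.DataFrame([[45, 45]], index=[0], columns=['a', 'b'])

# Remember which columns were boolean before the update.
bool_cols = test.select_dtypes(include='bool').columns

test.update(b)  # modifies test in place

# Cast back explicitly; this changes nothing if update already
# left the boolean columns alone.
test[bool_cols] = test[bool_cols].astype(bool)

print(test)
print(test.dtypes)
```

The cast is only safe because these columns never receive NaN from b; a column that could pick up missing values would need different handling.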

pandas DataFrame combine_first and update methods have strange behavior
