Question

I want to concatenate two pandas DataFrames. Their time-series indexes may overlap, and their column keys may overlap as well.

For example:

old_close                               new_close
            1TM    ABL   ...                        ABL    ANG    ...
Date                                    Date
2009-06-05  100    564                  1990-06-08  120    2533
2009-06-04  102    585                  1990-06-05  121    2531
2009-06-03  101    532                  1990-06-04  123    2520
2009-06-02  99     540                  1990-06-03  122    2519
2009-06-01  99     542                  1990-06-02  121    2521
...

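I want to combine old_close and new_close into a new DataFrame that contains all the data from both DataFrames, but excludes duplicate values on both indices (rows and columns).

So far I have done this: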

merged_close = pd.concat([old_close, new_close], axis=1)

But this results in duplicated columns (and duplicated rows along axis 0) and a MultiIndex.

Answer 1

Assuming you want to "exclude all duplicate values on both indices", this should work:

import numpy as np
import pandas as pd
# Index labels that appear in exactly one of the two DataFrames
unique_indices = np.setdiff1d(np.union1d(old_close.index.to_list(), new_close.index.to_list()),
                              np.intersect1d(old_close.index.to_list(), new_close.index.to_list()))
merged_close = pd.concat([old_close, new_close]).loc[unique_indices]

Edit: updated the calculation of the unique indices. It now removes every index label that appears in both DataFrames.
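As a small illustration of this approach, here is a sketch on two toy frames. The dates and prices are invented for the example and are not taken from the question. It uses `Index.symmetric_difference`, which yields the same set of labels as the union-minus-intersection expression, and `.loc` for label-based selection.

```python
import pandas as pd

# Toy data; the dates and prices are made up for illustration.
old_close = pd.DataFrame(
    {"1TM": [100, 102], "ABL": [564, 585]},
    index=pd.to_datetime(["2009-06-05", "2009-06-04"]),
)
new_close = pd.DataFrame(
    {"ABL": [585, 532], "ANG": [2531, 2520]},
    index=pd.to_datetime(["2009-06-04", "2009-06-03"]),
)

# Dates present in exactly one of the two frames.
unique_indices = old_close.index.symmetric_difference(new_close.index)

# Stack the frames row-wise, then keep only the non-shared dates.
merged_close = pd.concat([old_close, new_close]).loc[unique_indices]
```

Note that the shared date (2009-06-04 in this toy data) is dropped from the result entirely, rather than kept once. If you want to keep one copy of each shared row, this is not the right tool.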

Answer 2

From the pandas documentation:

concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
       keys=None, levels=None, names=None, verify_integrity=False)

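verify_integrity: boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.

Have you tried setting that parameter to True?

Edit: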

Sorry, verify_integrity only raises an error if there are duplicates. In any case, you could take a look at the drop_duplicates() function.
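To make the point about verify_integrity concrete, here is a minimal sketch on invented toy data. It shows that `verify_integrity=True` raises a `ValueError` on overlapping dates, then drops the repeated dates afterwards. Because `DataFrame.drop_duplicates()` compares row values and ignores the index, the sketch filters with `Index.duplicated` instead. Keeping the first occurrence is an assumption about what you want.

```python
import pandas as pd

# Toy data; values are made up for illustration.
old_close = pd.DataFrame({"ABL": [564, 585]},
                         index=pd.to_datetime(["2009-06-05", "2009-06-04"]))
new_close = pd.DataFrame({"ABL": [585, 532]},
                         index=pd.to_datetime(["2009-06-04", "2009-06-03"]))

# verify_integrity=True does not remove anything; it raises on overlap.
try:
    pd.concat([old_close, new_close], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

# One way to drop repeated dates afterwards: keep the first row per date.
stacked = pd.concat([old_close, new_close])
deduped = stacked[~stacked.index.duplicated(keep="first")]
```

Pass `keep="last"` instead if the rows from new_close should win for dates that appear in both frames.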

PS: also take a look at this question:

python pandas remove duplicate columns
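The linked question is not reproduced here. As a rough sketch of one commonly used idiom for that problem, the following drops repeated column labels after a column-wise concat. The toy data is invented, and keeping the first occurrence of each label is an assumption.

```python
import pandas as pd

# Toy data sharing the same dates and the column "ABL".
idx = pd.to_datetime(["2009-06-05", "2009-06-04"])
old_close = pd.DataFrame({"1TM": [100, 102], "ABL": [564, 585]}, index=idx)
new_close = pd.DataFrame({"ABL": [564, 585], "ANG": [2533, 2531]}, index=idx)

# Column-wise concat repeats the shared label "ABL".
wide = pd.concat([old_close, new_close], axis=1)

# Keep the first occurrence of each column label.
merged_close = wide.loc[:, ~wide.columns.duplicated()]
```

This only checks column labels, not values, so it assumes the repeated columns carry the same data.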

Concatenating pandas DataFrames along a time-series index without duplicating columns