Question

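I would like to perform the following task:

Given two pandas DataFrames, each with a single column but of different lengths, create a new DataFrame whose index is the union of the other two indexes and which has two columns: one indicating whether DataFrame 1 contains a value for that index label, and one indicating whether DataFrame 2 contains a value for that index label.

I have the following sample data: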

import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2017', periods=365, freq='D')
rng2 = pd.date_range('1/1/2016', periods=730, freq='D')
x1 = np.random.randn(365)
x2 = np.random.randn(730)
df1 = pd.DataFrame({'x': x1}, index=rng)
df2 = pd.DataFrame({'x': x2}, index=rng2)

I can get the union of the two indexes with:

idx = df1.index.union(df2.index)

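Now I would like to create a new DataFrame df3 whose index is idx and which is populated with two columns of 1s and 0s, as described above.

I have explored using .isin(), but as far as I can tell it requires knowing something about the DataFrames in advance, and I would like a more flexible approach.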

Answer 1

An outer join followed by a notnull() test gives the desired behavior. With your sample data, it looks like:

notnull = df1.join(df2.rename(columns={'x': 'x2'}), how='outer').notnull()

Sample data:

rng1 = pd.date_range('1/2/2017', periods=4, freq='D')
rng2 = pd.date_range('1/1/2017', periods=4, freq='D')
x = np.random.randn(4)
df1 = pd.DataFrame({'x': x}, index=rng1)
df2 = pd.DataFrame({'x': x}, index=rng2)

Test:

notnull = df1.join(df2.rename(columns={'x': 'x2'}), how='outer').notnull()
print(notnull)

Output:

                x     x2
2017-01-01  False   True
2017-01-02   True   True
2017-01-03   True   True
2017-01-04   True   True
2017-01-05   True  False
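
One caveat worth keeping in mind: notnull() tests the stored values, not index membership, so a label whose value is already NaN in one frame is reported as absent from it. A minimal sketch of that behavior, assuming two small hypothetical frames (the names a, b and flags are made up for illustration and are not part of the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data: `a` contains the label 2017-01-02, but its value is NaN.
a = pd.DataFrame({'x': [1.0, np.nan]},
                 index=pd.date_range('2017-01-01', periods=2, freq='D'))
b = pd.DataFrame({'x': [3.0, 4.0]},
                 index=pd.date_range('2017-01-02', periods=2, freq='D'))

# Same outer join + notnull() pattern as in the answer.
flags = a.join(b.rename(columns={'x': 'x2'}), how='outer').notnull()
print(flags)

# The label is present in a.index, yet the flag for column 'x' is False
# because the value stored at that label is missing.
label = pd.Timestamp('2017-01-02')
print(label in a.index, flags.loc[label, 'x'])
```

If the input columns can never contain NaN, as with the random data in the question, this distinction does not matter.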

Update from the comments:

If you want actual 1s and 0s instead of bools:

ones_and_zeros = df1.join(df2.rename(columns={'x': 'x2'}),
                          how='outer').notnull().astype(np.uint8)
print(ones_and_zeros)

Output:

            x  x2
2017-01-01  0   1
2017-01-02  1   1
2017-01-03  1   1
2017-01-04  1   1
2017-01-05  1   0
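
For comparison, the union index from the question can also be tested for membership directly with Index.isin, which looks only at index labels and not at the stored values. This is an alternative sketch rather than the method used in the answer above; the column names in_df1 and in_df2 are hypothetical, chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Same small sample data as in the answer.
rng1 = pd.date_range('1/2/2017', periods=4, freq='D')
rng2 = pd.date_range('1/1/2017', periods=4, freq='D')
x = np.random.randn(4)
df1 = pd.DataFrame({'x': x}, index=rng1)
df2 = pd.DataFrame({'x': x}, index=rng2)

# Union of both indexes, as in the question.
idx = df1.index.union(df2.index)

# Index.isin returns a boolean array aligned with idx;
# astype(np.uint8) converts True/False to 1/0.
df3 = pd.DataFrame({'in_df1': idx.isin(df1.index),
                    'in_df2': idx.isin(df2.index)},
                   index=idx).astype(np.uint8)
print(df3)
```

Nothing about the frames needs to be known in advance beyond their indexes, and no column renaming is required.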
