Question

The pandas docs give an example of concat that combines the indexes (axis=0) by joining along the columns (axis=1):

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

result = pd.concat([df1, df4], axis=1)

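Note that df1 and df4 share indexes 2 and 3 as well as columns B and D.

concat did not duplicate the shared indexes, but it did duplicate the columns.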
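For reference, a minimal sketch that reproduces the duplication described above. It only rebuilds the two frames from the question and prints the resulting labels; nothing beyond the question's own data is assumed.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

result = pd.concat([df1, df4], axis=1)

# The index is the union of both indexes, with no repeats.
print(list(result.index))    # [0, 1, 2, 3, 6, 7]
# The columns of df4 are appended as-is, so B and D appear twice.
print(list(result.columns))  # ['A', 'B', 'C', 'D', 'B', 'D', 'F']
```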

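How can I avoid the duplicate columns?

That is, I want result to have:

indexes 0, 1, 2, 3, 6, 7 and
columns A, B, C, D, F. (No duplicate columns!)

If any of the data conflicts, I want an exception to be raised.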

Answer 1

Try:

pd.merge(df1, df4, left_index=True, right_index=True, how='outer')

You may need to rename the columns to match what you expect.
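To make the renaming remark concrete, here is a rough sketch of what the merge returns for the question's data and one possible way to fold the suffixed pairs back into single columns. The `folded` frame and the use of `combine_first` are illustrative choices, not part of the answer, and this sketch does not check for conflicting values.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

merged = pd.merge(df1, df4, left_index=True, right_index=True, how='outer')

# Overlapping column names get the default suffixes _x (left) and _y (right).
print(list(merged.columns))  # ['A', 'B_x', 'C', 'D_x', 'B_y', 'D_y', 'F']

# Illustrative clean-up: take the left value and fall back to the right one.
folded = merged[['A', 'C', 'F']].copy()
for col in ['B', 'D']:
    folded[col] = merged[col + '_x'].combine_first(merged[col + '_y'])
folded = folded[['A', 'B', 'C', 'D', 'F']]
print(folded)
```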

Answer 2

result = df1.join(df4, rsuffix='_dup', how='outer')

# Check for data clashes between each column and its '_dup' twin
dup_cols = [c for c in result if c.endswith('_dup')]
for c in dup_cols:
    if (result[[c[:-4], c]].dropna().apply(pd.Series.nunique, axis=1) > 1).any():
        raise Exception("There are conflicts in column %s from two DataFrames" % c[:-4])

result.update(df4)

# Remove the duplicated columns, since their data has been put into the first occurrence of each column
result = result[[c for c in result if not c.endswith('_dup')]]

print(result)

     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2   F2
3   A3  B3   C3  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7
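The same steps can be wrapped in a function to see the conflict check fire. This is only a sketch: the name `join_no_dup`, the `suffix` parameter, the choice of `ValueError`, and the altered frame `df4_bad` are all hypothetical, introduced here for illustration.

```python
import pandas as pd

def join_no_dup(left, right, suffix='_dup'):
    """Hypothetical helper: outer-join on the index without duplicate columns."""
    result = left.join(right, rsuffix=suffix, how='outer')
    for dup in [c for c in result if c.endswith(suffix)]:
        base = dup[:-len(suffix)]
        # Only rows where both sides have a value can conflict.
        pair = result[[base, dup]].dropna()
        if (pair[base] != pair[dup]).any():
            raise ValueError("Conflicting values in column %s" % base)
    # Fill the gaps in the left-hand columns with the right-hand data.
    result.update(right)
    return result[[c for c in result if not c.endswith(suffix)]]

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

result = join_no_dup(df1, df4)
print(result)

# Made-up conflicting data: row 2 of column B now disagrees with df1.
df4_bad = df4.copy()
df4_bad.loc[2, 'B'] = 'XX'
conflict_raised = False
try:
    join_no_dup(df1, df4_bad)
except ValueError as err:
    conflict_raised = True
    print(err)
```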

Answer 3

I was essentially asking for an "upsert" (insert or update) operation. Here is one workable approach:

First, "insert" the rows that do not currently exist in df1:

# Add all rows from df4 that don't currently exist in df1
result = pd.concat([df1, df4[~df4.index.isin(df1.index)]])

Then, check for conflicts in the rows common to both DataFrames, which therefore must be updated:

# Obtain a sliced version of df1, showing only
# the columns and rows shared with df4.
# df1 is sliced here (not result) so that both slices carry
# identical labels, which the != comparison requires; this
# assumes the shared labels appear in the same order in both.
df1_sliced = \
    df1.loc[df1.index.isin(df4.index),
            df1.columns.isin(df4.columns)]
df4_sliced = \
    df4.loc[df4.index.isin(df1.index),
            df4.columns.isin(df1.columns)]

# Obtain a mask of the conflicts in the current segment
# as compared with all previously loaded data.  That is:
# NaN NaN = False
# NaN 2   = False
# 2   2   = False
# 2   3   = True
# 2   NaN = True
data_conflicts = (pd.notnull(df1_sliced) &
                  (df1_sliced != df4_sliced))

if data_conflicts.any().any():
    raise AssertionError("Data from this segment conflicted "
                         "with previously loaded data:\n",
                         data_conflicts)

Finally, perform the update:

# Replace any rows that do exist with the df4 version
result.update(df4)

The result is the same as Happy001's answer. I am not sure which is more efficient. Coming from a SQL background, I find my own answer easier to understand.
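The truth table in the comments of the conflict check can be tried in isolation. The sketch below uses two made-up single-column frames, `old` and `new`, whose rows mirror the five commented cases in order.

```python
import numpy as np
import pandas as pd

# Rows follow the commented cases: (NaN, NaN), (NaN, 2), (2, 2), (2, 3), (2, NaN)
old = pd.DataFrame({'x': [np.nan, np.nan, 2, 2, 2]})
new = pd.DataFrame({'x': [np.nan, 2, 2, 3, np.nan]})

# A cell conflicts only if the old value exists and differs from the new one.
mask = pd.notnull(old) & (old != new)
print(mask['x'].tolist())  # [False, False, False, True, True]
```

A missing old value never counts as a conflict, because there is nothing to overwrite; an existing old value set against a missing new value does.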

Pandas concat without duplicate indexes or columns