Pandas concat without duplicating indices or columns

Time: 2015-11-29 04:22:58

Tags: python pandas

The pandas docs give an example of concat that, instead of combining along the index (axis=0), joins along the columns (axis=1):

In [1]: df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
   ...:                     'B': ['B0', 'B1', 'B2', 'B3'],
   ...:                     'C': ['C0', 'C1', 'C2', 'C3'],
   ...:                     'D': ['D0', 'D1', 'D2', 'D3']},
   ...:                     index=[0, 1, 2, 3])
   ...: 
In [2]: df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
   ...:                  'D': ['D2', 'D3', 'D6', 'D7'],
   ...:                  'F': ['F2', 'F3', 'F6', 'F7']},
   ...:                 index=[2, 3, 6, 7])
   ...: 

In [3]: result = pd.concat([df1, df4], axis=1)


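Note that df1 and df4 share indices 2 and 3, as well as columns B and D.

concat did not duplicate the shared indices, but it did duplicate the shared columns.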
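
The duplication is easy to confirm by inspecting the labels of result. A minimal sketch, rebuilding df1 and df4 from the question so that it runs on its own:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

result = pd.concat([df1, df4], axis=1)

# The index is the union of both indices, each label appearing once
print(result.index.tolist())    # [0, 1, 2, 3, 6, 7]
# The columns are simply placed side by side, so B and D appear twice
print(result.columns.tolist())  # ['A', 'B', 'C', 'D', 'B', 'D', 'F']
```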

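How can I avoid duplicating the columns?

That is, I want result to have:

  • indices 0, 1, 2, 3, 6, 7
  • columns A, B, C, D, F (no duplicated columns!)

If any of the data conflicts, I want an exception to be raised.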

3 answers:

Answer 0 (score: 0)

Try:

pandas.merge(df1, df4, left_index = True, right_index = True, how = 'outer')

You may need to rename the columns to match what you expect.
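
To see why renaming may be needed: with its default suffixes, merge keeps both copies of every shared column and labels them _x and _y. A minimal sketch on the frames from the question (rebuilt here so that it runs on its own):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

merged = pd.merge(df1, df4, left_index=True, right_index=True, how='outer')

# Shared columns are kept twice, distinguished by the default suffixes
print(merged.columns.tolist())
# ['A', 'B_x', 'C', 'D_x', 'B_y', 'D_y', 'F']
```

On its own, then, merge neither collapses B and D into single columns nor checks them for conflicting values; both steps still have to be done afterwards.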

Answer 1 (score: 0)

result = df1.join(df4, rsuffix='_dup', how='outer')

#check data clashes
dup_cols = [c for c in result if c.endswith('_dup')]
for c in dup_cols:
    if (result[[c[:-4], c]].dropna().apply(pd.Series.nunique, axis=1) > 1).any():
        raise Exception("There are conflicts in column %s from two DataFrames" % c[:-4])

result.update(df4)

# remove the duplicated columns, since their data has been merged into the first occurrence of each column
result = result[[c for c in result if not c.endswith('_dup')]]

print(result)

     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2   F2
3   A3  B3   C3  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7
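
The same idea can be packaged as a reusable function. The sketch below is one possible arrangement, not the answerer's code: the name join_without_duplicates and its suffix parameter are made up for illustration, it fills gaps with fillna rather than update, and it assumes that no real column name already ends in the suffix.

```python
import pandas as pd

def join_without_duplicates(left, right, suffix="_dup"):
    """Outer-join on the index, collapse shared columns, raise on conflicts.

    Hypothetical helper; assumes no column in `left` or `right`
    already ends with `suffix`.
    """
    joined = left.join(right, rsuffix=suffix, how="outer")
    dup_cols = [c for c in joined.columns if c.endswith(suffix)]
    for dup in dup_cols:
        base = dup[:-len(suffix)]
        # Compare only the rows where both frames supplied a value
        pair = joined[[base, dup]].dropna()
        if (pair[base] != pair[dup]).any():
            raise ValueError("Conflicting values in column %r" % base)
        # Fill the gaps in the left-hand column from the right-hand copy
        joined[base] = joined[base].fillna(joined[dup])
    return joined.drop(columns=dup_cols)

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

result = join_without_duplicates(df1, df4)
print(result)

# A deliberately conflicting copy of df4: index 2 now disagrees with df1 in B
bad = df4.copy()
bad.loc[2, 'B'] = 'CLASH'
try:
    join_without_duplicates(df1, bad)
except ValueError as err:
    print("raised:", err)
```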

Answer 2 (score: 0)

I was essentially asking for an "upsert" (insert, update) operation. Here is one workable approach:

First, "insert" the rows that do not currently exist in df1:

# Add all rows from df4 that don't currently exist in df1
result = pd.concat([df1, df4[~df4.index.isin(df1.index)]])

Then, check for conflicts in the rows common to both DataFrames, which are the rows that have to be updated:

# Obtain sliced versions of df1 and df4, showing only
# the rows and columns that the two share
df1_sliced = \
    df1.loc[df1.index.isin(df4.index),
            df1.columns.isin(df4.columns)]
df4_sliced = \
    df4.loc[df4.index.isin(df1.index),
            df4.columns.isin(df1.columns)]

# Obtain a mask of the conflicts between df4 and the data
# already in df1. That is (df1 value, df4 value):
# NaN NaN = False
# NaN 2   = False
# 2   2   = False
# 2   3   = True
# 2   NaN = True
data_conflicts = (pd.notnull(df1_sliced) &
                  (df1_sliced != df4_sliced))

if data_conflicts.any().any():
    raise AssertionError("Data from this segment conflicted "
                         "with previously loaded data:\n",
                         data_conflicts)

Finally, perform the update:

# Replace any rows that do exist with the df4 version
result.update(df4)

The result is the same as Happy001's answer. I'm not sure which is more efficient. Coming from a SQL background, I find my answer easier to understand.
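
For reference, the insert, conflict check, and update steps can be run end to end as a single function. This is only a sketch under stated assumptions: the name upsert is hypothetical, the conflict check is done before any data is combined, and Index.intersection is used instead of isin masks so that both slices carry exactly the same labels in the same order.

```python
import pandas as pd

def upsert(base, new):
    """Insert rows of `new` missing from `base`, update shared rows,
    and raise if a non-null value in `base` disagrees with `new`.

    Hypothetical helper written for illustration.
    """
    # Labels present in both frames; both slices use the same label order
    shared_rows = base.index.intersection(new.index)
    shared_cols = base.columns.intersection(new.columns)
    old_vals = base.loc[shared_rows, shared_cols]
    new_vals = new.loc[shared_rows, shared_cols]

    # A conflict is a non-null existing value that differs from the new one
    conflicts = old_vals.notnull() & (old_vals != new_vals)
    if conflicts.any().any():
        raise AssertionError("Conflicting data:\n%s" % conflicts)

    # Insert: append the rows that exist only in `new`
    result = pd.concat([base, new[~new.index.isin(base.index)]])
    # Update: copy the non-null values of `new` into the shared rows
    result.update(new)
    return result

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

result = upsert(df1, df4)
print(result)

# A deliberately conflicting copy of df4: index 3 now disagrees with df1 in D
bad = df4.copy()
bad.loc[3, 'D'] = 'CLASH'
try:
    upsert(df1, bad)
except AssertionError as err:
    print("raised:", err)
```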