The pandas docs give an example of concat in which the indexes are combined (axis=0) while the frames are joined along the columns (axis=1):
In [1]: df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
   ...:                     'B': ['B0', 'B1', 'B2', 'B3'],
   ...:                     'C': ['C0', 'C1', 'C2', 'C3'],
   ...:                     'D': ['D0', 'D1', 'D2', 'D3']},
   ...:                    index=[0, 1, 2, 3])
   ...:
In [2]: df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
   ...:                     'D': ['D2', 'D3', 'D6', 'D7'],
   ...:                     'F': ['F2', 'F3', 'F6', 'F7']},
   ...:                    index=[2, 3, 6, 7])
   ...:
In [3]: result = pd.concat([df1, df4], axis=1)
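Note that df1 and df4 share the index labels 2 and 3, as well as the columns B and D. concat did not duplicate the shared index labels, but it did duplicate the columns.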
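To make the duplication concrete, here is a small check of the labels that concat produces for the two frames above. It is only a sketch that rebuilds the frames from the question so it can run on its own.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

result = pd.concat([df1, df4], axis=1)

# Row labels are unioned, so 2 and 3 each appear once.
print(list(result.index))
# Column labels are simply appended, so B and D each appear twice.
print(list(result.columns))
# Flags whether any column label is repeated.
print(result.columns.duplicated().any())
```

The column list contains B and D twice, which is exactly the duplication described above.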
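How can I avoid the duplicated columns?
That is, I want result to have the rows 0, 1, 2, 3, 6, 7 and the columns A, B, C, D, F. (No duplicated columns!) If any data conflicts, I want an exception to be raised.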
Answer 0 (score: 0)
Try:
pandas.merge(df1, df4, left_index = True, right_index = True, how = 'outer')
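Because B and D exist in both frames, merge returns them with _x and _y suffixes, so you may need to rename or combine those columns to get the result you expect.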
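As a rough illustration of that follow-up step, the sketch below shows the suffixed columns that the outer merge returns and one possible way to collapse each pair with Series.combine_first. Note the assumption: the left value silently wins when both sides are present, so this does not raise on conflicting data, which the question also asks for.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

merged = pd.merge(df1, df4, left_index=True, right_index=True, how='outer')
merged_columns = list(merged.columns)
print(merged_columns)  # overlapping columns carry _x / _y suffixes

# Collapse each _x/_y pair into a single column: keep the left value
# and fill its gaps from the right value (no conflict check here).
for col in df1.columns.intersection(df4.columns):
    merged[col] = merged.pop(col + '_x').combine_first(merged.pop(col + '_y'))

# The collapsed columns were appended at the end, so restore a stable order.
merged = merged[sorted(merged.columns)]
print(merged)
```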
Answer 1 (score: 0)
result = df1.join(df4, rsuffix='_dup', how='outer')
# Check for data clashes between each shared column and its '_dup' twin
dup_cols = [c for c in result if c.endswith('_dup')]
for c in dup_cols:
    if (result[[c[:-4], c]].dropna().apply(pd.Series.nunique, axis=1) > 1).any():
        raise Exception("There are conflicts in column %s from two DataFrames" % c[:-4])
result.update(df4)
# Remove the duplicated columns, since their data has been put into the first occurrence of each column
result = result[[c for c in result if not c.endswith('_dup')]]
print(result)
     A   B    C   D    F
0   A0  B0   C0  D0  NaN
1   A1  B1   C1  D1  NaN
2   A2  B2   C2  D2   F2
3   A3  B3   C3  D3   F3
6  NaN  B6  NaN  D6   F6
7  NaN  B7  NaN  D7   F7
Answer 2 (score: 0)
What I was asking for is essentially an "upsert" (insert or update) operation. Here is one approach that works:
First, "insert" the rows that do not currently exist in df1:
# Add all rows from df4 that don't currently exist in df1
result = pd.concat([df1, df4[~df4.index.isin(df1.index)]])
Then, check for conflicts in the rows that both DataFrames share and that therefore must be updated:
# Obtain sliced versions of df1 and df4, showing only
# the rows and columns that the two frames share.
# Slice df1 itself (not result) so that both slices carry the
# same labels; the comparison below needs identically labeled
# frames, with the shared labels in the same order.
df1_sliced = \
    df1.loc[df1.index.isin(df4.index),
            df1.columns.isin(df4.columns)]
df4_sliced = \
    df4.loc[df4.index.isin(df1.index),
            df4.columns.isin(df1.columns)]
# Obtain a mask of the conflicts in the current segment
# as compared with all previously loaded data. That is:
# NaN NaN = False
# NaN 2   = False
# 2   2   = False
# 2   3   = True
# 2   NaN = True
data_conflicts = (pd.notnull(df1_sliced) &
                  (df1_sliced != df4_sliced))
if data_conflicts.any().any():
    raise AssertionError("Data from this segment conflicted "
                         "with previously loaded data:\n%s"
                         % data_conflicts)
Finally, perform the update:
# Replace any rows that do exist with the df4 version
result.update(df4)
The result is the same as Happy001's answer. I am not sure which is more efficient. Coming from a SQL background, I find my answer easier to understand.
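The insert, check, and update steps can also be folded into one helper. The sketch below is an assumption-laden variant, not the code from this answer: upsert_strict is a hypothetical name, it assumes unique index and column labels, it uses the same conflict rule as the mask above (a non-null left value that differs from the right value), and it swaps the concat plus update pair for DataFrame.combine_first.

```python
import pandas as pd


def upsert_strict(left, right):
    """Hypothetical helper: combine two frames, raising on conflicting cells."""
    # Labels present in both frames, taken in the same order for both slices
    # so that the element-wise comparison sees identically labeled frames.
    shared_rows = left.index.intersection(right.index)
    shared_cols = left.columns.intersection(right.columns)
    left_part = left.loc[shared_rows, shared_cols]
    right_part = right.loc[shared_rows, shared_cols]

    # Conflict: the left cell holds a value and the right cell differs from it.
    conflicts = left_part.notna() & (left_part != right_part)
    if conflicts.any().any():
        raise ValueError("Conflicting values found:\n%s" % conflicts)

    # Keep left values and fill everything else from right.
    return left.combine_first(right)


df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])

result = upsert_strict(df1, df4)
print(result)

# A frame that disagrees with df1 at row 2, column B should be rejected.
df_bad = df4.copy()
df_bad.loc[2, 'B'] = 'XX'
try:
    upsert_strict(df1, df_bad)
except ValueError as exc:
    print("rejected:", str(exc).splitlines()[0])
```

Whether this is faster than the join or concat based versions has not been measured here; the point is only that the conflict rule lives in one place.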