Using pandas merge, the resulting columns are confusing:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)))
df2 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)))
df2[0] = df1[0] # matching key on the first column.
# Now the weird part.
pd.merge(df1, df2, left_on=0, right_on=0).shape
Out[96]: (5, 9)
pd.merge(df1, df2, left_index=True, right_index=True).shape
Out[102]: (5, 10)
pd.merge(df1, df2, left_on=0, right_on=1).shape
Out[107]: (0, 11)
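The number of columns is not fixed, the column labels are not stable, and worse, none of this is clearly documented.
I want to read some columns from the resulting dataframe, which has a very large number of columns (hundreds). Currently I am using .iloc[] because there are too many labels to type out, but I am worried that this is error-prone given the odd merge results. What is the correct way to read specific columns from a merged dataframe?
Python: 2.7.13, Pandas: 0.19.2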
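Answer 0 (score: 1)
Merge keys
1.1 Merge on a key column when the join key is a column (this is the right solution for you, since you wrote "df2[0] = df1[0] # matching key on the first column").
1.2 Merge on the index when the merge key is the index ==> The reason you get one more column in the second merge (pd.merge(df1, df2, left_index=True, right_index=True).shape) is that the initial join key now appears twice, as '0_x' and '0_y'.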
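To see where the extra column comes from, here is a minimal sketch that prints the labels produced by both kinds of merge. It uses deterministic stand-in data instead of the question's random frames (an assumption, made so that the key values are unique), and the names on_key and on_index are illustrative only.

```python
import numpy as np
import pandas as pd

# Deterministic stand-ins for the question's frames (integer column labels 0..4).
df1 = pd.DataFrame(np.arange(25).reshape(5, 5))
df2 = pd.DataFrame(np.arange(100, 125).reshape(5, 5))
df2[0] = df1[0]  # shared key in the first column

# 1.1 Merge on the key column: the key is kept once, the other
# colliding labels are suffixed.
on_key = pd.merge(df1, df2, left_on=0, right_on=0)
print(on_key.shape)
print(list(on_key.columns))

# 1.2 Merge on the index: no column is used as the key, so every
# label collides and all of them are suffixed, including both copies of 0.
on_index = pd.merge(df1, df2, left_index=True, right_index=True)
print(on_index.shape)
print(list(on_index.columns))
```

Printing the labels rather than only the shape makes it easier to tell which of the two layouts a given merge produced.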
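About column names
Column names do not change during a merge unless both dataframes have a column with the same name. In that case the colliding columns are renamed as follows:
'initial_column_name' + '_x' (the suffix '_x' is added to the columns of the left dataframe (df1))
'initial_column_name' + '_y' (the suffix '_y' is added to the columns of the right dataframe (df2))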
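Since the suffixed names are predictable, one option is to pick your own suffixes through the suffixes argument of pd.merge and then select by label instead of by position. The following is only a sketch: the data is a deterministic stand-in, and the suffixes '_df1' and '_df2' are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in frames with the same integer labels (0..4) on both sides.
df1 = pd.DataFrame(np.arange(25).reshape(5, 5))
df2 = pd.DataFrame(np.arange(100, 125).reshape(5, 5))

# Replace the default '_x' / '_y' with suffixes that name the source frame.
merged = pd.merge(df1, df2, left_index=True, right_index=True,
                  suffixes=('_df1', '_df2'))

# Label-based selection: column 1 of df1 and column 3 of df2.
subset = merged[['1_df1', '3_df2']]
print(subset)
```

Selecting by label this way does not depend on where the columns end up in the merged frame.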
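Answer 1 (score: 0)
To handle the 3 different cases for the number of columns in the merged result, I ended up checking the column count and then converting the column position indexes for use in .iloc[]. Here is the code for future searchers.
This is the best way I know so far to deal with a huge number of columns. I will accept a better answer if one comes along.
Utility function to convert the column indexes: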
def get_merged_column_index(num_col_df, num_col_df1, num_col_df2, col_df1=(), col_df2=(), joinkey_df1=(), joinkey_df2=()):
    """Transform column indexes in the source dataframes to column indexes in the merged dataframe.

    Checks for the different pandas merge result formats.

    :param num_col_df: number of columns in merged dataframe df.
    :param num_col_df1: number of columns in df1.
    :param num_col_df2: number of columns in df2.
    :param col_df1: (list of int) column position in df1 to keep (0-based).
    :param col_df2: (list of int) column position in df2 to keep (0-based).
    :param joinkey_df1: (list of int) column position (0-based). Not implemented now.
    :param joinkey_df2: (list of int) column position (0-based). Not implemented now.
    :return: (list of int) transformed column indexes, 0-based, in merged dataframe.
    """
    # np.array copies its input, so the caller's sequences are never modified.
    col_df1 = np.array(col_df1, dtype=int)
    col_df2 = np.array(col_df2, dtype=int)
    if num_col_df == num_col_df1 + num_col_df2:
        # Merging keeps all of the old columns: df2's columns follow df1's.
        col_df2 += num_col_df1
    elif num_col_df == num_col_df1 + num_col_df2 + 1:
        # Merging adds a 'key_0' column at the head, shifting everything by one.
        col_df1 += 1
        col_df2 += num_col_df1 + 1
    elif num_col_df <= num_col_df1 + num_col_df2 - 1:
        # Merging deletes (possibly many) duplicated join-key columns from df2,
        # while keeping df1's columns in their original order.
        raise ValueError('Format of merged result is too complicated.')
    else:
        raise ValueError('Undefined format of merged result.')
    return np.concatenate((col_df1, col_df2)).astype(int).tolist()
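Then:
# df is the merged dataframe, e.g. the result of pd.merge(df1, df2, ...)
cols_toextract_df1 = []  # 0-based column positions to keep from df1
cols_toextract_df2 = []  # 0-based column positions to keep from df2
converted_cols = get_merged_column_index(num_col_df=df.shape[1], num_col_df1=df1.shape[1], num_col_df2=df2.shape[1], col_df1=cols_toextract_df1, col_df2=cols_toextract_df2)
extracted_df = df.iloc[:, converted_cols]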
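For comparison, here is a sketch of a label-based alternative (not the approach of this answer): give each frame's columns a distinct prefix before merging, so that no labels collide and none get renamed, then translate the wanted positions into labels. The prefixes 'l_' and 'r_', the variable names, and the deterministic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-ins for the question's frames.
df1 = pd.DataFrame(np.arange(25).reshape(5, 5))
df2 = pd.DataFrame(np.arange(100, 125).reshape(5, 5))
df2[0] = df1[0]  # shared key in the first column

# Prefix every label so the two frames share no column names.
left = df1.add_prefix('l_')    # 'l_0' .. 'l_4'
right = df2.add_prefix('r_')   # 'r_0' .. 'r_4'

merged = pd.merge(left, right, left_on='l_0', right_on='r_0')

# 0-based positions in the original frames, as in the answer above.
cols_toextract_df1 = [1, 2]
cols_toextract_df2 = [3]

# Translate positions to labels in the prefixed frames, then select by label.
wanted = ([left.columns[i] for i in cols_toextract_df1] +
          [right.columns[i] for i in cols_toextract_df2])
extracted = merged[wanted]
print(extracted)
```

Because the lookup goes from position to label before the merge, it does not have to guess how many columns the merge added or dropped.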