How to read the result of a pandas merge?

Date: 2017-02-27 09:44:28

Tags: python pandas join dataframe merge

When using pandas merge, the columns of the result are confusing:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)))

df2 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)))

df2[0] = df1[0]  # matching key on the first column.

# Now the weird part.
pd.merge(df1, df2, left_on=0, right_on=0).shape
Out[96]: (5, 9)
pd.merge(df1, df2, left_index=True, right_index=True).shape
Out[102]: (5, 10)
pd.merge(df1, df2, left_on=0, right_on=1).shape
Out[107]: (0, 11)

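The number of columns is not fixed, the column labels are not stable either, and worse, none of this is clearly documented.

I want to read some columns of the resulting dataframe, which has many columns (hundreds). Currently I am using .iloc[] because there are too many labels. But I worry that this is error-prone given the odd merge results. What is the correct way to read some columns of a merged dataframe?

Python: 2.7.13, Pandas: 0.19.2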

2 answers:

Answer 0: (score: 1)

Merge keys

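1.1 Merging on keys, when the join key is a column (this is the right solution in your case, since you wrote "df2[0] = df1[0]  # matching key on the first column.")

1.2 Merging on the index, when the merge key is the index ==> the reason you get 1 more column in the second merge (pd.merge(df1, df2, left_index=True, right_index=True).shape) is that the initial join key now appears twice, as '0_x' and '0_y'.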
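The difference between the two cases can be reproduced with deterministic data. The sketch below is illustrative and not part of the original answer; it assumes a recent pandas and uses fixed values instead of the question's random ones, so that the key column is guaranteed to have no duplicates.

```python
import numpy as np
import pandas as pd

# Fixed data so the key column (label 0) holds unique values.
df1 = pd.DataFrame(np.arange(25).reshape(5, 5))
df2 = df1 * 10
df2[0] = df1[0]  # same key values in column 0 of both frames

# Join key is a column present in both frames: it is kept once.
by_column = pd.merge(df1, df2, on=0)
print(by_column.shape)  # (5, 9)

# Join key is the index: every column of both frames is kept,
# so column 0 shows up twice, as '0_x' and '0_y'.
by_index = pd.merge(df1, df2, left_index=True, right_index=True)
print(by_index.shape)  # (5, 10)
print([c for c in by_index.columns if str(c).startswith("0")])  # ['0_x', '0_y']
```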

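About column names

During a merge, column names do not change unless both dataframes contain a column with the same name. Such columns are renamed as follows:

  • 'initial_column_name' + '_x' (the suffix '_x' is added to the columns from the left dataframe (df1))

  • 'initial_column_name' + '_y' (the suffix '_y' is added to the columns from the right dataframe (df2))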
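A small sketch of the suffix rule, using made-up frames with string labels (the names `left`, `right`, `key`, `value`, `only_left` and `only_right` are illustrative, not from the question). It also shows the `suffixes` parameter of `pd.merge`, which replaces the default '_x' and '_y'.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "value": [10, 20, 30], "only_left": [1, 1, 1]})
right = pd.DataFrame({"key": [1, 2, 3], "value": [100, 200, 300], "only_right": [2, 2, 2]})

# Only 'value' exists in both frames (besides the join key), so only it is suffixed.
merged = pd.merge(left, right, on="key")
print(list(merged.columns))
# ['key', 'value_x', 'only_left', 'value_y', 'only_right']

# The suffixes can be chosen explicitly to make the origin of each column obvious.
merged_named = pd.merge(left, right, on="key", suffixes=("_df1", "_df2"))
print(list(merged_named.columns))
# ['key', 'value_df1', 'only_left', 'value_df2', 'only_right']
```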

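Answer 1: (score: 0)

To handle the 3 different cases for the number of columns in a merge result, I ended up checking the number of columns and then converting the column position indexes for use in .iloc[]. Here is the code, for future searchers.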

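This is the best way I currently know of to handle a large number of columns. If I find a better one, I will post it.

Utility function to convert the column position indexes: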

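import numpy as np

# Tuples instead of lists as defaults, to avoid shared mutable default arguments.
def get_merged_column_index(num_col_df, num_col_df1, num_col_df2, col_df1=(), col_df2=(), joinkey_df1=(), joinkey_df2=()):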
    """Transform the column indexes in old source dataframes to column indexes in merged dataframe. Check for different pandas merged result formats.

    :param num_col_df: number of columns in merged dataframe df.
    :param num_col_df1: number of columns in df1.
    :param num_col_df2: number of columns in df2.
    :param col_df1: (list of int) column position in df1 to keep (0-based).
    :param col_df2: (list of int) column position in df2 to keep (0-based).
    :param joinkey_df1:  (list of int) column position (0-based). Not implemented now.
    :param joinkey_df2:  (list of int) column position (0-based). Not implemented now.
    :return: (list of int) transformed column indexes, 0-based, in merged dataframe.
    """

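    # Copy into integer arrays so the in-place offsets below never touch the caller's lists.
    col_df1 = np.array(col_df1, dtype=int)
    col_df2 = np.array(col_df2, dtype=int)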

    if num_col_df == num_col_df1 + num_col_df2: # merging keeps same old columns
        col_df2 += num_col_df1
    elif num_col_df == num_col_df1 + num_col_df2 + 1: # merging adds a 'key_0' column at the head
        col_df1 += 1
        col_df2 += num_col_df1 + 1
    elif num_col_df <= num_col_df1 + num_col_df2 - 1: # merging deletes (possibly many) duplicated "join-key" columns in df2, keep and do not change order columns in df1.
        raise ValueError('Format of merged result is too complicated.')
    else:
        raise ValueError('Undefined format of merged result.')

    return np.concatenate((col_df1, col_df2)).astype(int).tolist()

Then:

cols_toextract_df1 = []
cols_toextract_df2 = []
# df is the merged dataframe, e.g. df = pd.merge(df1, df2, ...)
converted_cols = get_merged_column_index(num_col_df=df.shape[1], num_col_df1=df1.shape[1], num_col_df2=df2.shape[1], col_df1=cols_toextract_df1, col_df2=cols_toextract_df2)
extracted_df = df.iloc[:, converted_cols]
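As a possible alternative to translating positions (not the approach of the answer above, just a sketch): if every column is given a unique label before merging, the merge has no reason to add suffixes, and the result can be read by label. The prefixes "l_" and "r_" are hypothetical names chosen for illustration, and the data is fixed rather than random.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(25).reshape(5, 5))
df2 = pd.DataFrame(np.arange(25).reshape(5, 5) * 10)
df2[0] = df1[0]  # matching key on the first column

# Hypothetical prefixes: after this, no label is shared between the frames.
left = df1.add_prefix("l_")
right = df2.add_prefix("r_")

merged = pd.merge(left, right, left_on="l_0", right_on="r_0")
print(merged.shape)  # (5, 10)

# Select by label instead of by position.
extracted = merged[["l_1", "l_3", "r_2"]]
print(list(extracted.columns))  # ['l_1', 'l_3', 'r_2']
```

With unique labels the column count is simply the sum of both frames' columns, whichever columns are used as keys, so there is no position arithmetic to maintain.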