pandas left join and update existing column

Time: 2015-05-05 05:41:25

Tags: python pandas

I am new to pandas and cannot seem to get the merge function to do what I want:

>>> left       >>> right
   a  b   c       a  c   d 
0  1  4   9    0  1  7  13
1  2  5  10    1  2  8  14
2  3  6  11    2  3  9  15
3  4  7  12    

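With a left join on column a, I would like to update the common columns BY THE JOINED KEYS. Note that the last value in column c comes from the LEFT table, since there is no match.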

>>> final       
   a  b   c   d 
0  1  4   7   13
1  2  5   8   14
2  3  6   9   15
3  4  7   12  NAN 

How should I do this with the pandas merge function? Thank you.

6 answers:

Answer 0: (score: 18)

You can use merge() between left and right with how='left' on the 'a' column.

In [74]: final = left.merge(right, on='a', how='left')

In [75]: final
Out[75]:
   a  b  c_x  c_y   d
0  1  4    9    7  13
1  2  5   10    8  14
2  3  6   11    9  15
3  4  7   12  NaN NaN

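Replace the NaN values in c_y with the corresponding values from c_x. After that, drop the unneeded c_x and c_y columns to get the result.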

In [76]: final['c'] = final['c_y'].fillna(final['c_x'])

In [77]: final
Out[77]:
   a  b  c_x  c_y   d   c
0  1  4    9    7  13   7
1  2  5   10    8  14   8
2  3  6   11    9  15   9
3  4  7   12  NaN NaN  12
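
The steps above, including the final column drop, can be put together as a small self-contained sketch. The frames are rebuilt from the question; the last line that reorders the columns is my own addition, not part of the answer.

```python
import pandas as pd

# Frames as shown in the question
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Left join on "a"; the overlapping column "c" becomes c_x (left) and c_y (right)
final = left.merge(right, on="a", how="left")

# Prefer the value from right (c_y), fall back to left (c_x) where there was no match
final["c"] = final["c_y"].fillna(final["c_x"])

# Drop the helper columns and restore a readable column order (assumed order)
final = final.drop(columns=["c_x", "c_y"])[["a", "b", "c", "d"]]
print(final)
```

Because the unmatched row introduces NaN, the c and d columns may come out as floats.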

Answer 1: (score: 16)

One way to do this is to set the a column as the index and use update:

In [11]: left_a = left.set_index('a')

In [12]: right_a = right.set_index('a')

Note: update only does a left join (it does not merge), so in addition to set_index you also need to add the extra columns that are not present in left_a.

In [13]: res = left_a.reindex(columns=left_a.columns.union(right_a.columns))

In [14]: res.update(right_a)

In [15]: res.reset_index(inplace=True)

In [16]: res
Out[16]:
   a   b   c   d
0  1   4   7  13
1  2   5   8  14
2  3   6   9  15
3  4   7  12 NaN
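
To see why the reindex step matters, here is a rough sketch that runs update both without and with it. The frames are rebuilt from the question, and the variable name `without_reindex` is mine.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

left_a = left.set_index("a")
right_a = right.set_index("a")

# update alone only touches columns that already exist: "d" is not added
without_reindex = left_a.copy()
without_reindex.update(right_a)

# adding the missing columns first lets update fill them in as well
res = left_a.reindex(columns=left_a.columns.union(right_a.columns))
res.update(right_a)
res = res.reset_index()

print(without_reindex)
print(res)
```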

Answer 2: (score: 1)

Here is a way to do it using join:
In [632]: t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')

In [633]: t
Out[633]: 
   b   c  c_right   d
a                    
1  4   9        7  13
2  5  10        8  14
3  6  11        9  15
4  7  12      NaN NaN

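Now we want to fill the null values of c_right (which comes from the right dataframe) with the values of the c column from the left dataframe. The process below uses the method from @John Galt's answer: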
In [657]: t['c_right'] = t['c_right'].fillna(t['c'])

In [658]: t
Out[658]: 
   b   c  c_right   d
a                    
1  4   9        7  13
2  5  10        8  14
3  6  11        9  15
4  7  12       12 NaN

In [659]: t.drop('c_right', axis=1)
Out[659]: 
   b   c   d
a           
1  4   9  13
2  5  10  14
3  6  11  15
4  7  12 NaN
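
Note that dropping c_right as above leaves column c with the original values from left (9, 10, 11, 12). If the goal is the table from the question, one possible variant, sketched here under the assumption that the filled values should end up in c, writes the result into c before dropping the helper column:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Index-based left join; the overlapping column from right becomes c_right
t = left.set_index("a").join(right.set_index("a"), rsuffix="_right")

# Take right's value where it exists, otherwise keep left's, and store it in c
t["c"] = t["c_right"].fillna(t["c"])

# The helper column is no longer needed
t = t.drop(columns="c_right").reset_index()
print(t)
```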

Answer 3: (score: 0)

Another way is to use pd.merge like this:

 >>> import pandas as pd

 >>> final = pd.merge(right, left, 
                      how='outer',
                      left_index=True,
                      right_index=True,
                      on=('a', 'c')
                     ).sort_index(axis=1)

 >>> final       
    a  b   c   d 
 0  1  4   7   13.0
 1  2  5   8   14.0
 2  3  6   9   15.0
 3  4  7   12  NaN 

You can compute the intersection of the column names of the two DataFrames you want to update and pass it to the function's 'on=' parameter.
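
The intersection step could look something like the sketch below; the frames are rebuilt from the question and the name `common` is mine. As far as I know, recent pandas releases raise a MergeError when on= is combined with left_index/right_index, so the sketch covers only the computation of the column names.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Column names present in both frames
common = left.columns.intersection(right.columns).tolist()
print(common)
```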

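Unlike Zero's solution, it does not create unwanted columns that have to be dropped afterwards.

Edit: NaN values may turn integers into floats within the same column.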

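Answer 4: (score: 0)

DataFrame.update() is nice, but it does not let you specify the columns to join on and, more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-NaN values in the original DataFrame. To me, this is undesirable behavior.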
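
A tiny sketch of the NaN behavior described above; the frame and column names here are made up for illustration, and rows are aligned on the default index.

```python
import numpy as np
import pandas as pd

target = pd.DataFrame({"val": [10.0, 20.0]})
other = pd.DataFrame({"val": [np.nan, 99.0]})

# update aligns on the index and skips NaN values in `other`
target.update(other)
print(target)
```

The first row keeps its original value because the corresponding value in `other` is NaN; only the second row is overwritten.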

Here is a custom method I wrote to work around these issues. It is freshly written, so users beware.

join_insertion()

import numpy as np
import pandas as pd


def join_insertion(into_df, from_df, on, cols, mult='error'):
    """
    Suppose A and B are dataframes. A has columns {foo, bar, baz} and B has columns {foo, baz, buz}
    This function allows you to do an operation like:
    "where A and B match via the column foo, insert the values of baz and buz from B into A"
    Note that this'll update A's values for baz and it'll insert buz as a new column.
    This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!

    :param into_df: dataframe you want to modify
    :param from_df: dataframe with the values you want to insert
    :param cols: list of column names (values to insert)
    :param on: column name or list of column names (values to join on), or a dict of {into:from} column name pairs
    :param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
    an error can be raised, or the first matching value can be inserted, or the last matching value
    can be inserted
    :return: a modified copy of into_df, with updated values using from_df
    """

    # Infer left_on, right_on
    if (isinstance(on, dict)):
        left_on = list(on.keys())
        right_on = list(on.values())
    elif(isinstance(on, list)):
        left_on = on
        right_on = on
    elif(isinstance(on, str)):
        left_on = [on]
        right_on = [on]
    else:
        raise Exception("on should be a str, list, or dict")

    # Make cols a list if it isn't already
    if(isinstance(cols, str)):
        cols = [cols]

    # Setup
    A = into_df.copy()
    B = from_df[right_on + cols].copy()

    # Insert row ids
    A['_A_RowId_'] = np.arange(A.shape[0])
    B['_B_RowId_'] = np.arange(B.shape[0])

    A = pd.merge(
        left=A,
        right=B,
        how='left',
        left_on=left_on,
        right_on=right_on,
        suffixes=(None, '_y'),
        indicator=True
    ).sort_values(['_A_RowId_', '_B_RowId_'])

    # Check for rows of A which got duplicated by the merge, and then handle appropriately
    if(mult == 'error'):
        if(A.groupby('_A_RowId_').size().max() > 1):
            raise Exception("At least one key of into_df matched multiple rows of from_df.")
    elif(mult == 'first'):
        A = A.groupby('_A_RowId_').first().reset_index()
    elif(mult == 'last'):
        A = A.groupby('_A_RowId_').last().reset_index()

    mask = A._merge == 'both'
    cols_in_both = list(set(into_df.columns.to_list()).intersection(set(cols)))
    for col in cols_in_both:
        A.loc[mask, col] = A.loc[mask, col + '_y']

    # Drop unwanted columns
    A.drop(columns=list(set(A.columns).difference(set(into_df.columns.to_list() + cols))), inplace=True)

    return A

Example usage

into_df = pd.DataFrame({
    'foo': [1, 2, 3],
    'bar': [4, 5, 6],
    'baz': [7, 8, 9]
})
   foo  bar  baz
0    1    4    7
1    2    5    8
2    3    6    9

from_df = pd.DataFrame({
    'foo': [1, 3, 5, 7, 3],
    'baz': [70, 80, 90, 30, 40],
    'buz': [0, 1, 2, 3, 4]
})
   foo  baz  buz
0    1   70    0
1    3   80    1
2    5   90    2
3    7   30    3
4    3   40    4

# Use it!

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
  Exception: At least one key of into_df matched multiple rows of from_df.

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  80.0  1.0

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  40.0  4.0

By the way, this is one of the things I sorely miss from R's data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]

Answer 5: (score: 0)

Here is another way to do it, which uses combine_first().
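
A minimal sketch of how combine_first() could be applied to the question's frames. The set_index('a') step and the choice of right as the calling frame (so that its values take priority) are my assumptions.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Values from right win; gaps (missing rows or columns) are filled from left
res = (
    right.set_index("a")
    .combine_first(left.set_index("a"))
    .reset_index()
)
print(res)
```

The result covers the union of both indexes and both column sets, so the unmatched key 4 keeps c from left and gets NaN in d.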