I'm new to pandas and can't seem to get the merge function to do what I want:
>>> left
   a  b   c
0  1  4   9
1  2  5  10
2  3  6  11
3  4  7  12
>>> right
   a  c   d
0  1  7  13
1  2  8  14
2  3  9  15
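I'd like to do a left join on column a and update the common columns BY THE JOINED KEYS. Note that the last value in column c comes from the LEFT table, since there is no match for it.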
>>> final
   a  b   c    d
0  1  4   7   13
1  2  5   8   14
2  3  6   9   15
3  4  7  12  NAN
How should I do this with the pandas merge function? Thanks.
Answer 0 (score: 18)
You can use merge() between left and right with how='left' on column 'a'.
In [74]: final = left.merge(right, on='a', how='left')
In [75]: final
Out[75]:
   a  b  c_x  c_y    d
0  1  4    9    7   13
1  2  5   10    8   14
2  3  6   11    9   15
3  4  7   12  NaN  NaN
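Fill the NaN values in c_y with the corresponding c_x values and store the result in a new column c. After that, drop the columns you no longer need (c_x and c_y) to get the result.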
In [76]: final['c'] = final['c_y'].fillna(final['c_x'])
In [77]: final
Out[77]:
   a  b  c_x  c_y    d   c
0  1  4    9    7   13   7
1  2  5   10    8   14   8
2  3  6   11    9   15   9
3  4  7   12  NaN  NaN  12
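The transcript above stops before the drop step. A minimal end-to-end sketch of the same idea, rebuilding left and right from the question so it runs on its own (the variable name merged and the final column reordering are assumptions for illustration, not part of the answer):

```python
import pandas as pd

# Frames from the question
left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Left join on 'a'; the overlapping column 'c' becomes c_x (left) and c_y (right)
merged = left.merge(right, on="a", how="left")

# Prefer the value from right (c_y), fall back to left (c_x) where there was no match
merged["c"] = merged["c_y"].fillna(merged["c_x"])

# Drop the helper columns and restore the column order asked for in the question
final = merged.drop(columns=["c_x", "c_y"])[["a", "b", "c", "d"]]
print(final)
```

Because the unmatched row introduces NaN, c and d come back as floats rather than integers.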
Answer 1 (score: 16)
One way is to set column a as the index and use update:
In [11]: left_a = left.set_index('a')
In [12]: right_a = right.set_index('a')
Note: update only does a left join (it does not merge), so in addition to set_index you also need to add the extra columns that don't exist in left_a.
In [13]: res = left_a.reindex(columns=left_a.columns.union(right_a.columns))
In [14]: res.update(right_a)
In [15]: res.reset_index(inplace=True)
In [16]: res
Out[16]:
   a  b   c    d
0  1  4   7   13
1  2  5   8   14
2  3  6   9   15
3  4  7  12  NaN
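To see why the reindex step matters, here is a small sketch comparing update with and without it. It rebuilds the question's frames so it is self-contained; the names no_reindex and with_reindex are made up for the illustration:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})
left_a = left.set_index("a")
right_a = right.set_index("a")

# Without reindex: update only touches columns that already exist in the caller,
# so 'c' is updated but 'd' is silently ignored
no_reindex = left_a.copy()
no_reindex.update(right_a)
print(list(no_reindex.columns))

# With reindex: an empty 'd' column is created first, so update can fill it
with_reindex = left_a.reindex(columns=left_a.columns.union(right_a.columns))
with_reindex.update(right_a)
print(list(with_reindex.columns))
```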
Answer 2 (score: 1)
Here's a way to do it using join:
In [632]: t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')
In [633]: t
Out[633]:
   b   c  c_right    d
a
1  4   9        7   13
2  5  10        8   14
3  6  11        9   15
4  7  12      NaN  NaN
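Now we want to fill the null values of c_right (which comes from the right dataframe) with the values of column c from the left dataframe. This uses the approach from @John Galt's answer: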
In [657]: t['c_right'] = t['c_right'].fillna(t['c'])
In [658]: t
Out[658]:
   b   c  c_right    d
a
1  4   9        7   13
2  5  10        8   14
3  6  11        9   15
4  7  12       12  NaN
In [659]: t.drop('c_right', axis=1)
Out[659]:
   b   c    d
a
1  4   9   13
2  5  10   14
3  6  11   15
4  7  12  NaN
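Note that dropping c_right, as in the last step above, leaves the original c values from left (9, 10, 11, 12). If the goal is the updated column from the question, one option is to drop c and rename c_right instead. A hedged sketch (the name result is an assumption for illustration):

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Index-based left join; the overlapping column from right gets the '_right' suffix
t = left.set_index("a").join(right.set_index("a"), rsuffix="_right")

# Fall back to left's value where right had no matching key
t["c_right"] = t["c_right"].fillna(t["c"])

# Keep the filled column: drop the old 'c' and rename 'c_right' to 'c'
result = t.drop(columns="c").rename(columns={"c_right": "c"})
print(result)
```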
Answer 3 (score: 0)
Another way is to use pd.merge like this:
>>> import pandas as pd
>>> final = pd.merge(right, left,
...                  how='outer',
...                  left_index=True,
...                  right_index=True,
...                  on=('a', 'c')
...                  ).sort_index(axis=1)
>>> final
   a  b   c     d
0  1  4   7  13.0
1  2  5   8  14.0
2  3  6   9  15.0
3  4  7  12   NaN
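You can compute the intersection of the two DataFrames' column names (the columns to update) and pass it to the function's on= parameter.
Unlike Zero's solution, it doesn't create unwanted columns that you then have to drop.
Edit: NaN values may turn the integers in the same column into floats.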
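The column-name intersection mentioned above can be computed with Index.intersection, and the integer-to-float change is easy to observe. A small sketch, rebuilding the question's frames so it runs on its own (common and merged are illustrative names; the plain left merge is only used here to show the dtype effect):

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

# Column names present in both frames
common = left.columns.intersection(right.columns).tolist()
print(common)

# A left merge leaves row a=4 without a match, so 'd' gets a NaN
# and the whole column is upcast from integer to float
merged = left.merge(right, on="a", how="left")
print(right["d"].dtype, merged["d"].dtype)
```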
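Answer 4 (score: 0)
DataFrame.update() is nice, but it doesn't let you specify the columns to join on and, more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-NaN values in the original DataFrame. To me, this is undesirable behavior.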
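To illustrate the NaN behavior described here, a minimal sketch with two made-up single-column frames (df and other are hypothetical names, not from the thread):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0]})
other = pd.DataFrame({"x": [np.nan, 20.0]})

# update() aligns on the index and only copies non-NaN values from `other`,
# so the NaN in row 0 does not overwrite the existing 1.0
df.update(other)
print(df)
```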
Here's a custom method I rolled to fix these issues. It's freshly written, so user beware.
import numpy as np
import pandas as pd

def join_insertion(into_df, from_df, on, cols, mult='error'):
    """
    Suppose A and B are dataframes. A has columns {foo, bar, baz} and B has columns {foo, baz, buz}
    This function allows you to do an operation like:
    "where A and B match via the column foo, insert the values of baz and buz from B into A"
    Note that this'll update A's values for baz and it'll insert buz as a new column.
    This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!
    :param into_df: dataframe you want to modify
    :param from_df: dataframe with the values you want to insert
    :param cols: list of column names (values to insert)
    :param on: list of column names (values to join on), or a dict of {into:from} column name pairs
    :param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
        an error can be raised, or the first matching value can be inserted, or the last matching value
        can be inserted
    :return: a modified copy of into_df, with updated values using from_df
    """
    # Infer left_on, right_on
    if isinstance(on, dict):
        left_on = list(on.keys())
        right_on = list(on.values())
    elif isinstance(on, list):
        left_on = on
        right_on = on
    elif isinstance(on, str):
        left_on = [on]
        right_on = [on]
    else:
        raise Exception("on should be a string, list or dictionary")
    # Make cols a list if it isn't already
    if isinstance(cols, str):
        cols = [cols]
    # Setup: work on copies so the inputs are left untouched
    A = into_df.copy()
    B = from_df[right_on + cols].copy()
    # Insert row ids so duplicated rows can be detected after the merge
    A['_A_RowId_'] = np.arange(A.shape[0])
    B['_B_RowId_'] = np.arange(B.shape[0])
    A = pd.merge(
        left=A,
        right=B,
        how='left',
        left_on=left_on,
        right_on=right_on,
        suffixes=(None, '_y'),
        indicator=True
    ).sort_values(['_A_RowId_', '_B_RowId_'])
    # Check for rows of A which got duplicated by the merge, and then handle appropriately
    if mult == 'error':
        if A.groupby('_A_RowId_').size().max() > 1:
            raise Exception("At least one key of into_df matched multiple rows of from_df.")
    elif mult == 'first':
        A = A.groupby('_A_RowId_').first().reset_index()
    elif mult == 'last':
        A = A.groupby('_A_RowId_').last().reset_index()
    # Overwrite values only in rows that found a match in from_df
    mask = A._merge == 'both'
    cols_in_both = list(set(into_df.columns.to_list()).intersection(set(cols)))
    for col in cols_in_both:
        A.loc[mask, col] = A.loc[mask, col + '_y']
    # Drop unwanted columns
    A.drop(columns=list(set(A.columns).difference(set(into_df.columns.to_list() + cols))), inplace=True)
    return A
into_df = pd.DataFrame({
    'foo': [1, 2, 3],
    'bar': [4, 5, 6],
    'baz': [7, 8, 9]
})
   foo  bar  baz
0    1    4    7
1    2    5    8
2    3    6    9
from_df = pd.DataFrame({
    'foo': [1, 3, 5, 7, 3],
    'baz': [70, 80, 90, 30, 40],
    'buz': [0, 1, 2, 3, 4]
})
   foo  baz  buz
0    1   70    0
1    3   80    1
2    5   90    2
3    7   30    3
4    3   40    4
# Use it!
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
Exception: At least one key of into_df matched multiple rows of from_df.
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  80.0  1.0
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  40.0  4.0
By the way, this is one of the things I sorely miss from R's data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]
Answer 5 (score: 0)
Here is another approach that uses combine_first().
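The answer gives no code, so the following is only a minimal sketch of how combine_first() could be applied here, assuming column a is the join key as in the other answers. Since combine_first takes the union of both indexes, the reindex step is an added assumption to keep left-join semantics (only keys present in left):

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 5, 6, 7], "c": [9, 10, 11, 12]})
right = pd.DataFrame({"a": [1, 2, 3], "c": [7, 8, 9], "d": [13, 14, 15]})

left_a = left.set_index("a")
right_a = right.set_index("a")

# Values from right_a take priority; gaps are filled from left_a.
# The result has the union of both frames' rows and columns.
res = right_a.combine_first(left_a)

# Keep only the keys that exist in left, then turn 'a' back into a column
res = res.reindex(left_a.index).reset_index()
print(res)
```

As with update(), a NaN in right will not overwrite a value in left, because combine_first only fills gaps.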