I have two data frames. main_df:
| header_1
0 | value_1
1 | value_2
2 | value_3
3 | value_1
And the lookup data frame lookup_df:
| header_1 | header_2
0 | value_1 | lookup_value_1
1 | value_2 | lookup_value_2
2 | value_3 | lookup_value_3
3 | value_4 | lookup_value_4
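The values in main_df are not unique. The values in lookup_df are unique.
I just want to populate a new column in main_df with the corresponding lookup value from lookup_df.
I have tried various approaches, including .merge, .join, .map and .lookup.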
main_df = pd.merge(main_df, lookup_df, how='inner', on=['header_1'])
The result I am looking for is:
| header_1 | header_2
0 | value_1 | lookup_value_1
1 | value_2 | lookup_value_2
2 | value_3 | lookup_value_3
3 | value_1 | lookup_value_1
Answer 0 (score: 2)
You can use map with a Series:
main_df['header_2'] = main_df['header_1'].map(lookup_df.set_index('header_1')['header_2'])
print (main_df)
header_1 header_2
0 value_1 lookup_value_1
1 value_2 lookup_value_2
2 value_3 lookup_value_3
3 value_1 lookup_value_1
Or convert the Series to a dict with to_dict:
main_df['header_2'] = main_df['header_1'].map(lookup_df.set_index('header_1')['header_2']
.to_dict())
print (main_df)
header_1 header_2
0 value_1 lookup_value_1
1 value_2 lookup_value_2
2 value_3 lookup_value_3
3 value_1 lookup_value_1
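For readers who want to run the map approach end to end, here is a minimal self-contained sketch. The frames are rebuilt from the tables in the question; the names `lookup` and `unmatched` and the key `value_9` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Rebuild the two frames from the question.
main_df = pd.DataFrame({"header_1": ["value_1", "value_2", "value_3", "value_1"]})
lookup_df = pd.DataFrame({
    "header_1": ["value_1", "value_2", "value_3", "value_4"],
    "header_2": ["lookup_value_1", "lookup_value_2", "lookup_value_3", "lookup_value_4"],
})

# A Series indexed by the key column; map() looks each value up in that index.
lookup = lookup_df.set_index("header_1")["header_2"]
main_df["header_2"] = main_df["header_1"].map(lookup)
print(main_df)

# A key that is missing from the lookup (hypothetical "value_9") maps to NaN
# instead of raising an error.
unmatched = pd.Series(["value_9"]).map(lookup)
print(unmatched)
```

Because map keeps the index of main_df, the row order and row count of main_df are unchanged, which matches the result the question asks for.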
Timings:
#[400000 rows x 1 columns]
main_df = pd.concat([main_df]*100000).reset_index(drop=True)
In [139]: %timeit pd.merge(main_df, lookup_df, how='left', on=['header_1'])
10 loops, best of 3: 73.1 ms per loop
In [140]: %timeit main_df['header_1'].map(lookup_df.set_index('header_1')['header_2'])
10 loops, best of 3: 35.7 ms per loop
In [141]: %timeit main_df['header_1'].map(lookup_df.set_index('header_1')['header_2'].to_dict())
10 loops, best of 3: 35.1 ms per loop
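The timings above use IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. The helper names `via_merge` and `via_map` are assumptions for illustration, and the numbers will depend on the machine and the pandas version, so none are shown here.

```python
import timeit

import pandas as pd

main_df = pd.DataFrame({"header_1": ["value_1", "value_2", "value_3", "value_1"]})
lookup_df = pd.DataFrame({
    "header_1": ["value_1", "value_2", "value_3", "value_4"],
    "header_2": ["lookup_value_1", "lookup_value_2", "lookup_value_3", "lookup_value_4"],
})

# 400000 rows x 1 column, as in the timings above.
big_df = pd.concat([main_df] * 100000).reset_index(drop=True)


def via_merge():
    # Left merge on the key column.
    return pd.merge(big_df, lookup_df, how="left", on=["header_1"])


def via_map():
    # Build the lookup Series inside the timed call, as the original timing does.
    return big_df["header_1"].map(lookup_df.set_index("header_1")["header_2"])


for fn in (via_merge, via_map):
    # Best of 3 single runs, reported in milliseconds.
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {best * 1000:.1f} ms")
```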
EDIT:
You need unique values in the header_1 column of lookup_df; one possible solution is drop_duplicates:
print (lookup_df)
header_1 header_2
0 value_1 lookup_value_1
1 value_2 lookup_value_2
2 value_3 lookup_value_3
3 value_1 lookup_value_4
#keep first value, default parameter keep='first'
lookup_df1 = lookup_df.drop_duplicates(['header_1'])
print (lookup_df1)
header_1 header_2
0 value_1 lookup_value_1
1 value_2 lookup_value_2
2 value_3 lookup_value_3
#keep last value
lookup_df2 = lookup_df.drop_duplicates(['header_1'], keep='last')
print (lookup_df2)
header_1 header_2
1 value_2 lookup_value_2
2 value_3 lookup_value_3
3 value_1 lookup_value_4
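Before choosing which duplicate to keep, it can help to detect that duplicates exist at all. The sketch below shows two checks that pandas provides: `Series.is_unique`, and the `validate` argument of `pd.merge`. The names `dup_lookup_df`, `has_unique_keys` and `merge_error` are illustrative assumptions.

```python
import pandas as pd

main_df = pd.DataFrame({"header_1": ["value_1", "value_2", "value_3", "value_1"]})

# A lookup table in which value_1 appears twice, as in the edit above.
dup_lookup_df = pd.DataFrame({
    "header_1": ["value_1", "value_2", "value_3", "value_1"],
    "header_2": ["lookup_value_1", "lookup_value_2", "lookup_value_3", "lookup_value_4"],
})

# Check 1: ask the key column directly whether its values are unique.
has_unique_keys = dup_lookup_df["header_1"].is_unique
print(has_unique_keys)

# Check 2: let merge enforce uniqueness of the right-hand keys.
# validate="many_to_one" raises MergeError when the right keys repeat.
try:
    pd.merge(main_df, dup_lookup_df, how="left", on="header_1", validate="many_to_one")
    merge_error = None
except pd.errors.MergeError as exc:
    merge_error = exc
    print("merge refused:", exc)
```

Without such a check, a merge against duplicated keys silently returns more rows than main_df has, and map with a non-unique index raises an error.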
Answer 1 (score: 1)
You have to do the merge without the 'how' keyword. Like this:
main_df = pd.DataFrame([{'header_1': 'value_1'},{'header_1': 'value_2'},{'header_1': 'value_3'},{'header_1': 'value_1'}])
lookup_df = pd.DataFrame([{'header_1':'value_1', 'header_2':'lookup_value_1'}, {'header_1':'value_2', 'header_2':'lookup_value_2'}, {'header_1':'value_3', 'header_2':'lookup_value_3'}, {'header_1':'value_4', 'header_2':'lookup_value_4'}])
main_df = pd.merge(main_df, lookup_df, on='header_1')
Output:
header_1 header_2
0 value_1 lookup_value_1
1 value_1 lookup_value_1
2 value_2 lookup_value_2
3 value_3 lookup_value_3
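Note that the rows in this output are not in the order the question asks for. Omitting `how` is the same as `how='inner'`, the default for pd.merge. A minimal sketch of an alternative, assuming the same two frames, is a left merge: it keeps the keys in the order they appear in main_df and also keeps any main_df row that has no match. The name `result` is an assumption for illustration.

```python
import pandas as pd

main_df = pd.DataFrame({"header_1": ["value_1", "value_2", "value_3", "value_1"]})
lookup_df = pd.DataFrame({
    "header_1": ["value_1", "value_2", "value_3", "value_4"],
    "header_2": ["lookup_value_1", "lookup_value_2", "lookup_value_3", "lookup_value_4"],
})

# how="left" uses only the keys from main_df and preserves their order.
# Since the keys in lookup_df are unique, the row count of main_df is unchanged.
result = pd.merge(main_df, lookup_df, how="left", on="header_1")
print(result)
```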