I want to compare two DataFrames of strings and then join them on the common sequence the strings share.
The data looks like this:
DATA1:
Kansas
Sacramento
Miami
Toronto
DATA2:
Kansas_county
Sacramento_county
Miami_county
Vegas_county
The desired result is:
col_data1    col_data2
Kansas       Kansas_county
Sacramento   Sacramento_county
Miami        Miami_county
Toronto      N/A
N/A          Vegas_county
My question is how to get this result.
Many thanks in advance.
Answer 0 (score: 0)
You can add a new column to the first DataFrame and then use pandas.merge():
>>> import pandas as pd
>>> df1 = pd.DataFrame({'col':['Kansas', 'Sacramento', 'Miami', 'Toronto']})
>>> df2 = pd.DataFrame({'col':['Kansas_county', 'Sacramento_county', 'Miami_county', 'Vegas_county']})
>>>
>>> # Build a temporary join key that matches the format used in df2
>>> df1['county'] = df1['col'] + '_county'
>>>
>>> dfN = pd.merge(df1, df2, how='outer', left_on='county', right_on='col', suffixes=('_data1', '_data2'))
>>>
>>> # Drop the temporary key once the merge is done
>>> del dfN['county']
>>> dfN
    col_data1          col_data2
0      Kansas      Kansas_county
1  Sacramento  Sacramento_county
2       Miami       Miami_county
3     Toronto                NaN
4         NaN       Vegas_county
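To see which rows found a partner and which did not, the same merge can be run with `indicator=True`, which adds a `_merge` column. The following is only a sketch under the assumptions of this answer (every key in the second frame is the first frame's value plus `_county`); the names `left` and `unmatched` are illustrative, and the row order of an outer merge may differ between pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({"col": ["Kansas", "Sacramento", "Miami", "Toronto"]})
df2 = pd.DataFrame({"col": ["Kansas_county", "Sacramento_county",
                            "Miami_county", "Vegas_county"]})

# Derive the join key on a copy so that df1 itself is left unchanged.
left = df1.assign(county=df1["col"] + "_county")

merged = pd.merge(left, df2, how="outer",
                  left_on="county", right_on="col",
                  suffixes=("_data1", "_data2"), indicator=True)

# _merge is "both", "left_only" or "right_only" for each row.
unmatched = merged.loc[merged["_merge"] != "both",
                       ["col_data1", "col_data2", "_merge"]]
print(unmatched)
```

Using `assign` instead of adding the column in place avoids having to delete the helper column from `df1` afterwards.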
Answer 1 (score: 0)
Building on Roman's answer, you can define a function that formats one of your data columns, for example:
In [104]: import pandas as pd
In [105]: df1 = pd.DataFrame({'col':['Kansas', 'Sacramento', 'Miami', 'Toronto']})
In [106]: df2 = pd.DataFrame({'col':['Kansas_county', 'Sacramento_county', 'Miami_county', 'Vegas_county']})
In [107]: def f(x, delm='_'):
     ...:     # Keep only the part of the string before the first delimiter
     ...:     return x.split(delm)[0]
In [108]: df2['map_index'] = df2.col.map(f)
In [109]: df2
Out[109]:
                 col   map_index
0      Kansas_county      Kansas
1  Sacramento_county  Sacramento
2       Miami_county       Miami
3       Vegas_county       Vegas
In [110]: dfN = pd.merge(df1, df2, how='outer', left_on='col', right_on='map_index')
In [111]: dfN
Out[111]:
        col_x              col_y   map_index
0      Kansas      Kansas_county      Kansas
1  Sacramento  Sacramento_county  Sacramento
2       Miami       Miami_county       Miami
3     Toronto                NaN         NaN
4         NaN       Vegas_county       Vegas
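This is essentially the same as what Roman outlined, but it gives you a more general way to format the key (you can put whatever you like in the function, including regular-expression parsing and so on).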
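As one possible take on the regular-expression variant mentioned above, the key can be pulled out with `Series.str.extract` instead of a hand-written function. This is a sketch, not part of the original answer: the pattern assumes the key is everything before a trailing `_county`, and the names `merged` and `result` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"col": ["Kansas", "Sacramento", "Miami", "Toronto"]})
df2 = pd.DataFrame({"col": ["Kansas_county", "Sacramento_county",
                            "Miami_county", "Vegas_county"]})

# Assumed pattern: the join key is the text before a trailing "_county".
# expand=False returns a Series; values that do not match become NaN.
df2["map_index"] = df2["col"].str.extract(r"^(.*)_county$", expand=False)

merged = pd.merge(df1, df2, how="outer",
                  left_on="col", right_on="map_index",
                  suffixes=("_data1", "_data2"))

# Keep only the two columns asked for in the question.
result = merged[["col_data1", "col_data2"]]
print(result)
```

Unlike splitting on the first `_`, this pattern keeps names that themselves contain an underscore intact, as long as they end in `_county`.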