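Suppose I have two DataFrames that contain city names in different formats. I want to match them on their state and on the first four characters of each city name. A small example: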
import pandas as pd
df1 = pd.DataFrame({'city': ['NEW YORK', 'DALLAS', 'LOS ANGELES', 'SAN FRANCISCO'],
'state' : ['NY', 'TX', 'CA', 'CA'],
'value' : [1,2,3,4]})
df2 = pd.DataFrame({'city': ['NEW YORK CITY', 'DALLAS/ABC', 'LOS ANG', 'ABC'],
'state': ['NY', 'TX', 'CA', 'CA'],
'temp': [20,21,21,23]})
df1
city state value
0 NEW YORK NY 1
1 DALLAS TX 2
2 LOS ANGELES CA 3
3 SAN FRANCISCO CA 4
df2
city state temp
0 NEW YORK CITY NY 20
1 DALLAS/ABC TX 21
2 LOS ANG CA 21
3 ABC CA 23
What I want is a DataFrame like this:
city state temp values
0 NEW YORK NY 20 1
1 DALLAS TX 21 2
2 LOS ANG CA 21 3
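Because the city names do not match exactly, I can't use isin(). So far I have been thinking about using str.contains, but I can't come up with an efficient way to do it.
Any help is much appreciated.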
Answer 0: (score: 1)
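Create a temporary column, city4, holding the first four characters of the city name, and use it as a key in merge together with state: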
In [5247]: pd.merge(df1.assign(city4=df1.city.str[:4]),
                    df2.assign(city4=df2.city.str[:4]),
                    on=['city4', 'state']).drop(columns='city4')
Out[5247]:
city_x state value city_y temp
0 NEW YORK NY 1 NEW YORK CITY 20
1 DALLAS TX 2 DALLAS/ABC 21
2 LOS ANGELES CA 3 LOS ANG 21
More specifically, to also drop the duplicate city column and keep the original name:
In [5251]: (pd.merge(df1.assign(city4=df1.city.str[:4]),
      ...:           df2.assign(city4=df2.city.str[:4]),
      ...:           on=['city4', 'state'])
      ...:    .drop(columns=['city4', 'city_y'])
      ...:    .rename(columns={'city_x': 'city'}))
Out[5251]:
city state value temp
0 NEW YORK NY 1 20
1 DALLAS TX 2 21
2 LOS ANGELES CA 3 21
Details
In [5255]: df1.assign(city4=df1.city.str[:4])
Out[5255]:
city state value city4
0 NEW YORK NY 1 NEW
1 DALLAS TX 2 DALL
2 LOS ANGELES CA 3 LOS
3 SAN FRANCISCO CA 4 SAN
In [5256]: df2.assign(city4=df2.city.str[:4])
Out[5256]:
city state temp city4
0 NEW YORK CITY NY 20 NEW
1 DALLAS/ABC TX 21 DALL
2 LOS ANG CA 21 LOS
3 ABC CA 23 ABC
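For reference, the merge approach above can be wrapped into a self-contained sketch. The helper name merge_on_city_prefix, its parameter n, and the temporary column name city_prefix are hypothetical and only used for illustration. It uses drop(columns=...) because recent pandas versions no longer accept the axis as a positional argument.

```python
import pandas as pd

df1 = pd.DataFrame({'city': ['NEW YORK', 'DALLAS', 'LOS ANGELES', 'SAN FRANCISCO'],
                    'state': ['NY', 'TX', 'CA', 'CA'],
                    'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'city': ['NEW YORK CITY', 'DALLAS/ABC', 'LOS ANG', 'ABC'],
                    'state': ['NY', 'TX', 'CA', 'CA'],
                    'temp': [20, 21, 21, 23]})


def merge_on_city_prefix(left, right, n=4):
    """Inner-join two frames on state plus the first n characters of city.

    Hypothetical helper: the name, the parameter n and the temporary
    column name are illustrative and not part of pandas.
    """
    key = 'city_prefix'  # temporary join key, dropped before returning
    merged = pd.merge(
        left.assign(**{key: left['city'].str[:n]}),
        right.assign(**{key: right['city'].str[:n]}),
        on=[key, 'state'],
        suffixes=('', '_right'),  # keep the left city name unsuffixed
    )
    return merged.drop(columns=[key, 'city_right'])


result = merge_on_city_prefix(df1, df2)
print(result)
```

Because this is an inner join, rows without a partner on either side (SAN FRANCISCO in df1, ABC in df2) are dropped. If several rows share the same state and prefix, the merge produces every pairing, so a longer prefix may be needed on real data.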
Answer 1: (score: 0)
Another way is to use map, building a key from the state and the first four letters of the city, i.e.
one = df1.state+df1.city.str[:4]
two = df2.state+df2.city.str[:4]
df1['temp']=(one).map(df2.set_index(two)['temp'].to_dict())
df1 = df1.dropna()
city state value temp
0 NEW YORK NY 1 20.0
1 DALLAS TX 2 21.0
2 LOS ANGELES CA 3 21.0
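The temp column comes back as float because map inserts NaN for keys with no match before dropna runs. A minimal sketch of the same idea that leaves df1 unmodified and restores the integer dtype is shown below. The names key1, key2, lookup and out are illustrative, and keeping the first row for a repeated key is an assumption, not something stated in the answer.

```python
import pandas as pd

df1 = pd.DataFrame({'city': ['NEW YORK', 'DALLAS', 'LOS ANGELES', 'SAN FRANCISCO'],
                    'state': ['NY', 'TX', 'CA', 'CA'],
                    'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'city': ['NEW YORK CITY', 'DALLAS/ABC', 'LOS ANG', 'ABC'],
                    'state': ['NY', 'TX', 'CA', 'CA'],
                    'temp': [20, 21, 21, 23]})

# Build the lookup key as state + first four characters of the city name.
key1 = df1['state'] + df1['city'].str[:4]
key2 = df2['state'] + df2['city'].str[:4]

# Series mapping key -> temp; keep the first row per key (an assumption)
# so that map sees a unique index.
lookup = pd.Series(df2['temp'].to_numpy(), index=key2)
lookup = lookup[~lookup.index.duplicated(keep='first')]

# Unmatched keys become NaN, which forces float; drop them, then cast back.
out = df1.assign(temp=key1.map(lookup)).dropna(subset=['temp'])
out = out.astype({'temp': 'int64'})
print(out)
```

Unlike merge, map can only attach one value per key, so this variant suits a one-to-one or many-to-one lookup.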