I have a DataFrame of addresses that has no zip codes:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'address1': ['1 o\'toole st', '2 main st', '3 high street', '5 foo street', '10 foo street'],
                    'address2': ['town1', np.nan, np.nan, 'Bartown', np.nan],
                    'address3': [np.nan, 'village', 'city', 'county2', 'county3']})
df1['zipcode'] = ''
df1
        address1 address2 address3 zipcode
0   1 o'toole st    town1      NaN
1      2 main st      NaN  village
2  3 high street      NaN     city
3   5 foo street  Bartown  county2
4  10 foo street      NaN  county3
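I also have a second DataFrame that contains addresses together with zip codes. Note that it is in the same order as df1 here, but that is not the case in the real data I am working with: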
df2 = pd.DataFrame({'address1': ['1 o\'toole st', '2 main st', '7 mill street', '5 foo street', '10 foo street'],
                    'address2': ['town1', 'village', 'city', 'Bartown', 'county3'],
                    'address3': [np.nan, np.nan, np.nan, 'county2', 'USA'],
                    'zipcode': ['er45', 'qw23', 'rt67', 'yu89', 'yu83']})
df2
        address1 address2 address3 zipcode
0   1 o'toole st    town1      NaN    er45
1      2 main st  village      NaN    qw23
2  7 mill street     city      NaN    rt67
3   5 foo street  Bartown  county2    yu89
4  10 foo street  county3      USA    yu83
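I want to check whether each address in df1 is in df2 and, if it is, pull the zip code into df1.
This is where I am having trouble, and I am not sure this is the best way to approach the problem.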
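What I have done so far is build a key for both DataFrames from the first two address lines, address1 and address2, stripping all whitespace and non-alphanumeric characters and converting to lowercase: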
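# regex=True is needed on pandas >= 2.0, where str.replace treats the pattern literally by default
df1['key'] = (df1['address1'] + df1['address2']).str.lower().str.replace(' ', '').str.replace(r'\W', '', regex=True)
df2['key'] = (df2['address1'] + df2['address2']).str.lower().str.replace(' ', '').str.replace(r'\W', '', regex=True)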
print(df1)
        address1 address2 address3 zipcode                key
0   1 o'toole st    town1      NaN             1otoolesttown1
1      2 main st      NaN  village                        NaN
2  3 high street      NaN     city                        NaN
3   5 foo street  Bartown  county2          5foostreetbartown
4  10 foo street      NaN  county3                        NaN
print(df2)
        address1 address2 address3 zipcode                 key
0   1 o'toole st    town1      NaN    er45      1otoolesttown1
1      2 main st  village      NaN    qw23      2mainstvillage
2  7 mill street     city      NaN    rt67     7millstreetcity
3   5 foo street  Bartown  county2    yu89   5foostreetbartown
4  10 foo street  county3      USA    yu83  10foostreetcounty3
Now I use np.where to pull the information into the empty zipcode column in df1, returning no_match where no matching address is found:
df1['zipcode'] = np.where(df1['key'].isin(df2['key']), df2['zipcode'], 'no_match')
print(df1)
        address1 address2 address3   zipcode                key
0   1 o'toole st    town1      NaN      er45     1otoolesttown1
1      2 main st      NaN  village  no_match                NaN
2  3 high street      NaN     city  no_match                NaN
3   5 foo street  Bartown  county2      yu89  5foostreetbartown
4  10 foo street      NaN  county3  no_match                NaN
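My problem is the key created for df1. As you can see, some of the keys are NaN. This happens because the address format in df1 differs from the one in df2, and this is the dataset I have to work with.
I tried to get around this by skipping any NaN and using the next address line instead, but I get a ValueError: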
# add address1 + address2 if it's not null, otherwise use address3
df1['key'] = (df1['address1'] + (df1['address2'] if pd.notnull(df1['address2']) else df1['address3']))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any feedback or advice on how to solve this would be much appreciated. If there is a simpler way, I would love to know.
Answer 0 (score: 4)
Use Series.fillna to replace the missing values with df1['address3']:
df1['key'] = df1['address1'] + df1['address2'].fillna(df1['address3'])
instead of:
df1['key'] = (df1['address1'] + (df1['address2'] if
pd.notnull(df1['address2']) else df1['address3']))
For more information about your error, see "Using if/truth statements with pandas" in the pandas documentation.
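To make the error concrete, here is a minimal sketch. The Series `s` and `fallback` are made-up stand-ins for df1['address2'] and df1['address3'], not the question's data. A Python `if` needs a single True or False, but `pd.notnull` on a Series returns one boolean per row, so pandas raises instead of guessing. `fillna` makes the same choice row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df1['address2'] and df1['address3']
s = pd.Series(['town1', np.nan, 'Bartown'])
fallback = pd.Series(['x', 'village', 'county2'])

mask = pd.notnull(s)  # one boolean per row, not a single bool

try:
    chosen = s if mask else fallback  # Python calls bool(mask) here
except ValueError as err:
    error_message = str(err)  # "The truth value of a Series is ambiguous. ..."

# Element-wise choice: keep s where it is present, otherwise take fallback
chosen = s.fillna(fallback)
print(chosen.tolist())  # ['town1', 'village', 'Bartown']
```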
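Answer 1 (score: 1)
I would first replace the NaN values with empty strings and then concatenate the three address columns to get the address in a single column, much as you did: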
# filling NaN values
df1.fillna('', inplace=True)
df2.fillna('', inplace=True)
# concatenate the address columns
df1['address'] = df1['address1']+df1['address2']+df1['address3']
df2['address'] = df2['address1']+df2['address2']+df2['address3']
Then set the new 'address' column as the index in both DataFrames:
df1.set_index('address', inplace=True)
df2.set_index('address', inplace=True)
Finally, add the zip codes to df1:
df1['zipcode'] = df2['zipcode']
Here is the result:
                                 address1 address2 address3 zipcode
address
1 o'toole sttown1            1 o'toole st    town1             er45
2 main stvillage                2 main st           village    qw23
3 high streetcity           3 high street              city     NaN
5 foo streetBartowncounty2   5 foo street  Bartown  county2    yu89
10 foo streetcounty3        10 foo street           county3    yu89
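The assignment in the last step relies on pandas matching values by index label rather than by row position, which is why it should still work when df2 is in a different order. A minimal sketch with made-up frames (`left`, `right` and their values are illustrative only, not the data above):

```python
import pandas as pd

# Illustrative frames; the index plays the role of the 'address' key
left = pd.DataFrame({'n': [1, 2, 3]}, index=['a st', 'b st', 'c st'])
right = pd.DataFrame({'zipcode': ['zz9', 'aa1']}, index=['b st', 'a st'])

# Values are matched by index label; labels missing from `right` become NaN
left['zipcode'] = right['zipcode']
print(left)
```

It is probably worth checking that the lookup frame's labels are unique first, for example with `right.index.is_unique`, since repeated addresses would make the lookup ambiguous.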
Answer 2 (score: 1)
Your problem is this line:
df1['key'] = (df1['address1'] + (df1['address2'] if pd.notnull(df1['address2']) else df1['address3']))
The if used here causes the error because pd.notnull returns a boolean Series, whereas an if expression needs a single boolean value.
You can solve this with pandas.Series.where:
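# Keep address2 where it is present, otherwise fall back to address3
# regex=True is needed on pandas >= 2.0, where str.replace treats the pattern literally by default
df1['key'] = (df1['address1'] +
              df1['address2'].where(pd.notnull(df1['address2']), df1['address3'])) \
    .str.lower().str.replace(' ', '').str.replace(r'\W', '', regex=True)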
This produces a df1 with the keys you are looking for:
        address1 address2 address3                 key
0   1 o'toole st    town1      NaN      1otoolesttown1
1      2 main st      NaN  village      2mainstvillage
2  3 high street      NaN     city     3highstreetcity
3   5 foo street  Bartown  county2   5foostreetbartown
4  10 foo street      NaN  county3  10foostreetcounty3
Now you can merge in the zip codes.
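One possible way to do that final step, sketched under the assumption that both frames already carry a normalised key column like the one built above. The small frames here are trimmed-down copies of the question's data with df2 deliberately reordered, and `zip_by_key` is a name introduced for illustration. `Series.map` looks each key up by value, so the row order of df2 does not matter, and unmatched keys can be filled with no_match:

```python
import pandas as pd

# Trimmed-down copies of the question's frames, with df2 in a different order
df1 = pd.DataFrame({'key': ['1otoolesttown1', '2mainstvillage', '3highstreetcity']})
df2 = pd.DataFrame({'key': ['2mainstvillage', '7millstreetcity', '1otoolesttown1'],
                    'zipcode': ['qw23', 'rt67', 'er45']})

# Lookup table key -> zipcode (duplicate keys dropped so each key maps to one value)
zip_by_key = df2.drop_duplicates('key').set_index('key')['zipcode']

# Look up each df1 key by value; keys absent from df2 come back as NaN
df1['zipcode'] = df1['key'].map(zip_by_key).fillna('no_match')
print(df1['zipcode'].tolist())  # ['er45', 'qw23', 'no_match']
```

A left join such as `df1.merge(df2[['key', 'zipcode']], on='key', how='left')` should give the same matches if you prefer `merge`; unmatched rows would then hold NaN rather than no_match.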