Trying to pull zip codes from one DataFrame into another DataFrame of addresses

Time: 2019-07-17 09:37:04

Tags: python pandas dataframe

I have a DataFrame of addresses with no zip codes:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'address1':['1 o\'toole st','2 main st','3 high street','5 foo street','10 foo street'],
                   'address2':['town1',np.nan,np.nan,'Bartown',np.nan],
                   'address3':[np.nan,'village','city','county2','county3']})
df1['zipcode']=''
df1

        address1 address2 address3 zipcode
0   1 o'toole st    town1      NaN        
1      2 main st      NaN  village        
2  3 high street      NaN     city        
3   5 foo street  Bartown  county2        
4  10 foo street      NaN  county3 

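I also have a second DataFrame that contains addresses and zip codes. Note that here it is in the same order as df1, but that is not the case in the real data I am working with: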

df2 = pd.DataFrame({'address1':['1 o\'toole st','2 main st','7 mill street','5 foo street','10 foo street'],
                   'address2':['town1','village','city','Bartown','county3'],
                   'address3':[np.nan,np.nan,np.nan,'county2','USA'],
                   'zipcode': ['er45','qw23','rt67','yu89','yu83']})
df2

        address1 address2 address3 zipcode
0   1 o'toole st    town1      NaN    er45
1      2 main st  village      NaN    qw23
2  7 mill street     city      NaN    rt67
3   5 foo street  Bartown  county2    yu89
4  10 foo street  county3      USA    yu83

I want to check whether each address in df1 is in df2 and, if it is, pull the zip code into df1.

This is where I am running into trouble, and I am not sure this is the best way to approach the problem.

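What I have done so far is build a key for both DataFrames from the first two address lines, address1 and address2, converting to lowercase and stripping all whitespace and non-word characters: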

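# regex=True is needed for the \W pattern: since pandas 2.0, str.replace treats the pattern as a literal string by default
df1['key'] = (df1['address1'] + df1['address2']).str.lower().str.replace(' ', '').str.replace(r'\W', '', regex=True)

df2['key'] = (df2['address1'] + df2['address2']).str.lower().str.replace(' ', '').str.replace(r'\W', '', regex=True)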


print(df1)

        address1 address2 address3 zipcode                key
0   1 o'toole st    town1      NaN             1otoolesttown1
1      2 main st      NaN  village                        NaN
2  3 high street      NaN     city                        NaN
3   5 foo street  Bartown  county2          5foostreetbartown
4  10 foo street      NaN  county3                        NaN

print(df2)

        address1 address2 address3 zipcode                 key
0   1 o'toole st    town1      NaN    er45      1otoolesttown1
1      2 main st  village      NaN    qw23      2mainstvillage
2  7 mill street     city      NaN    rt67     7millstreetcity
3   5 foo street  Bartown  county2    yu89   5foostreetbartown
4  10 foo street  county3      USA    yu83  10foostreetcounty3

Now I use np.where to pull the information into the empty zipcode column in df1, returning no_match when no matching address is found:

df1['zipcode'] = np.where(df1['key'].isin(df2['key']), df2['zipcode'], 'no_match')

print(df1)

        address1 address2 address3   zipcode                key
0   1 o'toole st    town1      NaN      er45     1otoolesttown1
1      2 main st      NaN  village  no_match                NaN
2  3 high street      NaN     city  no_match                NaN
3   5 foo street  Bartown  county2      yu89  5foostreetbartown
4  10 foo street      NaN  county3  no_match                NaN
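
Note that np.where pairs rows purely by position, so the call above returns the right zip codes only because df2 happens to be in the same order as df1. A minimal sketch of a key-based lookup with Series.map, which does not depend on row order, might look like the following. The frames left and right and their values are made up for illustration, and duplicate keys are assumed to be safe to drop.

```python
import pandas as pd

# Hypothetical toy data: 'right' is deliberately in a different order than 'left'
left = pd.DataFrame({'key': ['a', 'b', 'c']})
right = pd.DataFrame({'key': ['c', 'a', 'x'], 'zipcode': ['z3', 'z1', 'z9']})

# Build a key -> zipcode lookup; drop duplicate keys so the lookup index is unique
lookup = right.drop_duplicates('key').set_index('key')['zipcode']

# map() looks each key up by value, not by position; unmatched keys become NaN
left['zipcode'] = left['key'].map(lookup).fillna('no_match')
print(left)
```

Because the lookup is by key value, reordering either frame does not change the result.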

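My problem is the key created for df1. As you can see, some of the keys are NaN. This is because the address format differs from the one in df2, and this is the dataset I have to work with.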

I tried to get around this by skipping any NaN and using the next address line instead, but I get a ValueError:

# add address1 + address2 if it's not null, otherwise use address3

df1['key'] = (df1['address1'] + (df1['address2'] if pd.notnull(df1['address2']) else df1['address3']))

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any feedback or advice on how to fix this would be appreciated. If there is a simpler way to do it, I would love to know.

3 answers:

Answer 0: (score: 4)

Use Series.fillna to replace the missing values in df1['address2'] with the values from df1['address3']:

df1['key'] = df1['address1'] + df1['address2'].fillna(df1['address3'])

instead of:

df1['key'] = (df1['address1'] + (df1['address2'] if 
                                   pd.notnull(df1['address2']) else df1['address3']))

For more information about your error, see "Using if/truth statements with pandas" in the pandas documentation.
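
To illustrate the error, here is a minimal sketch on a small made-up Series (the names s and fallback are hypothetical, not from the question). A plain if needs a single True or False, while pd.notnull returns one boolean per row, so pandas raises instead of guessing. Element-wise methods such as fillna or where make the choice row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df1['address2'] and df1['address3']
s = pd.Series(['town1', np.nan, 'Bartown'])
fallback = pd.Series(['x', 'village', 'y'])

mask = pd.notnull(s)  # a boolean Series, one value per row

try:
    if mask:  # ambiguous: should this mean "any row" or "all rows"?
        pass
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Element-wise alternatives: pick per row instead of once for the whole Series
filled = s.fillna(fallback)
chosen = s.where(mask, fallback)
print(filled.tolist())
```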

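Answer 1: (score: 1)

I would first replace the NaN values with empty strings and then concatenate the three address columns to get the full address in a single column, much as you did: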

# filling NaN values
df1.fillna('', inplace=True)
df2.fillna('', inplace=True)

# concatenate the address columns
df1['address'] = df1['address1']+df1['address2']+df1['address3']
df2['address'] = df2['address1']+df2['address2']+df2['address3']

Then set the new "address" column as the index in both DataFrames:

df1.set_index('address', inplace=True)
df2.set_index('address', inplace=True)

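Finally, add the zip codes to df1. The assignment aligns on the index, so each row gets the zip code of the df2 row with the same address: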

df1['zipcode'] = df2['zipcode']

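Here is the result. Rows whose concatenated address has no exact counterpart in df2 get NaN; that includes 10 foo street, because df2 has 'USA' in address3 for that address:

                                 address1  address2  address3  zipcode
address
1 o'toole sttown1            1 o'toole st     town1               er45
2 main stvillage                2 main st             village     qw23
3 high streetcity           3 high street                city      NaN
5 foo streetBartowncounty2   5 foo street   Bartown   county2     yu89
10 foo streetcounty3        10 foo street             county3      NaN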

Answer 2: (score: 1)

Your problem is this line:

df1['key'] = (df1['address1'] + (df1['address2'] if pd.notnull(df1['address2']) else df1['address3']))

The if used here causes the error because pd.notnull produces a boolean Series, whereas an if expression needs a single boolean value.
You can fix this with pandas.Series.where:

# keep address2 where it is not null, otherwise fall back to address3
df1['key'] = (df1['address1'] +
             df1['address2'].where(pd.notnull(df1['address2']), df1['address3'])) \
             .str.lower().str.replace(' ', '').str.replace(r'\W', '', regex=True)

This produces a df1 with the keys you are looking for:

        address1 address2 address3                 key
0   1 o'toole st    town1      NaN      1otoolesttown1
1      2 main st      NaN  village      2mainstvillage
2  3 high street      NaN     city     3highstreetcity
3   5 foo street  Bartown  county2   5foostreetbartown
4  10 foo street      NaN  county3  10foostreetcounty3

Now you can merge in the zip codes.
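
As a rough sketch of that last step, the key can be built the same way on both DataFrames and the zip codes brought in with a left merge. The helper make_key is a hypothetical name introduced here, and the sketch assumes df1 does not yet have a zipcode column and that duplicate keys in df2 can safely be dropped.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'address1': ["1 o'toole st", '2 main st', '3 high street', '5 foo street', '10 foo street'],
                    'address2': ['town1', np.nan, np.nan, 'Bartown', np.nan],
                    'address3': [np.nan, 'village', 'city', 'county2', 'county3']})
df2 = pd.DataFrame({'address1': ["1 o'toole st", '2 main st', '7 mill street', '5 foo street', '10 foo street'],
                    'address2': ['town1', 'village', 'city', 'Bartown', 'county3'],
                    'address3': [np.nan, np.nan, np.nan, 'county2', 'USA'],
                    'zipcode': ['er45', 'qw23', 'rt67', 'yu89', 'yu83']})

def make_key(df):
    """Hypothetical helper: address1 plus address2, falling back to address3 when address2 is missing."""
    second = df['address2'].fillna(df['address3'])
    # Lowercase, then drop every non-word character (this also removes spaces)
    return (df['address1'] + second).str.lower().str.replace(r'\W', '', regex=True)

df1['key'] = make_key(df1)
df2['key'] = make_key(df2)

# Keep one zip code per key so the left merge cannot duplicate df1 rows
lookup = df2.drop_duplicates('key')[['key', 'zipcode']]

# how='left' keeps every df1 row, in df1's order; unmatched keys get NaN
result = df1.merge(lookup, on='key', how='left')
result['zipcode'] = result['zipcode'].fillna('no_match')
print(result)
```

The merge matches on key values rather than row position, so it also works when df2 is in a different order than df1.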