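I have some addresses that I need to clean up.
As you can see, some entries in the address1 column are just numbers, when they should be a number followed by a street name, like the first three rows.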
import numpy as np
import pandas as pd
df = pd.DataFrame({'address1': ['15 Main Street', '10 High Street', '5 Other Street', np.nan, '15', '12'],
                   'address2': ['New York', 'LA', 'London', 'Tokyo', 'Grove Street', 'Garden Street']})
print(df)
         address1       address2
0  15 Main Street       New York
1  10 High Street             LA
2  5 Other Street         London
3             NaN          Tokyo
4              15   Grove Street
5              12  Garden Street
I am trying to write a function that checks whether address1 is a number and, if it is, joins the number from address1 with the street name from address2, then deletes address2.
This is my expected output. Indexes 4 and 5 now have complete address1 entries:
           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN <---
5  12 Garden Street       NaN <---
What I tried, using .apply():
def f(x):
    try:
        #if address1 is int
        if isinstance(int(x['address1']), int):
            # create new address using address1 + address 2
            newaddress = str(x['address1']) +' '+ str(x['address2'])
            # delete address2
            x['address2'] = np.nan
            # return newaddress to address1 column
            return newadress
    except:
        pass
Applying the function:
df['address1'] = df.apply(f, axis=1)
However, the address1 column is now all None.
I have tried a few variations of this function but cannot get it to work. Any suggestions would be appreciated.
Answer 0 (score: 1)
You can create a mask and update the matching rows:
mask = pd.to_numeric(df.address1, errors='coerce').notna()
df.loc[mask, 'address1'] = df.loc[mask, 'address1'] + ' ' +df.loc[mask,'address2']
df.loc[mask, 'address2'] = np.nan
Output:
           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN
5  12 Garden Street       NaN
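To make the intermediate step visible, here is a minimal self-contained sketch of the same mask approach that also prints the mask. It assumes the sample df from the question and introduces nothing beyond the calls already used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'address1': ['15 Main Street', '10 High Street', '5 Other Street', np.nan, '15', '12'],
    'address2': ['New York', 'LA', 'London', 'Tokyo', 'Grove Street', 'Garden Street'],
})

# Street text and NaN are coerced to NaN, so notna() is True only for pure numbers.
mask = pd.to_numeric(df['address1'], errors='coerce').notna()
print(mask.tolist())  # [False, False, False, False, True, True]

# Join number and street name on the masked rows, then clear address2 there.
df.loc[mask, 'address1'] = df.loc[mask, 'address1'] + ' ' + df.loc[mask, 'address2']
df.loc[mask, 'address2'] = np.nan
print(df)
```

Only rows 4 and 5 are selected, so the first four rows are never touched by the two assignments.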
Answer 1 (score: 1)
Try this.
Wrap the conversion of address1 to int in a try/except:
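def test(row):
    try:
        int(row['address1'])
        return 1
    except (ValueError, TypeError):
        # address1 is not a pure integer (street text or NaN)
        return 0
# flag the rows whose address1 is only a number
df['test'] = df.apply(test, axis=1)
df['address1'] = np.where(df['test'] == 1, df['address1'] + ' ' + df['address2'], df['address1'])
df['address2'] = np.where(df['test'] == 1, np.nan, df['address2'])
df.drop(['test'], axis=1, inplace=True)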
           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN
5  12 Garden Street       NaN
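Answer 2 (score: 1)
You can avoid apply by using str.isdigit to select exactly the rows that need to be modified. Create a mask m to identify those rows, use agg on them to build a sub-dataframe, and finally append it back to the original df.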
m = df.address1.astype(str).str.isdigit()
df1 = df[m].agg(' '.join, axis=1).to_frame('address1').assign(address2=np.nan)
Out[179]:
           address1  address2
4   15 Grove Street       NaN
5  12 Garden Street       NaN
Finally, append it back to df (with pd.concat, since DataFrame.append was removed in pandas 2.0):
pd.concat([df[~m], df1])
Out[200]:
           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN
5  12 Garden Street       NaN
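Note that a str.isdigit mask and a pd.to_numeric mask agree on the question's data but not on every input. The sketch below compares them on a few hypothetical values (these are made up for illustration and do not appear in the question), in case your real data contains decimals or signs.

```python
import pandas as pd

# Hypothetical sample values, chosen only to show where the two checks differ.
s = pd.Series(['15', '15.5', '-3', '12a'])

# to_numeric accepts anything that parses as a number, including decimals and signs.
numeric_mask = pd.to_numeric(s, errors='coerce').notna()

# str.isdigit is True only when every character is a digit.
digit_mask = s.astype(str).str.isdigit()

print(numeric_mask.tolist())  # [True, True, True, False]
print(digit_mask.tolist())    # [True, False, False, False]
```

If house numbers are always plain non-negative integers, either mask should work; otherwise pick the one that matches what you consider "just a number".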
If you still insist on using apply, you need to move the return in f outside the if, so that it returns both the unmodified and the modified rows:
def f(x):
    y = x.copy()
    try:
        # if address1 is int
        if isinstance(int(x['address1']), int):
            # create new address using address1 + address2
            y['address1'] = str(x['address1']) + ' ' + str(x['address2'])
            # delete address2
            y['address2'] = np.nan
    except (ValueError, TypeError):
        # address1 is not a pure number (street text or NaN): leave the row as is
        pass
    return y
df.apply(f, axis=1)
Out[213]:
           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN
5  12 Garden Street       NaN
Note: it is recommended that the function passed to apply not modify the object it receives, so I do y = x.copy(), then modify and return y.
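Because f works on a copy, df.apply(f, axis=1) returns a new frame and leaves df as it was, so the result has to be assigned back. A minimal sketch of that, assuming the sample df from the question; the name result is only illustrative, and f is condensed slightly from the version above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'address1': ['15 Main Street', '10 High Street', '5 Other Street', np.nan, '15', '12'],
    'address2': ['New York', 'LA', 'London', 'Tokyo', 'Grove Street', 'Garden Street'],
})

def f(x):
    y = x.copy()  # work on a copy so the row passed in is not modified
    try:
        int(x['address1'])
    except (ValueError, TypeError):
        return y  # street text or NaN: return the row unchanged
    y['address1'] = str(x['address1']) + ' ' + str(x['address2'])
    y['address2'] = np.nan
    return y

result = df.apply(f, axis=1)

print(df.loc[4, 'address1'])      # 15  (original frame is untouched)
print(result.loc[4, 'address1'])  # 15 Grove Street

df = result  # keep the cleaned frame
```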