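Applying a function to concatenate certain rows in a DataFrame returns None

Time: 2019-07-18 17:12:02

Tags: python pandas apply

I have some addresses that I want to clean up.

You can see that in the address1 column some entries are just a number, when they should be a number plus a street name like the first three rows.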

import numpy as np
import pandas as pd
df = pd.DataFrame({'address1':['15 Main Street','10 High Street','5 Other Street',np.nan,'15','12'],
                  'address2':['New York','LA','London','Tokyo','Grove Street','Garden Street']})

print(df)

         address1       address2
0  15 Main Street       New York
1  10 High Street             LA
2  5 Other Street         London
3             NaN          Tokyo
4              15   Grove Street
5              12  Garden Street

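I am trying to write a function that checks whether address1 is a number and, if so, concatenates the number from address1 with the street name from address2, then deletes address2.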

My expected output is below. Indices 4 and 5 now have complete address1 entries:

           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN <---
5  12 Garden Street       NaN <---

This is what I tried with the .apply() function:

def f(x):

    try:
        #if address1 is int
        if isinstance(int(x['address1']), int):

            # create new address using address1 + address 2
            newaddress = str(x['address1']) +' '+ str(x['address2'])

            # delete address2
            x['address2'] = np.nan

            # return newaddress to address1 column
            return newadress

    except:
        pass

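Applying the function:

df['address1'] = df.apply(f,axis=1)

However, the address1 column is now all None.

I have tried a few variations of this function but cannot get it to work. Any suggestions would be appreciated.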

3 answers:

Answer 0: (score: 1)

You can create a mask and update the matching rows:

mask = pd.to_numeric(df.address1, errors='coerce').notna()
df.loc[mask, 'address1'] = df.loc[mask, 'address1'] + ' ' +df.loc[mask,'address2']
df.loc[mask, 'address2'] = np.nan

Output:

           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN
5  12 Garden Street       NaN
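
The mask approach above can be wrapped in a small function that leaves the input untouched. This is only a sketch: the name merge_numeric_addresses is hypothetical and not part of the answer, and it assumes address1 holds strings or NaN as in the question's sample data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'address1': ['15 Main Street', '10 High Street', '5 Other Street', np.nan, '15', '12'],
    'address2': ['New York', 'LA', 'London', 'Tokyo', 'Grove Street', 'Garden Street'],
})

def merge_numeric_addresses(frame):
    """Hypothetical helper: return a copy where number-only address1 rows are merged."""
    out = frame.copy()
    # True where address1 parses as a number; NaN and street strings coerce to NaN.
    mask = pd.to_numeric(out['address1'], errors='coerce').notna()
    # Join the house number with the street name held in address2.
    out.loc[mask, 'address1'] = out.loc[mask, 'address1'] + ' ' + out.loc[mask, 'address2']
    # The street name has moved, so blank out address2 on those rows.
    out.loc[mask, 'address2'] = np.nan
    return out

result = merge_numeric_addresses(df)
print(result)
```

Note that pd.to_numeric also accepts values such as '1.5', so a stricter check like str.isdigit may be preferable if such strings can occur.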

Answer 1: (score: 1)

Try this.

Use try/except around a conversion of address1 to int:

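def test(row):
    # 1 if address1 converts to an int, otherwise 0
    try:
        address = int(row['address1'])
        return 1
    except (ValueError, TypeError):
        return 0

# flag column used by the np.where calls below
df['test'] = df.apply(test, axis=1)
df['address1'] = np.where(df['test']==1,df['address1']+ ' '+df['address2'],df['address1'])
df['address2'] = np.where(df['test']==1,np.nan,df['address2'])
df.drop(['test'],axis=1,inplace=True)

Output:
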
        address1    address2
0   15 Main Street    New York
1   10 High Street    LA
2   5 Other Street    London
3   NaN               Tokyo
4   15 Grove Street   NaN
5   12 Garden Street  NaN
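
A possible variation on this answer keeps the flag in a standalone boolean Series instead of a temporary column, so nothing needs to be dropped afterwards. This is a sketch under that assumption; the helper name is_int_like is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'address1': ['15 Main Street', '10 High Street', '5 Other Street', np.nan, '15', '12'],
    'address2': ['New York', 'LA', 'London', 'Tokyo', 'Grove Street', 'Garden Street'],
})

def is_int_like(value):
    """Hypothetical helper: True if int(value) succeeds."""
    try:
        int(value)
        return True
    except (ValueError, TypeError):
        # Street strings and NaN raise ValueError; None would raise TypeError.
        return False

# Boolean Series aligned with df; no temporary column is added to the frame.
flag = df['address1'].apply(is_int_like)

df['address1'] = np.where(flag, df['address1'] + ' ' + df['address2'], df['address1'])
df['address2'] = np.where(flag, np.nan, df['address2'])
print(df)
```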

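Answer 2: (score: 1)

You can avoid apply by using str.isdigit to pick out exactly the rows that need to be modified. Create a mask m to identify those rows. Use agg on those rows and build a sub-dataframe from them. Finally, append it back to the original df.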

m = df.address1.astype(str).str.isdigit()
df1 = df[m].agg(' '.join, axis=1).to_frame('address1').assign(address2=np.nan)

Out[179]:
           address1  address2
4   15 Grove Street       NaN
5  12 Garden Street       NaN

Finally, append it back to df. DataFrame.append was removed in pandas 2.0, so pd.concat is used here:

pd.concat([df[~m], df1])

Out[200]:
           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN
5  12 Garden Street       NaN
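
In the sample data the number-only rows happen to sit at the bottom, so putting the rebuilt rows back at the end preserves the order. When they are scattered through the frame, sorting by index afterwards restores the original order. The sketch below uses a smaller, made-up frame to show this, and swaps astype(str) for fillna('') before str.isdigit; both choices are assumptions for illustration rather than part of the answer.

```python
import numpy as np
import pandas as pd

# Made-up data with number-only rows at positions 0 and 3.
df = pd.DataFrame({
    'address1': ['15', '10 High Street', np.nan, '12'],
    'address2': ['Grove Street', 'LA', 'Tokyo', 'Garden Street'],
})

# Missing values become '' so that isdigit() is simply False for them.
m = df['address1'].fillna('').str.isdigit()

# Join the two columns on the selected rows and blank out address2.
fixed = df[m].agg(' '.join, axis=1).to_frame('address1').assign(address2=np.nan)

# Recombine, then sort by index to restore the original row order.
result = pd.concat([df[~m], fixed]).sort_index()
print(result)
```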

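If you still insist on using apply, you need to modify f so that the return sits outside the if. That way it returns both the unmodified rows and the modified rows: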

def f(x):
    y = x.copy()
    try:
        #if address1 is int
        if isinstance(int(x['address1']), int):

            # create new address using address1 + address 2
            y['address1'] = str(x['address1']) +' '+ str(x['address2'])

            # delete address2
            y['address2'] = np.nan
    except:
        pass

    return y


df.apply(f, axis=1)

Out[213]:
           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN
5  12 Garden Street       NaN

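Note: it is recommended that a function passed to apply should not modify the object it receives, so I do y = x.copy(), then modify and return y.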
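
To see why the question's version produced only None, it may help to run a trimmed copy of it next to a scalar-returning variant. The sketch below is illustrative only: f_original mirrors the question's function (including its newadress typo), while f_scalar is a hypothetical rewrite that fixes address1 only and leaves address2 to a separate step.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'address1': ['15 Main Street', '10 High Street', '5 Other Street', np.nan, '15', '12'],
    'address2': ['New York', 'LA', 'London', 'Tokyo', 'Grove Street', 'Garden Street'],
})

def f_original(x):
    try:
        if isinstance(int(x['address1']), int):
            newaddress = str(x['address1']) + ' ' + str(x['address2'])
            return newadress  # typo: raises NameError, swallowed by the bare except
    except:
        pass
    # No return on this path, so Python returns None implicitly.

all_none = df.apply(f_original, axis=1)

def f_scalar(x):
    """Hypothetical variant: always returns a value for address1."""
    try:
        int(x['address1'])
    except (ValueError, TypeError):
        # Not a bare number (street string or NaN): keep the value as it is.
        return x['address1']
    return str(x['address1']) + ' ' + str(x['address2'])

fixed_address1 = df.apply(f_scalar, axis=1)
print(fixed_address1)
```

Catching only ValueError and TypeError means a mistake such as the misspelled variable name would surface as an error instead of silently turning into None.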