Pandas column string methods in a row-wise function

Time: 2019-03-25 21:10:02

Tags: python string pandas

I am trying to use string methods to compute a new column based on conditions across three other columns.

Sample data:

import numpy as np
import pandas as pd

d = pd.DataFrame({'street1': ['1000 foo dr', '1001 bar dr', '1002 foo dr suite101', '1003 bar dr'],
                  'street2': ['city_a', np.nan, 'suite 101', 'suite 102'],
                  'city': ['city_a', 'city_b', np.nan, 'city_c']})

street1                 street2     city
1000 foo dr             city_a      city_a
1001 bar dr             NaN         city_b
1002 foo dr suite101    suite 101   NaN
1003 bar dr             suite 102   city_c

Desired output:

Address
1000 foo dr
1001 bar dr
1002 foo dr suite 101
1003 bar dr suite 102

The idea here is:

  • If street2 matches city, ignore street2
  • If street2 matches the end of street1, ignore street2
  • Otherwise, concatenate street1 and street2

What I tried:

def address_clean(row):
    if not row['street2']:
        return row['street1']
    if row['street2'] == row['city']:
        return row['street1']
    elif row['street1'].str.replace(' ', '').find(row['street2'].str.replace(' ', '')) != -1:
        return row['street1']
    else:
        return row['street1'] + row['street2']

d.apply(lambda row: address_clean(row), axis=1).head()

This raises an error:

AttributeError: ("'str' object has no attribute 'str'", 'occurred at index 1')

It seems that row['street1'] is a str rather than a pd.Series. However, even after I removed the .str accessors from the original function, leaving:

def address_clean(row):
    if not row['street2']:
        return row['street1']
    if row['street2'] == row['city']:
        return row['street1']
    elif row['street1'].replace(' ', '').find(row['street2'].replace(' ', '')) != -1:
        return row['street1']
    else:
        return row['street1'] + row['street2']

d.apply(lambda row: address_clean(row), axis=1).head()

the code still throws an error:

AttributeError: ("'float' object has no attribute 'replace'", 'occurred at index 1')

I would like to know which part of the function is used incorrectly and how to fix this error.
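Both tracebacks point at index 1, the row where street2 is missing. A minimal sketch of why the `if not row['street2']` guard does not catch that row, using only standard NumPy and pandas calls: with axis=1 each cell arrives as a plain scalar, a missing value is the float NaN, and NaN is truthy.

```python
import numpy as np
import pandas as pd

missing = np.nan  # what row['street2'] holds at index 1

print(type(missing))     # the missing cell is a float, not a str
print(not missing)       # NaN is truthy, so `not missing` is False
print(pd.isna(missing))  # pd.isna is the reliable check for missing values
```

Because the guard evaluates to False, execution falls through to the `.replace` call, which a float does not have.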

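1 answer:

Answer 0 (score: 1)

Searching for a pattern in a Series is easy, but I had to use apply to check whether one column ends with the contents of another. By the way, I had to change your data slightly, because '...suite101' does not end with 'suite 101' unless whitespace is ignored. So I used: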

d = pd.DataFrame({'street1': ['1000 foo dr', '1001 bar dr', '1002 foo dr suite 101', '1003 bar dr'],
                  'street2': ['city_a', np.nan, 'suite 101', 'suite 102'],
                  'city': ['city_a', 'city_b', np.nan, 'city_c']})

print(pd.DataFrame({'Address': np.where(d.street2.str.contains('city', na=True)
               | d.apply(lambda x: x.street1.endswith(str(x.street2)), axis = 1),
               d.street1,
               d.street1.str.cat(d.street2, sep=' '))}))

which gives the expected result:

                 Address
0            1000 foo dr
1            1001 bar dr
2  1002 foo dr suite 101
3  1003 bar dr suite 102
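
As an alternative, here is a hedged sketch of the row-wise function from the question with the missing-value guard replaced by pd.isna. It is not the answer author's method. It assumes street1 is never missing, compares street2 with city directly instead of searching for the literal text 'city', and ignores spaces when testing the suffix, so it runs on the question's original data unchanged.

```python
import numpy as np
import pandas as pd

# Original sample data from the question (street1 contains 'suite101').
d = pd.DataFrame({'street1': ['1000 foo dr', '1001 bar dr', '1002 foo dr suite101', '1003 bar dr'],
                  'street2': ['city_a', np.nan, 'suite 101', 'suite 102'],
                  'city': ['city_a', 'city_b', np.nan, 'city_c']})

def address_clean(row):
    street1, street2, city = row['street1'], row['street2'], row['city']
    # Missing street2: nothing to append. (Assumes street1 is always a str.)
    if pd.isna(street2):
        return street1
    # street2 merely repeats the city: ignore it.
    if street2 == city:
        return street1
    # street2 already ends street1 once spaces are ignored: ignore it.
    if street1.replace(' ', '').endswith(street2.replace(' ', '')):
        return street1
    # Otherwise join the two parts with a single space.
    return street1 + ' ' + street2

d['Address'] = d.apply(address_clean, axis=1)
print(d['Address'])
```

Under these rules row 2 stays '1002 foo dr suite101', because street2 is ignored rather than used to rewrite street1; producing 'suite 101' there would need an extra normalization step that the question does not specify.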