How to efficiently extract (sub)strings from a pandas Series

Time: 2019-08-04 12:19:09

Tags: python-3.x string pandas

I am working on the Titanic dataset on Kaggle:

https://www.kaggle.com/c/titanic/data

I am trying to process all the titles contained in the passenger names.

I can filter with the 'contains' method to display values:

train[~train.Name.str.contains('Mr.|Mrs.|Miss.|Master.|Dr.|Rev.|Jonkheer.|Countess.|Major.|Col.|Capt.|Don.|Mme.|Mlle.')]['Name']

and show what I have not captured yet:

443    Reynaldo, Ms. Encarnacion
Name: Name, dtype: object
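
A side note on the filter above, as a minimal sketch: Series.str.contains treats its pattern as a regular expression by default, so an unescaped "." matches any character and 'Mr.' also matches the "Mrs" in "Mrs.". The three sample names and the shortened titles list below are illustrative assumptions, not the full Kaggle data.

```python
import re

import pandas as pd

# Hypothetical sample standing in for train['Name'].
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Reynaldo, Ms. Encarnacion",
])

# Shortened, illustrative list of titles.
titles = ["Mr.", "Mrs.", "Miss."]

# re.escape turns "Mr." into "Mr\." so the dot only matches a literal dot.
pattern = "|".join(re.escape(t) for t in titles)

unescaped = names.str.contains("Mr.")           # "." matches any character
escaped = names.str.contains(re.escape("Mr."))  # literal "Mr." only

# Names whose title is not covered by the escaped pattern.
not_captured = names[~names.str.contains(pattern)]
print(not_captured)
```

With the dots escaped, each alternative only matches the title followed by a literal period, which is probably what the filter intends.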

So I wrote a mapper function to create another feature:

## title mapper function
def title_mapper(x):
    if x.contains('Mr.'):
        return 'Mr'
    elif x.contains('Mrs.|Mme.'):
        return 'Mrs'
    elif x.contains('Miss.|Mlle.'):
        return 'Miss'
    elif x.contains('Dr.'):
        return 'Dr'
    elif x.contains('Rev.'):
        return 'Cleric'
    elif x.contains('Jonkheer.|Countess.|Don.|Ms.'):
        return 'Noble'
    elif x.contains('Major.|Col.|Capt.'):
        return 'Military'
    else:
        return 'Other'

But it complains that there is no attribute 'contains':

train['Title'] = train['Name'].apply(lambda x: title_mapper(x))


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-63-7c9804f87141> in <module>
     20         return 'Other'
     21 
---> 22 train['Title'] = train['Name'].apply(lambda x: title_mapper(x))

~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3589             else:
   3590                 values = self.astype(object).values
-> 3591                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3592 
   3593         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-63-7c9804f87141> in <lambda>(x)
     20         return 'Other'
     21 
---> 22 train['Title'] = train['Name'].apply(lambda x: title_mapper(x))

<ipython-input-63-7c9804f87141> in title_mapper(x)
      3 ## title mapper function
      4 def title_mapper(x):
----> 5     if x.contains('Mr.'):
      6         return 'Mr'
      7     elif x.contains('Mrs.|Mme.'):

AttributeError: 'str' object has no attribute 'contains'

I looked at this question and its answers and adapted my code:

Does Python have a string 'contains' substring method?

But as far as I understand, multiple patterns cannot be passed like this, even with an r'' prefix on the string (I am using Python 3.7):

'Capt.|Col.'

It only works when every value is hard-coded. Is there a better, more efficient way to do this?

## title mapper function
def title_mapper(x):
    if 'Mr.' in x:
        return 'Mr'
    elif 'Mrs.' in x:
        return 'Mrs'
    elif 'Mme.' in x:
        return 'Mrs'
    elif 'Miss.' in x:
        return 'Miss'
    elif 'Mlle.' in x:
        return 'Miss'
    elif 'Dr.' in x:
        return 'Dr'
    elif 'Rev.' in x:
        return 'Cleric'
    elif 'Jonkheer.' in x:
        return 'Noble'
    elif 'Countess.' in x:
        return 'Noble'
    elif 'Don.' in x:
        return 'Noble'
    elif 'Ms.' in x:
        return 'Noble'
    elif 'Major.' in x:
        return 'Military'
    elif 'Col.' in x:
        return 'Military'
    elif 'Capt.' in x:
        return 'Military'
    else:
        return 'Other'

train['Title'] = train['Name'].apply(lambda x: title_mapper(x))
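
For reference, the AttributeError arises because apply hands each element to the function as a plain Python str, while contains is a method of the Series.str accessor. Likewise, the in operator only tests for a literal substring, so 'Capt.|Col.' is looked up verbatim, pipe included. One possible way to keep the grouped patterns inside a per-element function is re.search from the standard library. The function name title_mapper_re is hypothetical and only a few branches are shown.

```python
import re


def title_mapper_re(name):
    """Hypothetical variant of title_mapper; `name` is a plain Python str."""
    # Dots are escaped so they match a literal period, not any character.
    if re.search(r"Mrs\.|Mme\.", name):
        return "Mrs"
    elif re.search(r"Mr\.", name):
        return "Mr"
    elif re.search(r"Major\.|Col\.|Capt\.", name):
        return "Military"
    else:
        return "Other"


print(title_mapper_re("Crosby, Capt. Edward Gifford"))
```

It would be applied the same way as before, for example train['Name'].apply(title_mapper_re).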

1 Answer:

Answer 0: (score: 1)

If performance matters, use your last solution. It can also be rewritten with a dictionary as the mapper:

d = {'Mr':['Mr.'],
     'Mrs':['Mrs.','Mme.'],
     'Miss':['Miss.','Mlle.'],
     'Dr':['Dr.'],
     'Cleric':['Rev.'],
     'Noble':['Jonkheer.','Countess.','Don.','Ms.'],
     'Military': ['Major.','Col.', 'Capt.']}
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

def title_mapper1(x):
    for k, v in d1.items():
        if k in x:
            return v

train['Title1'] = train['Name'].apply(title_mapper1).fillna('Other')
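
As a possible vectorized alternative, not part of the answer above, the title can be pulled out once with Series.str.extract and then looked up with Series.map. This is a sketch that assumes every name follows the "Surname, Title. Given names" layout, so the first word followed by a period is the title. The sample frame, the title_to_group dictionary and the Title2 column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample standing in for the Kaggle `train` frame.
train = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Aubart, Mme. Leontine Pauline",
    "Crosby, Capt. Edward Gifford",
    "Reynaldo, Ms. Encarnacion",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Duff Gordon, Lady. Lucille",
]})

# Same grouping as the question, keyed by the bare title (no trailing dot).
title_to_group = {
    "Mr": "Mr", "Mrs": "Mrs", "Mme": "Mrs", "Miss": "Miss", "Mlle": "Miss",
    "Dr": "Dr", "Rev": "Cleric",
    "Jonkheer": "Noble", "Countess": "Noble", "Don": "Noble", "Ms": "Noble",
    "Major": "Military", "Col": "Military", "Capt": "Military",
}

# Capture the first run of letters that is followed by a literal period.
raw_title = train["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

# Titles missing from the dictionary become NaN, then 'Other'.
train["Title2"] = raw_title.map(title_to_group).fillna("Other")
print(train[["Name", "Title2"]])
```

Because the lookup compares the whole extracted title rather than searching for a substring, a surname such as "Gordon" is not mistaken for the title "Don.".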