Question

I am working with a dataset and have ended up with a list of names of the following form:

s = ['DR. James Coffins',
 'Zacharias Pallefas',
 'Matthew Ebnel',
 'Ranzzith Redly',
 'GEORGE GEORGIADAKIS',
 'HARISH KUMARAN K',
 'Christiaan Kraanlen, CFA',
 'Mary K. Lein, CFA, COL',
'Alexandre Cegra,  CFA,  CAIA'
 'Anna Bely']

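I have to extract the last names and put them in a separate list (or a column in a pandas DataFrame). However, I am confused by the many forms the full names take, and I am new to Python.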

A possible algorithm is the following:

Loop over the elements of the list. For each element, split it into subelements on whitespace. Then:

 a) If there are four or fewer subelements, start from the beginning and
     examine the first four subelements.
     a1) If the first subelement is longer than 2 letters, then: if the
              second subelement is longer than one letter, return the second
              subelement. Otherwise, return the third subelement.
     a2) If the first subelement is 2 letters, drop it and repeat
         step a1.
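
For reference, the steps above could be sketched roughly as follows. This is only one reading of the rules, not a tested solution: the function name `last_name_by_rule` is made up for illustration, "letters" is assumed to mean alphabetic characters only (so `DR.` counts as 2), names with more than four subelements are assumed to be handled by looking at just the first four, and trailing commas are stripped.

```python
def last_name_by_rule(full_name):
    """Apply steps a1/a2 to the first four whitespace-separated tokens."""
    def letters(token):
        # Count alphabetic characters only, so 'DR.' counts as 2 letters.
        return sum(ch.isalpha() for ch in token)

    # Assumption: longer names are handled by examining the first four tokens.
    tokens = full_name.split()[:4]

    # a2) a two-letter first token is treated as a title and dropped
    if tokens and letters(tokens[0]) == 2:
        tokens = tokens[1:]

    # a1) take the second token unless it is a single-letter initial,
    #     in which case fall back to the third token
    if len(tokens) >= 2 and letters(tokens[1]) > 1:
        picked = tokens[1]
    elif len(tokens) >= 3:
        picked = tokens[2]
    else:
        return None  # the rules do not cover this case

    return picked.rstrip(',')


print(last_name_by_rule('DR. James Coffins'))
print(last_name_by_rule('Mary K. Lein, CFA, COL'))
```

Note that step a2 would also drop a genuine two-letter first name, so a name such as 'Ed Smith' (a hypothetical input) is left with a single token and returns None here.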

Any suggestions would be appreciated.

Answer 1

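How about always grabbing the second element of each line, after skipping any word that contains a . or appears in the exclusion list ['dr', 'mr', 'mrs', 'mrs', 'miss', 'prof']?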

>>> exclude_tags = ['dr', 'mr', 'mrs', 'mrs', 'miss', 'prof']
>>> [[y for y in x.split() if '.' not in y and y.lower() not in exclude_tags][1].rstrip(',').capitalize() for x in s]
['Coffins', 'Pallefas', 'Ebnel', 'Redly', 'Georgiadakis', 'Kumaran', 'Kraanlen', 'Lein', 'Cegra']
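
If the one-liner is hard to read, the same idea can be spelled out as a small function. This is a sketch rather than a drop-in replacement: the names `EXCLUDE_TAGS` and `second_word` are made up here, the input list is a shortened sample, 'Cher' is a hypothetical single-word input, and returning None for names with fewer than two remaining words is an added assumption (the one-liner would raise an IndexError instead).

```python
EXCLUDE_TAGS = {'dr', 'mr', 'mrs', 'miss', 'prof'}


def second_word(full_name, exclude=EXCLUDE_TAGS):
    """Return the second word left after dropping titles and initials."""
    # Drop anything containing a period ('DR.', 'K.') and any bare title.
    words = [w for w in full_name.split()
             if '.' not in w and w.lower() not in exclude]
    if len(words) < 2:
        return None  # assumption: signal "no guess" instead of raising
    # Strip the trailing comma left by suffixes such as ', CFA'.
    return words[1].rstrip(',').capitalize()


names = ['DR. James Coffins', 'Mary K. Lein, CFA, COL',
         'HARISH KUMARAN K', 'Anna Bely', 'Cher']
print([second_word(n) for n in names])
```

Keep in mind that str.capitalize() lowercases everything after the first character, so a surname written with an internal capital letter would lose it.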

Answer 2

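For anyone else who stumbles upon this question: keep in mind that it is generally impossible to extract a person's last name perfectly from their full name, and go read Falsehoods Programmers Believe About Names.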

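Sunitha's solution will fail for anyone whose last name consists of multiple tokens (van Gogh), who has more than one last name (Gonzalez Ramirez), whose given name has multiple tokens (Mary Jane Watson), who chose to put their middle name into whatever system created this list, who comes from an Asian culture where the order of given name and family name is sometimes reversed, and so on.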
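
To make a few of these failure modes concrete, here is a small check of the "second remaining word" approach against three of the names mentioned. The helper name `second_word` is made up, the first names 'Vincent' and 'Maria' are added only to form complete inputs, and the "expected" values reflect one plausible reading of each name.

```python
exclude_tags = ['dr', 'mr', 'mrs', 'miss', 'prof']


def second_word(full_name):
    # Same filtering idea as the one-liner: skip titles and initials,
    # then take the second remaining word.
    words = [w for w in full_name.split()
             if '.' not in w and w.lower() not in exclude_tags]
    return words[1].rstrip(',').capitalize()


# Full name -> the surname a human reader would probably expect.
examples = {
    'Vincent van Gogh': 'van Gogh',
    'Maria Gonzalez Ramirez': 'Gonzalez Ramirez',
    'Mary Jane Watson': 'Watson',
}

for full_name, expected in examples.items():
    got = second_word(full_name)
    print(f'{full_name!r}: got {got!r}, expected {expected!r}')
```

In each case the guess is either a fragment of the surname or part of the given name, which is why results from this kind of heuristic are worth spot-checking by hand.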

Extracting last names from a list of full names using Python/pandas and possibly regex
