Regex to clean up names

Time: 2020-10-08 19:25:34

Tags: python regex

I have two data frames of names. The data frames are longer, but I'm using the top 3 rows of each as examples.

First list name examples: 
JOSEPH W. JOHN
MIMI N. ALFORD
WANG E. Li

Second list name examples:
AAMIR, DENNIS M
MAHAMMED, LINDA X
ABAD, FARLEY J

I need to extract the first names from both dfs. How can I extract them with a single regular expression?

The result should be:
list 1
JOSEPH
MIMI
WANG

list 2
DENNIS
LINDA
FARLEY

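My current code looks like re.search(r'(?<=,)\w+', df['name']), but it doesn't return the right names. Is it possible to write regex code in Python that extracts these names from both lists?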

2 Answers:

Answer 0: (score: 2)

Use

df['First Name'] = df['name'].str.extract(r'(?:(?<=^(?!.*,))|(?<=, ))([A-Z]+)', expand=False)

See proof.

Explanation

--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
      ^                        the beginning of the string
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        .*                       any character except \n (0 or more
                                 times (matching the most amount
                                 possible))
--------------------------------------------------------------------------------
        ,                        ','
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
    )                        end of look-behind
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
      ,                        ', '
--------------------------------------------------------------------------------
    )                        end of look-behind
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
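
As a quick check, here is a minimal sketch that applies the pattern above to the six sample names from the question. The names and the `First Name` column come from the question and the answer; building the data frame inline is only an assumption made for illustration.

```python
import pandas as pd

# Sample names from the question: the first three have no comma,
# the last three are in "LAST, FIRST M" order.
df = pd.DataFrame({'name': [
    'JOSEPH W. JOHN', 'MIMI N. ALFORD', 'WANG E. Li',
    'AAMIR, DENNIS M', 'MAHAMMED, LINDA X', 'ABAD, FARLEY J',
]})

# Either start of string with no comma anywhere in the line,
# or a position right after ", "; then capture a run of capitals.
pattern = r'(?:(?<=^(?!.*,))|(?<=, ))([A-Z]+)'
df['First Name'] = df['name'].str.extract(pattern, expand=False)

print(df['First Name'].tolist())
```

Note that the capture group is `[A-Z]+`, so only uppercase ASCII letters are captured; a mixed-case first name would be cut off at its first lowercase letter, and rows with no match come back as NaN.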

Answer 1: (score: 1)

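What you seem to be looking for here is the first sequence of word characters that has no comma anywhere after it in the line, rather than one that has a comma immediately before it. So it seems you want a negative lookahead assertion rather than a positive lookbehind assertion.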
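
To make the point concrete, the following sketch (plain `re`, with two sample names taken from the question) shows what the lookbehind from the question does on a single string. The variable names are made up for illustration. Separately, passing a whole pandas Series to `re.search` raises a `TypeError`, because `re.search` expects one string.

```python
import re

samples = ['JOSEPH W. JOHN', 'AAMIR, DENNIS M']

# Pattern from the question: word characters immediately after a comma.
after_comma = re.compile(r'(?<=,)\w+')
# Variant that also steps over the space following the comma.
after_comma_space = re.compile(r'(?<=, )\w+')

for s in samples:
    m1 = after_comma.search(s)
    m2 = after_comma_space.search(s)
    print(repr(s), m1, m2.group() if m2 else None)

# The comma is followed by a space, so the first pattern never matches;
# the second one works only for names that contain a comma.
strict_results = [after_comma.search(s) for s in samples]
spaced_results = [m.group() if (m := after_comma_space.search(s)) else None
                  for s in samples]
```

Neither lookbehind covers names written without a comma, which is why a different kind of assertion is needed.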

Try using this as the regex:

r'\w+(?!.*,)'

Apply it with the following (this needs import re):

df['name'].apply(lambda name: re.search(r'\w+(?!.*,)', name).group())

Applying the above to this example data frame:

                name   foo
0     JOSEPH W. JOHN     1
1     MIMI N. ALFORD     3
2         WANG E. Li     3
3    AAMIR, DENNIS M     3
4  MAHAMMED, LINDA X     3
5     ABAD, FARLEY J     3

gives:

0    JOSEPH
1      MIMI
2      WANG
3    DENNIS
4     LINDA
5    FARLEY
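
One caveat with the one-liner above: `re.search` returns `None` when nothing matches, so `.group()` would raise an `AttributeError` on such a row. A more defensive variant might look like the sketch below. The helper name `first_name` and the choice to return `None` for missing or non-string values are assumptions, not part of the original answer.

```python
import re

# Same pattern as above: a run of word characters with no comma after it.
FIRST_NAME = re.compile(r'\w+(?!.*,)')

def first_name(name):
    """Return the first name, or None if the value is not a string or has no match."""
    if not isinstance(name, str):
        return None
    match = FIRST_NAME.search(name)
    return match.group() if match else None

names = [
    'JOSEPH W. JOHN', 'MIMI N. ALFORD', 'WANG E. Li',
    'AAMIR, DENNIS M', 'MAHAMMED, LINDA X', 'ABAD, FARLEY J',
]
result = [first_name(n) for n in names]
print(result)
```

The same function could then be used in place of the lambda, as in `df['name'].apply(first_name)`.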