Question

I have two dataframes of names. The dataframes are longer, but I am showing the top 3 rows as an example.

First list name examples: 
JOSEPH W. JOHN
MIMI N. ALFORD
WANG E. Li

Second list name examples:
AAMIR, DENNIS M
MAHAMMED, LINDA X
ABAD, FARLEY J

I need to extract the first names from both dataframes. How can I extract them with a single regular expression?

The return should be 
list 1
JOSEPH
MIMI
WANG

list 2
DENNIS
LINDA
FARLEY

My current code looks like re.search(r'(?<=,)\w+', df['name']), but it does not return the correct names. Is it possible to write two regular expressions in Python to extract these names?

Answer 1

Use

df['First Name'] = df['name'].str.extract(r'(?:(?<=^(?!.*,))|(?<=, ))([A-Z]+)', expand=False)

See proof.

Explanation

--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
      ^                        the beginning of the string
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        .*                       any character except \n (0 or more
                                 times (matching the most amount
                                 possible))
--------------------------------------------------------------------------------
        ,                        ','
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
    )                        end of look-behind
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
      ,                        ', '
--------------------------------------------------------------------------------
    )                        end of look-behind
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
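
As a quick check of the pattern outside pandas, here is a minimal sketch that runs the same regular expression with the standard `re` module over the six sample names from the question. The names `FIRST_NAME`, `samples`, and `extracted` are illustrative and not part of the original answer.

```python
import re

# Pattern copied unchanged from the answer above.
FIRST_NAME = re.compile(r"(?:(?<=^(?!.*,))|(?<=, ))([A-Z]+)")

samples = [
    "JOSEPH W. JOHN",
    "MIMI N. ALFORD",
    "WANG E. Li",
    "AAMIR, DENNIS M",
    "MAHAMMED, LINDA X",
    "ABAD, FARLEY J",
]

# group(1) is the captured run of uppercase letters.
extracted = [FIRST_NAME.search(s).group(1) for s in samples]
print(extracted)
```

Because the capture group is `[A-Z]+`, only uppercase letters are taken, so a mixed-case first name such as "Wang" would be cut off after its first letter.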

Answer 2

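What you seem to be looking for here is the first sequence of word characters that has no comma anywhere after it in the line, rather than one that has a comma directly before it. So it seems you want a negative lookahead assertion rather than a positive lookbehind assertion.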
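
To illustrate the difference, here is a minimal sketch using only the standard `re` module on two of the sample names. The variable names are illustrative. Note that `re.search` takes a single string, so it has to be called once per name rather than on a whole column.

```python
import re

names = ["JOSEPH W. JOHN", "AAMIR, DENNIS M"]

# Pattern from the question: a word that starts immediately after a comma.
lookbehind = re.compile(r"(?<=,)\w+")
# Negative lookahead: a word with no comma anywhere after it on the line.
lookahead = re.compile(r"\w+(?!.*,)")

for name in names:
    before = lookbehind.search(name)
    after = lookahead.search(name)
    print(name, "|", before.group() if before else None, "|", after.group())
```

The lookbehind finds nothing in either name: "JOSEPH W. JOHN" has no comma at all, and in "AAMIR, DENNIS M" the comma is followed by a space, which `\w` cannot match.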

Try using this as the regular expression:

r'\w+(?!.*,)'

Apply it with:

import re

df['name'].apply(lambda name: re.search(r'\w+(?!.*,)', name).group())

Applying the above to this example dataframe:

                name  foo
0     JOSEPH W. JOHN    1
1     MIMI N. ALFORD    3
2         WANG E. Li    3
3    AAMIR, DENNIS M    3
4  MAHAMMED, LINDA X    3
5     ABAD, FARLEY J    3

gives:

0    JOSEPH
1      MIMI
2      WANG
3    DENNIS
4     LINDA
5    FARLEY
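
One caveat is that `re.search` returns `None` when a value contains no match, and calling `.group()` on `None` raises an `AttributeError`. A minimal sketch of a more defensive variant follows. The helper name `first_name` is hypothetical, not from the original answer.

```python
import re
from typing import Optional

# Same pattern as in this answer, compiled once for reuse.
NO_COMMA_AFTER = re.compile(r"\w+(?!.*,)")


def first_name(name: str) -> Optional[str]:
    """Return the first word with no comma after it, or None if there is none."""
    match = NO_COMMA_AFTER.search(name)
    return match.group() if match else None


print(first_name("MAHAMMED, LINDA X"))
print(first_name(""))
```

With a dataframe like the one above, this could be applied as df['name'].apply(first_name), which leaves None in rows where nothing matches instead of raising.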
