I have two data frames of names. The data frames are longer, but I am showing the top 3 rows of each as examples.
First list name examples:
JOSEPH W. JOHN
MIMI N. ALFORD
WANG E. Li
Second list name examples:
AAMIR, DENNIS M
MAHAMMED, LINDA X
ABAD, FARLEY J
I need to extract the first names from both data frames. How can I extract them with a single regular expression?
The return should be
list 1
JOSEPH
MIMI
WANG
list 2
DENNIS
LINDA
FARLEY
My current code looks like re.search(r'(?<=,)\w+', df['name']), but it does not return the correct names. Is it possible to write two regular expressions in Python to extract these names?
Answer 0 (score: 2)
Use
df['First Name'] = df['name'].str.extract(r'(?:(?<=^(?!.*,))|(?<=, ))([A-Z]+)', expand=False)
See proof.
Explanation
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
      ^                        the beginning of the string
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        .*                       any character except \n (0 or more
                                 times (matching the most amount
                                 possible))
--------------------------------------------------------------------------------
        ,                        ','
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
    )                        end of look-behind
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
      ,                        ', '
--------------------------------------------------------------------------------
    )                        end of look-behind
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
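To see the pattern in action, here is a minimal sketch that applies it to the six sample names from the question. The data frame is built by hand here purely for illustration; the column names 'name' and 'First Name' follow the snippet in this answer.

```python
import pandas as pd

# Sample names copied from the question (assumed to live in a 'name' column).
df = pd.DataFrame({
    "name": [
        "JOSEPH W. JOHN",
        "MIMI N. ALFORD",
        "WANG E. Li",
        "AAMIR, DENNIS M",
        "MAHAMMED, LINDA X",
        "ABAD, FARLEY J",
    ]
})

# Match either at the start of a string that contains no comma,
# or right after ", ", then capture a run of uppercase letters.
pattern = r'(?:(?<=^(?!.*,))|(?<=, ))([A-Z]+)'

# With a single capture group and expand=False, str.extract returns a Series.
df["First Name"] = df["name"].str.extract(pattern, expand=False)

print(df["First Name"].tolist())
# ['JOSEPH', 'MIMI', 'WANG', 'DENNIS', 'LINDA', 'FARLEY']
```

Note that [A-Z]+ only captures uppercase letters, so a mixed-case first name would be cut off at the first lowercase letter; rows with no match come back as NaN.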
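Answer 1 (score: 1)
What you appear to be looking for is the first sequence of word characters that has no comma anywhere after it on the line, rather than one that does have a comma before it. So you want a negative lookahead assertion rather than a positive lookbehind assertion.
Try using this as your regular expression: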
r'\w+(?!.*,)'
Apply it as follows (this requires import re):
df['name'].apply(lambda name: re.search(r'\w+(?!.*,)', name).group())
Applying the above to this example data frame:
                name  foo
0     JOSEPH W. JOHN    1
1     MIMI N. ALFORD    3
2         WANG E. Li    3
3    AAMIR, DENNIS M    3
4  MAHAMMED, LINDA X    3
5     ABAD, FARLEY J    3
gives:
0 JOSEPH
1 MIMI
2 WANG
3 DENNIS
4 LINDA
5 FARLEY
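One caveat with the lambda above: re.search returns None when nothing matches, so .group() would raise AttributeError on an empty name, and a missing value (NaN) would raise TypeError. A minimal sketch of a more defensive variant follows; the helper name first_name is hypothetical and not part of the original answer.

```python
import re

# Word characters that are not followed by a comma anywhere later in the string.
FIRST_NAME = re.compile(r'\w+(?!.*,)')

def first_name(name):
    """Return the first name, or None if the input is not a string or has no match.

    Hypothetical helper for illustration only.
    """
    if not isinstance(name, str):
        return None  # e.g. NaN cells in a data frame
    match = FIRST_NAME.search(name)
    return match.group() if match else None

names = [
    "JOSEPH W. JOHN",
    "MIMI N. ALFORD",
    "WANG E. Li",
    "AAMIR, DENNIS M",
    "MAHAMMED, LINDA X",
    "ABAD, FARLEY J",
    "",  # empty string: no match
]

print([first_name(n) for n in names])
# ['JOSEPH', 'MIMI', 'WANG', 'DENNIS', 'LINDA', 'FARLEY', None]
```

With a data frame, this could be used as df['name'].apply(first_name).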