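I want to split a person's full name into several strings. I can easily extract the first name and the last name, but I run into problems extracting the middle name(s), because they vary a lot from one record to the next.
The data looks like this: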
ID| Complete_Name | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"
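Some names have only a first and a last name, while others have one or more middle names after the first name. How can I extract the middle names from a Pandas DataFrame? I can already extract the first and last names.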
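import numpy as np
import pandas as pd
df = pd.read_csv("list.pip", sep="|")
# For individuals (Type "I"): text after the comma is the given name(s), text before it is the surname
df["First Name"] = np.where(df["Type"] == "I", df["Complete_Name"].str.split(",").str.get(1), "")
df["Last Name"] = np.where(df["Type"] == "I", df["Complete_Name"].str.split(",").str.get(0), "")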
The desired result should look like this:
ID| Complete_Name | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | | VON HELSINKI
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| JENSEN
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ
Answer 0 (score: 5)
A single str.extract call will work here:
p = r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)'
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')
ID Complete_Name Type Last_Name First_Name Middle_Name
0 1 JERRY, Ben I JERRY Ben
1 2 VON HELSINKI, Olga I VON HELSINKI Olga
2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la
Regex breakdown
^ # Start-of-line
(?P<Last_Name> # First named capture group - Last Name
.* # Match anything until...
)
, # ...we see a comma
\s # whitespace
(?P<First_Name> # Second capture group - First Name
\S+ # Match all non-whitespace characters
)
\b # Word boundary
\s* # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name> # Third capture group - Zero or more middle names
.* # Match everything till the end of string
)
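The breakdown above can also be written as a commented pattern and tried outside pandas. The following is a minimal sketch using Python's built-in `re` module with the `re.VERBOSE` flag; the helper name `split_name` is hypothetical and only illustrates how the three named groups behave on the sample names.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so that
# whitespace and comments inside the pattern are ignored.
NAME_PATTERN = re.compile(
    r"""
    ^(?P<Last_Name>.*)      # everything before the last ", "
    ,\s                     # a comma followed by one whitespace character
    (?P<First_Name>\S+)     # first run of non-whitespace characters
    \b\s*                   # word boundary, then optional whitespace
    (?P<Middle_Name>.*)     # the rest of the string (may be empty)
    """,
    re.VERBOSE,
)


def split_name(complete_name):
    """Hypothetical helper: return the named groups, or None if there is no match."""
    match = NAME_PATTERN.match(complete_name)
    return match.groupdict() if match else None


print(split_name("CRUZ, Juan S. de la"))
print(split_name("THE COMPANY"))  # no comma, so no match
```

One thing to watch for: because of the `\b`, a first name that ends in a period (for example an initial such as "J.") is cut before the period, so "DOE, J. Robert" yields "J" as the first name and ". Robert" as the middle name.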
Answer 1 (score: 3)
I think you can do it like this:
# take the Complete_Name column for individuals and split it on the comma
df2 = (df.loc[df['Type'].eq('I'), 'Complete_Name'].str
         .split(',', expand=True)
         .fillna(''))
# remove extra spaces
for col in df2.columns:
    df2[col] = df2[col].str.strip()
# split the given names on the first space only and join them with the last name
df2 = pd.concat([df2[0], df2[1].str.split(' ', n=1, expand=True)], axis=1)
df2.columns = ['last', 'first', 'middle']
# join the data frames
df = pd.concat([df[['ID', 'Complete_Name', 'Type']], df2], axis=1)
# rearrange columns - not necessary though
df = df[['ID', 'Complete_Name', 'Type', 'first', 'middle', 'last']]
# replace missing values with empty strings
df = df.fillna('')
ID Complete_Name Type first middle last
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
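The key step in this answer is splitting the given names on the first space only. As a rough illustration of that idea without pandas, here is a plain-Python sketch; `split_complete_name` is a hypothetical helper, and it assumes the "LAST, First Middle ..." layout from the question's sample data.

```python
def split_complete_name(complete_name):
    """Return (last, first, middle) for a 'LAST, First Middle ...' string.

    Hypothetical helper: assumes the first comma separates the surname
    from the given names, as in the question's sample data.
    """
    if "," not in complete_name:
        # No comma (e.g. a company name): nothing to split, return blanks
        return ("", "", "")
    last, given = (part.strip() for part in complete_name.split(",", 1))
    # maxsplit=1 keeps everything after the first space together
    pieces = given.split(" ", 1)
    first = pieces[0]
    middle = pieces[1].strip() if len(pieces) > 1 else ""
    return (last, first, middle)


for name in ["JERRY, Ben", "JENSEN, James Goodboy Dean", "THE COMPANY"]:
    print(name, "->", split_complete_name(name))
```

A function like this could presumably be applied row by row with `Series.apply`, although the vectorized `str.split` calls in the answer are usually the better fit for a DataFrame.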
Answer 2 (score: 1)
Here is another answer that uses simple lambda functions.
import numpy as np
import pandas as pd
""" Create data and data frame """
info_dict = {
'ID': [1,2,3,4,5,],
'Complete_Name':[
'JERRY, Ben',
'VON HELSINKI, Olga',
'JENSEN, James Goodboy Dean',
'THE COMPANY',
'CRUZ, Juan S. de la',
],
'Type':['I','I','I','C','I',],
}
data = pd.DataFrame(info_dict, columns = info_dict.keys())
""" List of columns to add """
name_cols = [
'First Name',
'Middle Name',
'Last Name',
]
"""
Use partition() to separate first and middle names into Pandas series.
Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
values that we want.
"""
NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())
# We can use index positions to quickly add columns to the dataframe.
# The partition() function will keep the delimited value in the 1 index, so we'll use
# the 0 and 2 index positions for first and middle names.
data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]
# Finally, we'll add our Last Names column
data[name_cols[2]] = LAST_NAMES
# Optional: We can replace all blank values with NaN using a regular expression.
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0.)
data = data.replace(r'^$', np.nan, regex=True)
Then you should get something like this:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben NaN JERRY
1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C NaN NaN NaN
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Alternatively, replace the NaN values with empty strings:
data = data.replace(np.nan, r'', regex=False)
Then you have:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
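For reference, the blank middle names in this approach come from how partition behaves: it always returns three parts, and when the separator is not found the last two are empty strings rather than missing values. The short sketch below shows this with Python's built-in `str.partition`; the variable names are only illustrative.

```python
# str.partition always returns a 3-tuple: (head, separator, tail)
with_middle = "James Goodboy Dean".partition(" ")
without_middle = "Ben".partition(" ")

print(with_middle)     # ('James', ' ', 'Goodboy Dean')
print(without_middle)  # ('Ben', '', '')
```

This is why the answer needs the `^$` replacement to turn blanks into NaN: partition itself never produces a missing value.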