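I'm working with a large amount of data that includes the standard five columns for a person's name (prefix, first name, middle name, last name, suffix), and I want to combine them into a single column as a readable name. The problem I'm running into is blank values: they cause spacing issues. Also, I can't modify the original columns. My current process feels a little crazy (but it works!), so I'm looking for a more elegant solution.
My current code: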
def add_space_prefix(x):
    # A non-empty prefix is followed by a space.
    x = str(x)
    if len(x) > 0:
        return x + ' '
    else:
        return x

def add_space_middle(x):
    # A non-empty middle name is preceded by a space.
    x = str(x)
    if len(x) > 0:
        return ' ' + x
    else:
        return x

def add_space_suffix(x):
    # A non-empty suffix is preceded by a comma and a space.
    x = str(x)
    if len(x) > 0:
        return ', ' + x
    else:
        return x

df["middlename"] = df["middlename"].map(lambda x: add_space_middle(x))
df["prefix"] = df["prefix"].map(lambda x: add_space_prefix(x))
df["suffix"] = df["suffix"].map(lambda x: add_space_suffix(x))
df['fullname'] = (df["prefix"] + df["firstname"] + df["middlename"]
                  + ' ' + df["lastname"] + df['suffix'])
Sample DataFrame
   prefix  firstname  middlename  lastname  suffix  fullname
0          Michael                Hobart    Jr.     Michael Hobart, Jr.
1  Mr.     Alan                   Lilt              Mr. Alan Lilt
2          Jon        A.          Smith     III     Jon A. Smith, III
3          Joe                    Miller            Joe Miller
4          Mika       Jennifer    Shabosky          Mika Jennifer Shabosky
5  Mrs.    Angela                 Calder            Mrs. Angela Calder
6          Boris      Al          Bert      Esq.    Boris Al Bert, Esq.
7  Dr.     Natasha                Chorus            Dr. Natasha Chorus
8          Bill                   Gibbons           Bill Gibbons
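For anyone who wants to reproduce this, the five input columns of the sample could be built roughly as follows. Using empty strings for the missing parts is an assumption here; the question only says the values are blank.

```python
import pandas as pd

# Each tuple is (prefix, firstname, middlename, lastname, suffix).
# Empty strings stand in for the blank values (an assumption).
rows = [
    ("", "Michael", "", "Hobart", "Jr."),
    ("Mr.", "Alan", "", "Lilt", ""),
    ("", "Jon", "A.", "Smith", "III"),
    ("", "Joe", "", "Miller", ""),
    ("", "Mika", "Jennifer", "Shabosky", ""),
    ("Mrs.", "Angela", "", "Calder", ""),
    ("", "Boris", "Al", "Bert", "Esq."),
    ("Dr.", "Natasha", "", "Chorus", ""),
    ("", "Bill", "", "Gibbons", ""),
]
df = pd.DataFrame(
    rows,
    columns=["prefix", "firstname", "middlename", "lastname", "suffix"],
)
print(df)
```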
Answer 0 (score: 4)
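Option 1
' '.join and pd.Series.str
In this solution, we join each whole row with spaces. That can leave spaces at the start or end of the string, or runs of two or more spaces in the middle. We deal with that by chaining string accessor methods. The comma is appended to lastname first, and the final strip(' ,') removes it again when there is no suffix.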
df.assign(
    lastname=df.lastname + ','
).apply(' '.join, axis=1).str.replace(r'\s+', ' ', regex=True).str.strip(' ,')
0 Michael Hobart, Jr.
1 Mr. Alan Lilt
2 Jon A. Smith, III
3 Joe Miller
4 Mika Jennifer Shabosky
5 Mrs. Angela Calder
6 Boris Al Bert, Esq.
7 Dr. Natasha Chorus
8 Bill Gibbons
dtype: object
df['fullname'] = df.assign(
    lastname=df.lastname + ','
).apply(' '.join, axis=1).str.replace(r'\s+', ' ', regex=True).str.strip(' ,')
df
   prefix  firstname  middlename  lastname  suffix  fullname
0          Michael                Hobart    Jr.     Michael Hobart, Jr.
1  Mr.     Alan                   Lilt              Mr. Alan Lilt
2          Jon        A.          Smith     III     Jon A. Smith, III
3          Joe                    Miller            Joe Miller
4          Mika       Jennifer    Shabosky          Mika Jennifer Shabosky
5  Mrs.    Angela                 Calder            Mrs. Angela Calder
6          Boris      Al          Bert      Esq.    Boris Al Bert, Esq.
7  Dr.     Natasha                Chorus            Dr. Natasha Chorus
8          Bill                   Gibbons           Bill Gibbons
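To make the chain in Option 1 easier to follow, here is a rough step-by-step version on a two-row frame. The intermediate names (with_comma, joined) are only illustrative, and the small frame is made up from the sample rows.

```python
import pandas as pd

df = pd.DataFrame({
    "prefix": ["", "Mr."],
    "firstname": ["Michael", "Alan"],
    "middlename": ["", ""],
    "lastname": ["Hobart", "Lilt"],
    "suffix": ["Jr.", ""],
})

# Step 1: append the comma to lastname on a copy; assign() leaves df untouched.
with_comma = df.assign(lastname=df["lastname"] + ",")

# Step 2: join every row with single spaces; blank cells leave extra spaces.
joined = with_comma.apply(" ".join, axis=1)

# Step 3: collapse whitespace runs, then trim stray spaces and commas.
fullname = joined.str.replace(r"\s+", " ", regex=True).str.strip(" ,")

print(joined.tolist())    # [' Michael  Hobart, Jr.', 'Mr. Alan  Lilt, ']
print(fullname.tolist())  # ['Michael Hobart, Jr.', 'Mr. Alan Lilt']
```

Passing regex=True explicitly matters on newer pandas releases, where str.replace treats the pattern as a literal string by default.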
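Option 2
List comprehension
In this solution, we do the same thing as in the first one, but we bundle the string operations together inside a comprehension.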
import re

[re.sub(r'\s+', ' ', ' '.join(s)).strip(' ,')
 for s in df.assign(lastname=df.lastname + ',').values.tolist()]
['Michael Hobart, Jr.',
'Mr. Alan Lilt',
'Jon A. Smith, III',
'Joe Miller',
'Mika Jennifer Shabosky',
'Mrs. Angela Calder',
'Boris Al Bert, Esq.',
'Dr. Natasha Chorus',
'Bill Gibbons']
df['fullname'] = [re.sub(r'\s+', ' ', ' '.join(s)).strip(' ,')
                  for s in df.assign(lastname=df.lastname + ',').values.tolist()]
df
   prefix  firstname  middlename  lastname  suffix  fullname
0          Michael                Hobart    Jr.     Michael Hobart, Jr.
1  Mr.     Alan                   Lilt              Mr. Alan Lilt
2          Jon        A.          Smith     III     Jon A. Smith, III
3          Joe                    Miller            Joe Miller
4          Mika       Jennifer    Shabosky          Mika Jennifer Shabosky
5  Mrs.    Angela                 Calder            Mrs. Angela Calder
6          Boris      Al          Bert      Esq.    Boris Al Bert, Esq.
7  Dr.     Natasha                Chorus            Dr. Natasha Chorus
8          Bill                   Gibbons           Bill Gibbons
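One thing to watch with the comprehension: it joins every column in the frame, so once fullname exists, running it again would fold the old fullname into the new one. A minimal sketch of a wrapper that only looks at the five name columns could look like this. build_fullname and NAME_COLS are hypothetical names, not part of pandas.

```python
import re

import pandas as pd

# Assumed column names, taken from the question.
NAME_COLS = ["prefix", "firstname", "middlename", "lastname", "suffix"]

def build_fullname(frame, cols=NAME_COLS):
    """Hypothetical helper: join the name parts of each row into one string."""
    # Select only the name columns, then add the comma on a copy.
    parts = frame[cols].assign(lastname=frame["lastname"] + ",")
    return [
        re.sub(r"\s+", " ", " ".join(row)).strip(" ,")
        for row in parts.values.tolist()
    ]

df = pd.DataFrame(
    [
        ("", "Jon", "A.", "Smith", "III"),
        ("Mrs.", "Angela", "", "Calder", ""),
        ("", "Bill", "", "Gibbons", ""),
    ],
    columns=NAME_COLS,
)

df["fullname"] = build_fullname(df)
# Safe to run again: the existing fullname column is ignored.
df["fullname"] = build_fullname(df)
print(df["fullname"].tolist())
```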
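Option 3
pd.DataFrame.replace and pd.DataFrame.stack
This one is a little different: we replace the blanks '' with np.nan, so that the np.nan values are dropped naturally when we stack. That makes the ' ' join more straightforward.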
df.assign(
    lastname=df.lastname + ','
).replace('', np.nan).stack().groupby(level=0).apply(' '.join).str.strip(',')
0 Michael Hobart, Jr.
1 Mr. Alan Lilt
2 Jon A. Smith, III
3 Joe Miller
4 Mika Jennifer Shabosky
5 Mrs. Angela Calder
6 Boris Al Bert, Esq.
7 Dr. Natasha Chorus
8 Bill Gibbons
dtype: object
df['fullname'] = df.assign(
    lastname=df.lastname + ','
).replace('', np.nan).stack().groupby(level=0).apply(' '.join).str.strip(',')
df
   prefix  firstname  middlename  lastname  suffix  fullname
0          Michael                Hobart    Jr.     Michael Hobart, Jr.
1  Mr.     Alan                   Lilt              Mr. Alan Lilt
2          Jon        A.          Smith     III     Jon A. Smith, III
3          Joe                    Miller            Joe Miller
4          Mika       Jennifer    Shabosky          Mika Jennifer Shabosky
5  Mrs.    Angela                 Calder            Mrs. Angela Calder
6          Boris      Al          Bert      Esq.    Boris Al Bert, Esq.
7  Dr.     Natasha                Chorus            Dr. Natasha Chorus
8          Bill                   Gibbons           Bill Gibbons
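Option 3 relies on stack dropping the np.nan values. As far as I know, that depends on the pandas version (the newer stack implementation may keep them), so a slightly more defensive sketch adds an explicit dropna, which should be harmless where stack already drops them. The three-row frame is made up from the sample rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ("", "Michael", "", "Hobart", "Jr."),
        ("Mr.", "Alan", "", "Lilt", ""),
        ("", "Boris", "Al", "Bert", "Esq."),
    ],
    columns=["prefix", "firstname", "middlename", "lastname", "suffix"],
)

fullname = (
    df.assign(lastname=df["lastname"] + ",")
    .replace("", np.nan)   # blanks become missing values
    .stack()               # one long Series indexed by (row, column)
    .dropna()              # explicit, in case stack() kept the NaN entries
    .groupby(level=0)      # regroup by original row label
    .apply(" ".join)
    .str.strip(",")        # drop the comma when there was no suffix
)
print(fullname.tolist())
```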
Timing
Bundling everything in the comprehension is fastest!
%timeit df.assign(fullname=df.replace('', np.nan).stack().groupby(level=0).apply(' '.join))
%timeit df.assign(fullname=df.apply(' '.join, axis=1).str.replace(r'\s+', ' ', regex=True).str.strip())
%timeit df.assign(fullname=[re.sub(r'\s+', ' ', ' '.join(s)).strip() for s in df.values.tolist()])
100 loops, best of 3: 2.51 ms per loop
1000 loops, best of 3: 979 µs per loop
1000 loops, best of 3: 384 µs per loop
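The %timeit lines only work in IPython or Jupyter. Outside of those, a comparable measurement could be sketched with the standard timeit module as below. The function names are made up for illustration, the frame is a small stand-in, and the numbers will differ by machine and pandas version, so no output is shown.

```python
import re
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ("", "Michael", "", "Hobart", "Jr."),
        ("Mr.", "Alan", "", "Lilt", ""),
        ("", "Boris", "Al", "Bert", "Esq."),
    ],
    columns=["prefix", "firstname", "middlename", "lastname", "suffix"],
)

def with_stack():
    # Option 3 without the comma handling, as in the timing above.
    stacked = df.replace("", np.nan).stack().dropna()
    return stacked.groupby(level=0).apply(" ".join).tolist()

def with_str_accessor():
    # Option 1 without the comma handling.
    joined = df.apply(" ".join, axis=1)
    return joined.str.replace(r"\s+", " ", regex=True).str.strip().tolist()

def with_comprehension():
    # Option 2 without the comma handling.
    return [re.sub(r"\s+", " ", " ".join(s)).strip() for s in df.values.tolist()]

RUNS = 100
for fn in (with_stack, with_str_accessor, with_comprehension):
    seconds = timeit.timeit(fn, number=RUNS)
    print(f"{fn.__name__}: {seconds / RUNS * 1e6:.0f} µs per call")
```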