A better/faster way to handle people's names in Pandas columns?

Time: 2017-07-07 22:31:20

Tags: python pandas

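I'm working with a large amount of data that includes the standard five columns for a person's name (prefix, first name, middle name, last name, suffix), and I want to combine them into a single column as a readable name. The problem I'm running into is handling blank values, which cause spacing issues. Also, I can't modify the original columns. My current process feels a bit crazy (but it works!), so I'm looking for a more elegant solution.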

My current code:

def add_space_prefix(x):
    x = str(x)
    if len(x) > 0:
        return x + ' '
    else:
        return x


def add_space_middle(x):
    x = str(x)
    if len(x) > 0:
        return ' ' + x
    else:
        return x


def add_space_suffix(x):
    x = str(x)
    if len(x) > 0:
        return ', ' + x
    else:
        return x

df["middlename"] = df["middlename"].map(lambda x: add_space_middle(x))
df["prefix"] = df["prefix"].map(lambda x: add_space_prefix(x))
df["suffix"] = df["suffix"].map(lambda x: add_space_suffix(x))
df['fullname'] = df["prefix"] + df["firstname"] + df[
        "middlename"] + ' ' + df["lastname"] + df['suffix']

Sample dataframe

    prefix  firstname   middlename  lastname    suffix  fullname
0           Michael                 Hobart      Jr.     Michael Hobart, Jr.
1   Mr.     Alan                    Lilt                Mr. Alan Lilt
2           Jon         A.          Smith       III     Jon A. Smith, III
3           Joe                     Miller              Joe Miller
4           Mika        Jennifer    Shabosky            Mika Jennifer Shabosky
5   Mrs.    Angela                  Calder              Mrs. Angela Calder
6           Boris       Al          Bert        Esq.    Boris Al Bert, Esq.
7   Dr.     Natasha                 Chorus              Dr. Natasha Chorus
8           Bill                    Gibbons             Bill Gibbons
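
To make the question reproducible, the sample frame can be rebuilt roughly as follows. This is only a sketch: it assumes the blank cells are empty strings rather than NaN, which the original post does not state explicitly.

```python
import pandas as pd

# Assumption: blanks are empty strings (""), not NaN.
df = pd.DataFrame({
    "prefix": ["", "Mr.", "", "", "", "Mrs.", "", "Dr.", ""],
    "firstname": ["Michael", "Alan", "Jon", "Joe", "Mika",
                  "Angela", "Boris", "Natasha", "Bill"],
    "middlename": ["", "", "A.", "", "Jennifer", "", "Al", "", ""],
    "lastname": ["Hobart", "Lilt", "Smith", "Miller", "Shabosky",
                 "Calder", "Bert", "Chorus", "Gibbons"],
    "suffix": ["Jr.", "", "III", "", "", "", "Esq.", "", ""],
})
print(df)
```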

1 answer:

Answer 0: (score: 4)

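Option 1
' '.join, pd.Series.str
In this solution, we join each entire row with spaces. This can leave whitespace at the start or end of the string, or two or more spaces in the middle. We handle that by chaining string accessor methods.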

df.assign(
    lastname=df.lastname + ','
).apply(' '.join, axis=1).str.replace(r'\s+', ' ', regex=True).str.strip(' ,')

0       Michael Hobart, Jr.
1             Mr. Alan Lilt
2         Jon A. Smith, III
3                Joe Miller
4    Mika Jennifer Shabosky
5        Mrs. Angela Calder
6       Boris Al Bert, Esq.
7        Dr. Natasha Chorus
8              Bill Gibbons
dtype: object

df['fullname'] = df.assign(
    lastname=df.lastname + ','
).apply(' '.join, axis=1).str.replace(r'\s+', ' ', regex=True).str.strip(' ,')
df

  prefix firstname middlename  lastname suffix                fullname
0          Michael               Hobart    Jr.     Michael Hobart, Jr.
1    Mr.      Alan                 Lilt                  Mr. Alan Lilt
2              Jon         A.     Smith    III       Jon A. Smith, III
3              Joe               Miller                     Joe Miller
4             Mika   Jennifer  Shabosky         Mika Jennifer Shabosky
5   Mrs.    Angela               Calder             Mrs. Angela Calder
6            Boris         Al      Bert   Esq.     Boris Al Bert, Esq.
7    Dr.   Natasha               Chorus             Dr. Natasha Chorus
8             Bill              Gibbons                   Bill Gibbons

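Option 2
List comprehension
In this solution, we perform the same operations as in the first one, but we bundle the string operations together inside a comprehension.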

import re

[re.sub(r'\s+', ' ', ' '.join(s)).strip(' ,')
 for s in df.assign(lastname=df.lastname + ',').values.tolist()]

['Michael Hobart, Jr.',
 'Mr. Alan Lilt',
 'Jon A. Smith, III',
 'Joe Miller',
 'Mika Jennifer Shabosky',
 'Mrs. Angela Calder',
 'Boris Al Bert, Esq.',
 'Dr. Natasha Chorus',
 'Bill Gibbons']

df['fullname'] = [re.sub(r'\s+', ' ', ' '.join(s)).strip(' ,')
                  for s in df.assign(lastname=df.lastname + ',').values.tolist()]
df

  prefix firstname middlename  lastname suffix                fullname
0          Michael               Hobart    Jr.     Michael Hobart, Jr.
1    Mr.      Alan                 Lilt                  Mr. Alan Lilt
2              Jon         A.     Smith    III       Jon A. Smith, III
3              Joe               Miller                     Joe Miller
4             Mika   Jennifer  Shabosky         Mika Jennifer Shabosky
5   Mrs.    Angela               Calder             Mrs. Angela Calder
6            Boris         Al      Bert   Esq.     Boris Al Bert, Esq.
7    Dr.   Natasha               Chorus             Dr. Natasha Chorus
8             Bill              Gibbons                   Bill Gibbons
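
Since the question asks not to modify the original columns, the comprehension could be wrapped in a small helper that works on a copy. The sketch below is illustrative only: `build_fullname` and `NAME_COLS` are hypothetical names, and the `fillna("")` step is an added assumption for data where blanks arrive as NaN/None rather than empty strings.

```python
import re

import pandas as pd

# Hypothetical constant: the five name columns in display order.
NAME_COLS = ["prefix", "firstname", "middlename", "lastname", "suffix"]


def build_fullname(df, cols=NAME_COLS):
    """Return a Series of readable names; df itself is left untouched."""
    # Assumption: missing values may be NaN/None, so normalise them to "".
    parts = df[cols].fillna("").astype(str)
    # Append the comma to the last name so a suffix reads "Last, Suffix".
    parts = parts.assign(lastname=parts["lastname"] + ",")
    names = [
        re.sub(r"\s+", " ", " ".join(row)).strip(" ,")
        for row in parts.values.tolist()
    ]
    return pd.Series(names, index=df.index, name="fullname")


sample = pd.DataFrame({
    "prefix": ["", "Mr.", None],
    "firstname": ["Michael", "Alan", "Jon"],
    "middlename": ["", None, "A."],
    "lastname": ["Hobart", "Lilt", "Smith"],
    "suffix": ["Jr.", "", "III"],
})
fullname = build_fullname(sample)
print(fullname)
```

Because the helper returns a new Series, it can be attached with `sample.assign(fullname=fullname)` or kept separate, and the five source columns keep their original values.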

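Option 3
pd.DataFrame.replace, pd.DataFrame.stack
This one is a bit different, because we replace the blanks '' with np.nan so that when we stack, the np.nan values are naturally dropped. This makes the ' ' join more straightforward.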

import numpy as np

df.assign(
    lastname=df.lastname + ','
).replace('', np.nan).stack().groupby(level=0).apply(' '.join).str.strip(',')

0       Michael Hobart, Jr.
1             Mr. Alan Lilt
2         Jon A. Smith, III
3                Joe Miller
4    Mika Jennifer Shabosky
5        Mrs. Angela Calder
6       Boris Al Bert, Esq.
7        Dr. Natasha Chorus
8              Bill Gibbons
dtype: object

df['fullname'] = df.assign(
    lastname=df.lastname + ','
).replace('', np.nan).stack().groupby(level=0).apply(' '.join).str.strip(',')
df

  prefix firstname middlename  lastname suffix                fullname
0          Michael               Hobart    Jr.     Michael Hobart, Jr.
1    Mr.      Alan                 Lilt                  Mr. Alan Lilt
2              Jon         A.     Smith    III       Jon A. Smith, III
3              Joe               Miller                     Joe Miller
4             Mika   Jennifer  Shabosky         Mika Jennifer Shabosky
5   Mrs.    Angela               Calder             Mrs. Angela Calder
6            Boris         Al      Bert   Esq.     Boris Al Bert, Esq.
7    Dr.   Natasha               Chorus             Dr. Natasha Chorus
8             Bill              Gibbons                   Bill Gibbons

Timing
Bundling in a comprehension is the fastest!

%timeit df.assign(fullname=df.replace('', np.nan).stack().groupby(level=0).apply(' '.join))
%timeit df.assign(fullname=df.apply(' '.join, axis=1).str.replace(r'\s+', ' ', regex=True).str.strip())
%timeit df.assign(fullname=[re.sub(r'\s+', ' ', ' '.join(s)).strip() for s in df.values.tolist()])

100 loops, best of 3: 2.51 ms per loop
1000 loops, best of 3: 979 µs per loop
1000 loops, best of 3: 384 µs per loop
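
Outside IPython, the three options can be checked against each other and timed with the standard `timeit` module. This is a rough sketch on a three-row frame, not a benchmark; the function names `option1`, `option2`, and `option3` are hypothetical. The explicit `.dropna()` after `stack()` is an added assumption, in case a newer pandas version keeps NaN values when stacking.

```python
import re
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "prefix": ["", "Mr.", ""],
    "firstname": ["Michael", "Alan", "Jon"],
    "middlename": ["", "", "A."],
    "lastname": ["Hobart", "Lilt", "Smith"],
    "suffix": ["Jr.", "", "III"],
})
# Work on a copy with the comma appended; df itself is unchanged.
with_comma = df.assign(lastname=df.lastname + ",")


def option1(frame):
    # Join each row, collapse runs of whitespace, trim spaces and commas.
    joined = frame.apply(" ".join, axis=1)
    return joined.str.replace(r"\s+", " ", regex=True).str.strip(" ,")


def option2(frame):
    # Same steps, bundled into a list comprehension.
    return [re.sub(r"\s+", " ", " ".join(row)).strip(" ,")
            for row in frame.values.tolist()]


def option3(frame):
    # Blanks become NaN and are dropped before joining per original row.
    stacked = frame.replace("", np.nan).stack().dropna()
    return stacked.groupby(level=0).apply(" ".join).str.strip(",")


results = [list(option1(with_comma)), option2(with_comma),
           list(option3(with_comma))]
assert results[0] == results[1] == results[2]

for fn in (option1, option2, option3):
    seconds = timeit.timeit(lambda: fn(with_comma), number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1e6:.0f} µs per call")
```

Absolute numbers will vary with the machine, the pandas version, and the frame size, so only the relative ordering is worth reading from the output.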