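Adding a function to a string split command in Pandas

Time: 2017-05-27 03:17:10

Tags: python pandas

I have a data frame with 20 or so columns. One of the columns is called 'director_name' and holds values such as 'John Doe' or 'Jane Doe'. I want to split it into 2 columns, 'First_Name' and 'Last_Name'. When I run the following, it works as expected and splits the string into 2 columns: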

data[['First_Name', 'Last_Name']] = data.director_name.str.split(' ', expand=True)
data

First_Name    Last_Name
John          Doe

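That works fine, but it fails when there are NULL (NaN) values under 'director_name'. It raises the following error:

'Columns must be same length as key'

I'd like to add a function that checks whether the value is != null and then runs the command listed above, and otherwise enters 'NA' for 'First_Name' and 'Last_Name'.

Any ideas how I would go about doing that?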

EDIT:

I just checked the file and I'm not sure NULL is the problem. I have some names that are 3-4 strings long, i.e.

John Allen Doe
John Allen Doe Jr

So maybe I can't split those into just First_Name and Last_Name.
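The suspicion in the edit can be checked with a small sketch. The sample names below are made up for illustration. With expand=True, str.split returns one column per token of the longest name, so a 3-4 word name yields more than two columns, and assigning that to two column names is what raises 'Columns must be same length as key'. A NaN row on its own just produces a row of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing value and one four-word name
names = pd.Series(["John Doe", np.nan, "John Allen Doe Jr"])

# expand=True produces one column per token of the longest name
parts = names.str.split(" ", expand=True)
n_columns = parts.shape[1]  # wider than the 2 target columns

# With only two-word names, a NaN row still gives exactly two columns
two_word = pd.Series(["John Doe", np.nan, "Jane Doe"])
two_word_parts = two_word.str.split(" ", expand=True)
```

So the missing values are probably not the cause here; the longer names are.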

Hmmmm

4 answers:

Answer 0 (score: 7)

One way is to split and pick, say, the first two values as the first and last name:

    Id  name
0   1   James Cameron
1   2   Martin Sheen
2   3   John Allen Doe
3   4   NaN


df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]

You get

    Id  name            First_Name  Last_Name
0   1   James Cameron   James       Cameron
1   2   Martin Sheen    Martin      Sheen
2   3   John Allen Doe  John        Allen
3   4   NaN             NaN         None
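The question also asked for 'NA' in place of missing names. A minimal sketch of that, assuming the same sample frame as above and using fillna on each selected column (the 'NA' string is the asker's placeholder, not a pandas convention):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "name": ["James Cameron", "Martin Sheen", "John Allen Doe", np.nan],
})

# Split once, then pick the first two token columns
parts = df["name"].str.split(" ", expand=True)

# fillna replaces both NaN and None with the placeholder string
df["First_Name"] = parts[0].fillna("NA")
df["Last_Name"] = parts[1].fillna("NA")
```

Note that parts[1] only exists if at least one name contains a space; a column of single-word names would need an extra check.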

Answer 1 (score: 2)

Use str.split (with no arguments, because the default separator is whitespace) and indexing with str to select list elements by position:

print (df.name.str.split())
0      [James, Cameron]
1       [Martin, Sheen]
2    [John, Allen, Doe]
3                   NaN
Name: name, dtype: object

df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split().str[1]

# data borrowed from A-Za-z's answer
print (df)
   Id            name First_Name Last_Name
0   1   James Cameron      James   Cameron
1   2    Martin Sheen     Martin     Sheen
2   3  John Allen Doe       John     Allen
3   4             NaN        NaN       NaN

You can also use the parameter n to limit the number of splits, so that Last_Name holds everything after the first name:

df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split(n=1).str[1]
print (df)
   Id            name First_Name  Last_Name
0   1   James Cameron      James    Cameron
1   2    Martin Sheen     Martin      Sheen
2   3  John Allen Doe       John  Allen Doe
3   4             NaN        NaN        NaN

Solution with str.rsplit, which splits from the right:
df['First_Name'] = df.name.str.rsplit(n=1).str[0]
df['Last_Name'] = df.name.str.rsplit().str[-1]
print (df)
   Id            name  First_Name Last_Name
0   1   James Cameron       James   Cameron
1   2    Martin Sheen      Martin     Sheen
2   3  John Allen Doe  John Allen       Doe
3   4             NaN         NaN       NaN
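Limiting the number of splits also makes the two-column assignment from the question usable again. The following is a sketch rather than part of the original answer: with n=1 and expand=True, rsplit returns at most two columns, so it can be assigned to both target columns in one step.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["James Cameron", "Martin Sheen", "John Allen Doe", np.nan],
})

# Split at most once, starting from the right:
# column 0 is everything before the last space, column 1 is the last word
df[["First_Name", "Last_Name"]] = df["name"].str.rsplit(n=1, expand=True)
```

This assumes at least one name has two or more words; if every name were a single word, the result would have only one column and the assignment would fail again.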

Answer 2 (score: 1)

This should fix your problem.

Setup

data= pd.DataFrame({'director_name': {0: 'John Doe', 1: np.nan, 2: 'Alan Smith'}})

data
Out[457]: 
  director_name
0      John Doe
1           NaN
2    Alan Smith

Solution

#use a lambda function to check nan before splitting the column.
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()), axis=1)

data
Out[446]: 
  director_name First_Name Last_Name
0      John Doe       John       Doe
1           NaN        NaN       NaN
2    Alan Smith       Alan     Smith

If you only need the first two names, you can do this:

data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()).iloc[:2], axis=1)
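The same check can be written as a named function, which is closer to what the question literally asks for. This is only a sketch: split_name and its placeholder argument are hypothetical names, and taking the last token as the last name is an assumption about how longer names should be handled.

```python
import numpy as np
import pandas as pd

def split_name(value, placeholder="NA"):
    """Return (first, last) for a name, or placeholders when it is missing."""
    if pd.isnull(value):
        return placeholder, placeholder
    tokens = value.split()
    if not tokens:
        return placeholder, placeholder
    first = tokens[0]
    # Assumption: the last token is the last name; single-word names get the placeholder
    last = tokens[-1] if len(tokens) > 1 else placeholder
    return first, last

data = pd.DataFrame({"director_name": ["John Doe", np.nan, "John Allen Doe Jr"]})

# Apply the function to every value and build two aligned columns from the pairs
pairs = data["director_name"].map(split_name).tolist()
names = pd.DataFrame(pairs, index=data.index, columns=["First_Name", "Last_Name"])
data = data.join(names)
```

Because the function always returns exactly two values, the result has two columns regardless of how many words a name contains.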

Answer 3 (score: 1)

df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]

This should