Python pandas: split a string column into two columns on whitespace

Time: 2017-03-27 19:00:42

Tags: python pandas replace extract

I have a Python pandas DataFrame df that contains the following column "title":

title
This is the first title XY2547
This is the second title WWW48921
This is the third title  A2438999
This is another title 123 

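I need to split this column into two parts: the actual title and an irregular code. Is there a way to split it on the last word after whitespace? Note that the last title has no code; 123 is part of the title.

Desired result DataFrame: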

title                             |  cleaned title            | code
This is the first title XY2547       This is the first title    XY2547
This is the second title WWW48921    This is the second title   WWW48921
This is the third title  A2438999    This is the third title    A2438999
This is another title 123            This is another title 123

I was thinking of something like:

df['code'] = df.title.str.extract(r'_\s(\w)', expand=False)

This doesn't work.

Thanks

2 answers:

Answer 0 (score: 3)

Try this:

In [62]: df
Out[62]:
                               title
0     This is the first title XY2547
1  This is the second title WWW48921
2  This is the third title  A2438999
3         This is another title 123

In [63]: df[['cleaned_title', 'code']] = \
    ...:     df.title.str.extract(r'(.*?)\s+([A-Z]{1,}\d{3,})?$', expand=True)

In [64]: df
Out[64]:
                               title              cleaned_title      code
0     This is the first title XY2547    This is the first title    XY2547
1  This is the second title WWW48921   This is the second title  WWW48921
2  This is the third title  A2438999    This is the third title  A2438999
3         This is another title 123   This is another title 123       NaN
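To see why this pattern separates the rows the way it does, the sketch below runs it through Python's standard `re` module (pandas' `str.extract` returns the capture groups of the first match in each row). The helper name `split_title` is made up for illustration. One thing worth noticing: `\s+` is mandatory, so a title without a code only matches if it ends in whitespace, which the last row in the question appears to do.

```python
import re

# Pattern from the answer above: a lazy title, mandatory whitespace,
# then an optional code made of capital letters followed by 3+ digits.
PATTERN = re.compile(r'(.*?)\s+([A-Z]{1,}\d{3,})?$')


def split_title(title):
    """Hypothetical helper: return (cleaned_title, code), or (None, None) if no match."""
    match = PATTERN.search(title)
    if match is None:
        return (None, None)
    return match.groups()


print(split_title('This is the first title XY2547'))
print(split_title('This is the third title  A2438999'))
# Trailing space: the row matches and the optional code group stays empty (None).
print(split_title('This is another title 123 '))
# No trailing space and no code: \s+ has nothing to consume, so nothing matches.
print(split_title('This is another title 123'))

# The pattern from the question needs a literal underscore, which no title contains.
print(re.search(r'_\s(\w)', 'This is the first title XY2547'))
```

If the titles may lack trailing whitespace, relaxing `\s+` to `\s*` is one way to keep such rows from turning into NaN in both columns.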

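Answer 1 (score: 1)

Solution #1

str.rsplit can be used here. It splits starting from the right-hand end of the string, at most n times.

We can then use join to attach the result to df:
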
df.join(
    df.title.str.rsplit(n=1, expand=True).rename(
        columns={0: 'cleaned title', 1: 'code'}
    )
)

                               title             cleaned title      code
0     This is the first title XY2547   This is the first title    XY2547
1  This is the second title WWW48921  This is the second title  WWW48921
2  This is the third title  A2438999   This is the third title  A2438999
3         This is another title 123      This is another title       123
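Because rsplit is purely positional, the 123 in the last row ends up in the code column. One possible follow-up, offered as an assumption rather than as part of this answer, is to keep the split only where the last token looks like a code. The rule used here (capital letters followed by at least three digits) is borrowed from the regex in the other answer, and the names `parts` and `is_code` are illustrative.

```python
import pandas as pd

# Reconstruction of the question's data (an assumption about how df was built).
df = pd.DataFrame({'title': [
    'This is the first title XY2547',
    'This is the second title WWW48921',
    'This is the third title  A2438999',
    'This is another title 123',
]})

# Split once from the right on whitespace: column 0 is the head, column 1 the last token.
parts = df['title'].str.rsplit(n=1, expand=True)

# Assumed rule: a code is capital letters followed by three or more digits.
is_code = parts[1].str.fullmatch(r'[A-Z]+\d{3,}', na=False)

# Keep the split where the last token is a code; otherwise keep the whole title.
df['cleaned_title'] = parts[0].where(is_code, df['title'].str.strip())
df['code'] = parts[1].where(is_code)

print(df)
```

With this rule the last row keeps its full title and gets a missing value in the code column.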

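Solution #2

To avoid interpreting 123 as a code, you have to apply some additional logic that the question does not provide. @MaxU was generous enough to embed his own logic in his regex.

My regex solution looks like this. Plan:

  • Use '?P<name>' to name the resulting columns
  • Match only capital letters and digits with '[A-Z0-9]'
  • Make sure there are 4 or more of them with '{4,}'
  • Match from the beginning '^' to the end '$'
  • Make '.*' non-greedy by writing '.*?'

regex = r'^(?P<cleaned_title>.*?)\s*(?P<code>[A-Z0-9]{4,})?$'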
df.join(df.title.str.extract(regex, expand=True))

                               title              cleaned_title      code
0     This is the first title XY2547    This is the first title    XY2547
1  This is the second title WWW48921   This is the second title  WWW48921
2  This is the third title  A2438999    This is the third title  A2438999
3          This is another title 123  This is another title 123       NaN
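For anyone who wants to run this end to end, here is a self-contained sketch. The DataFrame construction is a reconstruction of the question's data, and the row 'Annual report 2017' is an invented example that shows a limitation: '[A-Z0-9]{4,}' also accepts a final token made only of digits (or only of capital letters), so such a word is taken as a code.

```python
import pandas as pd

df = pd.DataFrame({'title': [
    'This is the first title XY2547',
    'This is the second title WWW48921',
    'This is the third title  A2438999',
    'This is another title 123',
    'Annual report 2017',  # invented row, not part of the question
]})

# Named groups become the column names of the extracted DataFrame.
regex = r'^(?P<cleaned_title>.*?)\s*(?P<code>[A-Z0-9]{4,})?$'
result = df.join(df['title'].str.extract(regex, expand=True))

print(result)
```

If real titles can end in a year or an all-caps word, a stricter code pattern such as the letters-then-digits rule from the other answer may be the safer choice.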