I have a Python pandas DataFrame df that contains the following column "title":
title
This is the first title XY2547
This is the second title WWW48921
This is the third title A2438999
This is another title 123
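I need to split this column into two parts: the actual title and the irregular code. Is there a way to split it on the last word after a space? Note that the last title has no code; 123 is part of the title.
Target DataFrame: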
title | cleaned title | code
This is the first title XY2547 | This is the first title | XY2547
This is the second title WWW48921 | This is the second title | WWW48921
This is the third title A2438999 | This is the third title | A2438999
This is another title 123 | This is another title 123 |
I was thinking of something like
df['code'] = df.title.str.extract(r'_\s(\w)', expand=False)
but this doesn't work.
Thanks.
Answer 0 (score: 3)
Try this:
In [62]: df
Out[62]:
title
0 This is the first title XY2547
1 This is the second title WWW48921
2 This is the third title A2438999
3 This is another title 123
In [63]: df[['cleaned_title', 'code']] = \
    ...: df.title.str.extract(r'(.*?)\s*([A-Z]+\d{3,})?$', expand=True)
In [64]: df
Out[64]:
                               title              cleaned_title      code
0     This is the first title XY2547    This is the first title    XY2547
1  This is the second title WWW48921   This is the second title  WWW48921
2   This is the third title A2438999    This is the third title  A2438999
3          This is another title 123  This is another title 123       NaN
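For readers who want to run this outside IPython, here is a minimal self-contained sketch of the same idea. It assumes a code is one or more capital letters followed by at least three digits (the shape used in the pattern above), and it treats the whitespace before the code as optional so that a title without a code still yields a cleaned title. The names `pattern` and `extracted` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "title": [
        "This is the first title XY2547",
        "This is the second title WWW48921",
        "This is the third title A2438999",
        "This is another title 123",
    ]
})

# Assumed code shape: capital letters followed by 3+ digits.
# Group 1 is lazy, so it stops as soon as the rest of the string
# can be read as "optional whitespace + optional code + end".
pattern = r"^(.*?)\s*([A-Z]+\d{3,})?$"

# With expand=True the result is a DataFrame with one column per group.
extracted = df["title"].str.extract(pattern, expand=True)
df["cleaned_title"] = extracted[0]
df["code"] = extracted[1]

print(df)
```

Rows whose last word does not fit the assumed code shape keep their full title and get a missing value in `code`.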
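Answer 1 (score: 1)
#1
str.rsplit can be used here. It splits a string up to n times, starting from the right side of the string.
We can then use join to attach the result to df.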
df.join(
df.title.str.rsplit(n=1, expand=True).rename(
columns={0: 'cleaned title', 1: 'code'}
)
)
                               title             cleaned title      code
0     This is the first title XY2547   This is the first title    XY2547
1  This is the second title WWW48921  This is the second title  WWW48921
2   This is the third title A2438999   This is the third title  A2438999
3          This is another title 123     This is another title       123
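As the last row shows, rsplit alone treats 123 as a code. One possible way to guard against that, sketched here under an assumed rule, is to accept the last word as a code only when it fully matches a pattern. The rule used below (capital letters followed by at least three digits) and the name `looks_like_code` are assumptions made for illustration, not something the question specifies.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "title": [
        "This is the first title XY2547",
        "This is the second title WWW48921",
        "This is the third title A2438999",
        "This is another title 123",
    ]
})

# Split once from the right: column 0 is everything before the
# last run of whitespace, column 1 is the last word.
parts = df["title"].str.rsplit(n=1, expand=True)

# Hypothetical rule for what counts as a code.
# na=False makes titles without any whitespace count as "no code".
looks_like_code = parts[1].str.fullmatch(r"[A-Z]+\d{3,}", na=False)

# Keep the split only where the last word passed the check;
# otherwise fall back to the full title and leave the code missing.
df["cleaned title"] = parts[0].where(looks_like_code, df["title"])
df["code"] = parts[1].where(looks_like_code)

print(df)
```

Keeping the split and the check separate makes the rule easy to swap out if the real codes follow a different shape.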
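#2
To avoid interpreting 123 as a code, you have to apply some additional logic that the question did not provide. @MaxU was generous enough to embed his logic in the regex.
My regex solution looks like this.
Plan
'?P<name>' names the resulting columns
'[A-Z0-9]' matches only capital letters and digits
'{4,}' requires at least 4 of them
'^' anchors the start and '$' the end
'.*' matches anything
'.*?' matches anything but is not greedy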
regex = r'^(?P<cleaned_title>.*?)\s*(?P<code>[A-Z0-9]{4,})?$'
df.join(df.title.str.extract(regex, expand=True))
                               title              cleaned_title      code
0     This is the first title XY2547    This is the first title    XY2547
1  This is the second title WWW48921   This is the second title  WWW48921
2   This is the third title A2438999    This is the third title  A2438999
3          This is another title 123  This is another title 123       NaN
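To see what each named group captures without pandas, the same pattern can be tried with the standard re module. This is only a sketch for inspecting the regex; the variable names and the two extra test strings at the end are made up for illustration.

```python
import re

# Same pattern as above; the named groups become the keys of groupdict().
pattern = re.compile(r"^(?P<cleaned_title>.*?)\s*(?P<code>[A-Z0-9]{4,})?$")

titles = [
    "This is the first title XY2547",
    "This is the second title WWW48921",
    "This is the third title A2438999",
    "This is another title 123",
]

# The pattern can match the empty string, so match() never returns None here.
results = [pattern.match(t).groupdict() for t in titles]
for r in results:
    print(r)

# Hypothetical edge cases: any trailing word made of 4+ capitals or
# digits is read as a code, which may or may not be what you want.
edge_word = pattern.match("Annual report NASA").groupdict()
edge_year = pattern.match("Annual report 2024").groupdict()
print(edge_word, edge_year)
```

The last two lines show the limit of this rule: an all-caps word or a four-digit number at the end of a title is also captured as a code, so the pattern may need tightening for real data.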