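Pythonic way to split a sentence at a word that starts with a capital letter

Time: 2014-04-16 21:40:59

Tags: python string split

I have several sentences in UTF, and I would like to split each one at the first capitalized word (ignoring the capital that starts the sentence).

Examples: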

"Tough Fox" -> "Tough", "Fox"

"Nice White Cat" -> "Nice", "White Cat"

"This is a lazy Dog" -> "This is a lazy", "Dog"

"This is hardworking Little Ant" -> "This is hardworking", "Little Ant"

What is a Pythonic way to do this split?

4 answers:

Answer 0 (score: 3)

I would use re:

>>> import re
>>> l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]
>>> for i in l:
...   print(re.findall("[A-Z][^A-Z]*", i))
... 
['Tough ', 'Fox']
['Nice ', 'White ', 'Cat']
['This is a lazy ', 'Dog']

Edit: OK, I think that was a mistake. I'm a bit late now; re.split(..., s, maxsplit=1) is the best way, but you can still do it without maxsplit:

>>> for i in l:
...   print(re.findall("^[^ ]*|[A-Z].*", i))
... 
['Tough', 'Fox']
['Nice', 'White Cat']
['This', 'Dog']
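
Note that the last result above drops "is a lazy". As a minimal sketch of another way to avoid maxsplit while keeping every word, assuming Python 3 and the same ASCII-only [A-Z] rule, one can locate the first whitespace that precedes a capital with re.search and slice around it. The names BOUNDARY and split_at_capital are made up for illustration.

```python
import re

# Whitespace run that is immediately followed by an ASCII capital letter.
BOUNDARY = re.compile(r"\s+(?=[A-Z])")

def split_at_capital(sentence):
    """Split once at the first whitespace that precedes a capitalized word."""
    match = BOUNDARY.search(sentence)
    if match is None:
        # No capitalized word after the first one: nothing to split.
        return [sentence]
    # Keep everything before the whitespace and everything after it.
    return [sentence[:match.start()], sentence[match.end():]]

for s in ["Tough Fox", "Nice White Cat", "This is a lazy Dog"]:
    print(split_at_capital(s))
```

Slicing at the first match means the rest of the sentence is never scanned for further split points.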

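Answer 1 (score: 3)

If you want to split the string at whitespace that is followed by a capital letter (only at the first such place, given maxsplit=1):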

import re

s = "Tough Fox"
re.split(r"\s(?=[A-Z])", s, maxsplit=1)

['Tough', 'Fox']

The re.split function works like Python's built-in str.split, but lets you use a regular expression as the split pattern.
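
A small comparison of the two calls, assuming Python 3 and using sentences from the question. str.split takes a literal separator, while re.split takes a pattern:

```python
import re

s = "Nice White Cat"

# str.split: literal separator, here a single space.
by_str = s.split(" ")
print(by_str)        # ['Nice', 'White', 'Cat']

# re.split with a plain whitespace pattern gives the same pieces...
by_re = re.split(r"\s", s)
print(by_re)         # ['Nice', 'White', 'Cat']

# ...but the pattern can also carry a condition such as
# "whitespace followed by a capital letter".
conditional = re.split(r"\s(?=[A-Z])", "This is a lazy Dog")
print(conditional)   # ['This is a lazy', 'Dog']
```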

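The regular expression first looks for a whitespace character (\s) as the split point. The matched whitespace is consumed by the re.split operation, so it does not appear in the result.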

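The (?=...) part is a lookahead assertion. The next character in the string must match this assertion (in this case any uppercase letter, [A-Z]). However, this part is not considered part of the match, so it is not consumed by the re.split operation.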
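
To see what "not consumed" means in practice, here is a short sketch (Python 3 assumed) that runs the same split with and without the lookahead:

```python
import re

s = "Tough Fox"

# Without lookahead the capital letter is part of the match and is eaten.
consumed = re.split(r"\s[A-Z]", s)
print(consumed)      # ['Tough', 'ox']

# With lookahead only the whitespace is matched, so "F" survives.
kept = re.split(r"\s(?=[A-Z])", s)
print(kept)          # ['Tough', 'Fox']
```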

maxsplit=1 ensures that at most one split happens (so the result has at most two items).
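
The effect of maxsplit is easiest to see on a sentence with two capitalized words after the first one. A quick sketch, assuming Python 3:

```python
import re

s = "This is hardworking Little Ant"
pattern = r"\s(?=[A-Z])"

# Without a limit, every whitespace before a capital is a split point.
every_split = re.split(pattern, s)
print(every_split)   # ['This is hardworking', 'Little', 'Ant']

# With maxsplit=1 only the first split point is used.
first_only = re.split(pattern, s, maxsplit=1)
print(first_only)    # ['This is hardworking', 'Little Ant']
```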

Answer 2 (score: 1)

Maybe something like this:

In [1]: import re

In [2]: def split(s):
   ...:     return re.split(r'\W(?=[A-Z])', s, maxsplit=1)
   ...:

In [3]: l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]

In [4]: for s in l:
   ...:     print(split(s))
   ...:
['Tough', 'Fox']
['Nice', 'White Cat']
['This is a lazy', 'Dog']
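
One thing to be aware of: \W matches any non-word character, not only whitespace, so punctuation before a capital also counts as a split point. A short sketch (Python 3 assumed; the hyphenated sentence is a made-up input, not one of the question's examples):

```python
import re

s = "Well-Known Fact"  # hypothetical input with a hyphen before a capital

# \W treats the hyphen as a split point because it is a non-word character.
non_word = re.split(r"\W(?=[A-Z])", s, maxsplit=1)
print(non_word)      # ['Well', 'Known Fact']

# \s only splits at whitespace, so the hyphenated word stays together.
whitespace = re.split(r"\s(?=[A-Z])", s, maxsplit=1)
print(whitespace)    # ['Well-Known', 'Fact']
```

Which behavior is wanted depends on the data; for the question's examples both patterns give the same results.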

Answer 3 (score: 1)

Use re.split() with a limit:

space_split = re.compile(r'\s+(?=[A-Z])')
result = space_split.split(inputstring, 1)

Demo:

>>> import re
>>> space_split = re.compile(r'\s+(?=[A-Z])')
>>> l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]
>>> for i in l:
...     print(space_split.split(i, 1))
... 
['Tough', 'Fox']
['Nice', 'White Cat']
['This is a lazy', 'Dog']
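
Since the question mentions UTF text, note that [A-Z] only covers ASCII capitals. A minimal sketch of a regex-free variant, assuming Python 3 and that non-ASCII capitals such as "Ü" should also count; it relies on str.isupper, which is Unicode-aware. The function name split_at_capital_word and the German sentence are made up for illustration.

```python
def split_at_capital_word(sentence):
    """Split before the first word, after the opening one, that starts uppercase."""
    # str.split() with no argument splits on any run of whitespace.
    words = sentence.split()
    for i in range(1, len(words)):
        if words[i][0].isupper():
            # Rejoin with single spaces; original spacing is not preserved.
            return [" ".join(words[:i]), " ".join(words[i:])]
    # No capitalized word found after the first: return the input unchanged.
    return [sentence]

print(split_at_capital_word("Nice White Cat"))
print(split_at_capital_word("Das ist eine Überraschung"))
```

The trade-off is that runs of whitespace are normalized to single spaces, which the regex versions do not do.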