Getting rid of some entities using a regular expression in Python

Time: 2015-07-26 09:39:45

Tags: python regex

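I am new to regex. Given the following phrase, I want to get rid of the I's and of the extra fields that appear because I am using two regex operations.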

text= "I have a problem in Regex, How do I get rid of the Capital I's provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine "

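For example, I want to keep "International Business Machine" as "International Business Machine", but "Capital I" should not stay as "Capital I"; it should become just "Capital".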

I used the following regex:

re.findall('([A-Z][\w\']*(?:\s+[A-Z][\w|\']*)+)|([A-Z][\w]*)', text)  

The output I get is:

[('', 'I'),
 ('', 'Regex'),
 ('', 'How'),
 ('', 'I'),
 ("Capital I's", ''),
 ('', 'I'),
 ('', 'Capital'),
 ('International Business Machine', '')]

But I want my output to be:

[('Regex'),
 ('How'),
 ("Capital"),
 ('Capital'),
 ('International Business Machine')] 

How do I get rid of the "I"s and of the extra fields that appear because I used two regex operations?

Thanks

2 answers:

Answer 0: (score: 2)

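Just match a word that starts with a capital letter followed by one or more word characters, then add a pattern that matches the following words of the same shape (each starting with a capital letter), and let that pattern repeat zero or more times. That way it matches strings like Foo and Foo Bar Buzz. Because each word needs at least one word character after the capital letter, a lone "I" is never matched.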

>>> text= "I have a problem in Regex, How do I get rid of the Capital I's provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine "
>>> import re
>>> re.findall(r'\b[A-Z]\w+(?:\s+[A-Z]\w+)*', text)
['Regex', 'How', 'Capital', 'Capital', 'International Business Machine']
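
To make the parts of that pattern easier to see, here is a minimal sketch that spells it out with `re.VERBOSE` and contrasts it with a two-group pattern in the style of the question. The `sample` sentence and the names `capitalized_run` and `grouped` are made up for illustration. The extra fields in the question come from `re.findall` returning one tuple per match whenever the pattern contains more than one capturing group.

```python
import re

# Shortened sample sentence, invented for this illustration only.
sample = "How do I reach International Business Machine from Regex"

# Same pattern as above, written with re.VERBOSE so each part can carry a comment.
capitalized_run = re.compile(r"""
    \b[A-Z]\w+          # a capital letter followed by at least one more word character
    (?:\s+[A-Z]\w+)*    # zero or more further capitalized words (non-capturing group)
""", re.VERBOSE)

# No capturing groups, so findall returns plain strings.
print(capitalized_run.findall(sample))
# ['How', 'International Business Machine', 'Regex']

# Two capturing groups, so findall returns a tuple per match,
# with '' for the group that did not take part in the match.
grouped = re.findall(r"([A-Z]\w*(?:\s+[A-Z]\w*)+)|([A-Z]\w*)", sample)
print(grouped)
# [('', 'How'), ('', 'I'), ('International Business Machine', ''), ('', 'Regex')]
```

Using `\w+` instead of `\w*` is what drops the single-letter "I", and using `(?:...)` instead of `(...)` is what keeps the result a flat list.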

Answer 1: (score: 1)

If you also want to match apostrophes (as in your example), you can try:

(?:[A-Z](?:[\w]|(?<=\w\w)\')+\s?)+


It matches ' only if it is preceded by at least two word characters. Not a very fancy solution, but it works. Then:

import re

text = 'I have a problem in Regex, How do I get rid of the Capital I\'s provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine'
# The lookbehind (?<=\w\w) allows an apostrophe only after at least two word characters.
found = re.findall(r"(?:[A-Z](?:\w|(?<=\w\w)')+\s?)+", text)
print(found)

also gives the result:

['Regex', 'How ', 'Capital ', 'Capital ', 'International Business Machine']
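
Some of these matches end with a space, because the `\s?` at the end of each repetition also consumes the whitespace after the last capitalized word. One possible way to tidy that up is to strip each match afterwards. This is only a sketch: the names `pattern` and `cleaned` and the short apostrophe sample are made up for illustration.

```python
import re

text = ("I have a problem in Regex, How do I get rid of the Capital I's provided "
        "I want to retain words occurring together as logical entity with a Capital "
        "letter in the beginning of each word like International Business Machine")

# Same regex as above; the apostrophe is allowed only after two word characters.
pattern = r"(?:[A-Z](?:\w|(?<=\w\w)')+\s?)+"

# Strip the whitespace that the trailing \s? may have captured.
cleaned = [match.strip() for match in re.findall(pattern, text)]
print(cleaned)
# ['Regex', 'How', 'Capital', 'Capital', 'International Business Machine']

# Invented sample: "I's" is skipped, while a longer word keeps its apostrophe.
print(re.findall(pattern, "I's versus Regex's"))
# ["Regex's"]
```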