Question

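I am new to regex. Given the following phrase, I want to get rid of the standalone I's and of the extra empty fields that appear because I am using two regex alternatives.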

text= "I have a problem in Regex, How do I get rid of the Capital I's provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine "

For example, I want to keep "International Business Machine" as "International Business Machine", but "Capital I" should not be kept as "Capital I"; it should come out as "Capital".

I used the following regex:

re.findall('([A-Z][\w\']*(?:\s+[A-Z][\w|\']*)+)|([A-Z][\w]*)', text)

The output I get is:

[('', 'I'),
 ('', 'Regex'),
 ('', 'How'),
 ('', 'I'),
 ("Capital I's", ''),
 ('', 'I'),
 ('', 'Capital'),
 ('International Business Machine', '')]

But I want my output to be:

[('Regex'),
 ('How'),
 ("Capital"),
 ('Capital'),
 ('International Business Machine')]

How do I get rid of the "I" matches and the extra fields that appear because of the two regex operations?

Thanks

Answer 1

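Just match a word that starts with a capital letter followed by one or more word characters, then add a pattern that matches any following words of the same shape (each starting with a capital letter), and make that pattern repeat zero or more times. This way it matches strings like Foo as well as Foo Bar Buzz. Because \w+ requires at least one word character after the capital, a standalone I never matches.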

>>> text= "I have a problem in Regex, How do I get rid of the Capital I's provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine "
>>> import re
>>> re.findall(r'\b[A-Z]\w+(?:\s+[A-Z]\w+)*', text)
['Regex', 'How', 'Capital', 'Capital', 'International Business Machine']
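
As a side note on where the empty fields in the question came from: re.findall returns a tuple per match whenever the pattern contains more than one capturing group, and the group that did not take part in the match shows up as an empty string. The sketch below illustrates this on a shortened version of the sample text; the simplified two-group pattern and the variable names are my own, not the asker's exact code.

```python
import re

# Shortened sample text (an assumption for illustration).
text = ("I have a problem in Regex, How do I get rid of the Capital I's "
        "provided I want to retain words like International Business Machine ")

# Two capturing groups: findall returns one tuple per match,
# with '' for the group that did not participate.
with_groups = re.findall(r"([A-Z]\w+(?:\s+[A-Z]\w+)+)|([A-Z]\w+)", text)

# Only a non-capturing group: findall returns each whole match as a string.
without_groups = re.findall(r"\b[A-Z]\w+(?:\s+[A-Z]\w+)*", text)

# If the two-group pattern has to stay, the tuples can be collapsed afterwards.
flattened = [multi or single for multi, single in with_groups]

print(with_groups)
print(without_groups)
print(flattened)
```

With a single pattern and only (?:...) groups there is nothing to collapse, which is why the one-liner above returns a flat list directly.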

Answer 2

If you also want to match apostrophes (as in your example), you can try:

(?:[A-Z](?:[\w]|(?<=\w\w)\')+\s?)+


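It matches ' only if the apostrophe is preceded by at least two word characters, so the ' in I's is rejected. Not a very fancy solution, but it works. Then: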

import re
text = 'I have a problem in Regex, How do I get rid of the Capital I\'s provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine'
# Raw string so that \w and \s reach the regex engine unchanged
found = re.findall(r"(?:[A-Z](?:\w|(?<=\w\w)')+\s?)+", text)
print(found)

This gives almost the same result (note the trailing spaces on some matches):

['Regex', 'How ', 'Capital ', 'Capital ', 'International Business Machine']
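
The trailing spaces in 'How ' and 'Capital ' come from the optional \s? inside the repeated group. If they are unwanted, one option is to strip each match afterwards. The following is a minimal sketch of that; the names pattern and cleaned, and the extra "Tom's Diner" sample, are my own additions for illustration.

```python
import re

text = ("I have a problem in Regex, How do I get rid of the Capital I's "
        "provided I want to retain words occurring together as logical entity "
        "with a Capital letter in the beginning of each word like "
        "International Business Machine")

# Same pattern as in the answer, written as a raw string.
pattern = r"(?:[A-Z](?:\w|(?<=\w\w)')+\s?)+"

found = re.findall(pattern, text)

# Drop the whitespace that \s? may have consumed at the end of a match.
cleaned = [match.strip() for match in found]
print(cleaned)

# The lookbehind admits an apostrophe once two word characters precede it.
print(re.findall(pattern, "we ate at Tom's Diner"))
```

Stripping after the fact keeps the pattern itself unchanged, which seems simpler than reworking the regex to avoid consuming the final space.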

Getting rid of some entities using regex in Python
