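I'm new to regex. Given the following phrase, I want to get rid of the standalone I's and of the extra empty fields that appear because I'm using two regex alternatives.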
text= "I have a problem in Regex, How do I get rid of the Capital I's provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine "
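For example, I want to keep "International Business Machine" as "International Business Machine", but I do not want "Capital I" kept as "Capital I"; it should be just "Capital".
I used the regex below: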
re.findall('([A-Z][\w\']*(?:\s+[A-Z][\w|\']*)+)|([A-Z][\w]*)', text)
The output I get is:
[('', 'I'),
('', 'Regex'),
('', 'How'),
('', 'I'),
("Capital I's", ''),
('', 'I'),
('', 'Capital'),
('International Business Machine', '')]
But I want my output to be:
[('Regex'),
('How'),
("Capital"),
('Capital'),
('International Business Machine')]
How do I get rid of the I's and of the extra fields that appear because I'm using two regex operations?
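Answer 0: (score: 2)
Just match a word that starts with a capital letter followed by one or more word characters, then add a pattern that matches the following words in the same way (each starting with a capital letter), and make that pattern repeat zero or more times. That way it matches strings like Foo or Foo Bar Buzz. Because \w+ requires at least one word character after the capital, a lone I (including the I in I's) is never matched.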
>>> text= "I have a problem in Regex, How do I get rid of the Capital I's provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine "
>>> import re
>>> re.findall(r'\b[A-Z]\w+(?:\s+[A-Z]\w+)*', text)
['Regex', 'How', 'Capital', 'Capital', 'International Business Machine']
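As for the extra empty fields in the question: re.findall returns a tuple per match whenever the pattern contains more than one capturing group, and the group that did not take part shows up as ''. The pattern above uses only a non-capturing group (?:...), so it returns plain strings. A minimal sketch of the difference, assuming a shortened sample sentence of my own rather than the full text:

```python
import re

# Shortened sample sentence (an assumption for illustration, not the asker's full text).
sample = "I work at International Business Machine on Regex"

# Two capturing groups joined by |: findall returns one tuple per match,
# with '' in the slot of the group that did not participate.
two_groups = re.findall(r"([A-Z]\w*(?:\s+[A-Z]\w*)+)|([A-Z]\w*)", sample)

# Only a non-capturing group: findall returns each whole match as a plain string.
no_groups = re.findall(r"\b[A-Z]\w+(?:\s+[A-Z]\w+)*", sample)

print(two_groups)  # [('', 'I'), ('International Business Machine', ''), ('', 'Regex')]
print(no_groups)   # ['International Business Machine', 'Regex']
```

The second pattern also drops the lone I, since \w+ needs at least one more word character after the capital letter.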
Answer 1: (score: 1)
If you also want to match apostrophes (as in your example), you could try:
(?:[A-Z](?:[\w]|(?<=\w\w)\')+\s?)+
It matches ' only if it is preceded by at least two word characters. Not a very fancy solution, but it works. Then:
import re
text = 'I have a problem in Regex, How do I get rid of the Capital I\'s provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine'
# Raw string so that \w and \s reach the regex engine unchanged
found = re.findall(r"(?:[A-Z](?:[\w]|(?<=\w\w)\')+\s?)+", text)
print(found)
This gives the result:
['Regex', 'How ', 'Capital ', 'Capital ', 'International Business Machine']
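Note that some matches carry a trailing space, because the \s? at the end of each word also consumes the space after the last word of a match. If that matters, one option is to strip each match afterwards. A minimal sketch, again assuming a shortened sample string of my own:

```python
import re

# Same pattern as above, written as a raw string.
pattern = r"(?:[A-Z](?:[\w]|(?<=\w\w)\')+\s?)+"

# Shortened sample sentence (an assumption for illustration).
sample = "How do I get rid of the Capital I's like International Business Machine"

raw = re.findall(pattern, sample)
# Remove the whitespace that \s? picked up at the end of a match.
cleaned = [match.strip() for match in raw]

print(raw)      # ['How ', 'Capital ', 'International Business Machine']
print(cleaned)  # ['How', 'Capital', 'International Business Machine']
```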