I have Python code like the following to search for all English names:
import re
a = 'Bonds met Susann ("Sun") Margreth Branco, the mother of his first two children, in {{city-state|Montreal|Quebec}} in August 1987. They eloped in {{city-state|Las Vegas|Nevada}} Barry Bonds'
re.findall(r"(?:[A-Z][a-z'.]+\s*){1,4}", a)
I would like it to return:
['Bonds', 'Susann ("Sun") Margreth Branco', 'Montreal', 'Quebec', 'August', 'They', 'Las Vegas','Nevada','Barry Bonds']
My code does not produce this result. How should I modify the regular expression to get it?
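I should add that I also tried another regular expression: (?:(([A-Z][a-z'.]+)|(\(".*"\)))\s*){1,4}
I tested it on regexpal.com, where it matched what I wanted. In Python, however, it does not return that: it gives me Susan, ("Sun") Margreth, and a third fragment as three separate pieces, whereas I want the whole name through Branco as a single result.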
Answer 0 (score: 1)
As you mentioned, a token containing a double quote (") should also be treated as a separator between name parts:
import re
re.findall(r'[A-Z][a-z]*(?:(?:\S*"\S*|\s)+[A-Z][a-z]*){0,3}', 'Bonds met Susann ("Sun") Margreth Branco, the mother of his first two children, in {{city-state|Montreal|Quebec}} in August 1987. They eloped in {{city-state|Las Vegas|Nevada}} Barry Bonds')
Output:
['Bonds', 'Susann ("Sun") Margreth Branco', 'Montreal', 'Quebec', 'August', 'They', 'Las Vegas', 'Nevada', 'Barry Bonds']
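As a side note on why the second pattern from the question may behave differently in Python than on regexpal.com: when a pattern contains capturing groups, `re.findall` returns the contents of those groups rather than the whole match. The sketch below is only an illustration, not part of the original answer. The names `TEXT`, `NAME_RE`, `GROUPED_RE`, `find_names`, and `whole_matches` are made up for this example. It runs the answer's pattern, then shows that the question's grouped pattern does match the full name once whole matches are read with `re.finditer` and `group(0)`.

```python
import re

# Sample sentence from the question, written with single quotes so the
# embedded double quotes do not need escaping.
TEXT = ('Bonds met Susann ("Sun") Margreth Branco, the mother of his first two '
        'children, in {{city-state|Montreal|Quebec}} in August 1987. They eloped '
        'in {{city-state|Las Vegas|Nevada}} Barry Bonds')

# Pattern from the answer: one capitalized word, then up to three more,
# each preceded by whitespace and/or a token that contains a double quote.
# It uses only non-capturing groups, so findall returns whole matches.
NAME_RE = re.compile(r'[A-Z][a-z]*(?:(?:\S*"\S*|\s)+[A-Z][a-z]*){0,3}')

# Second pattern from the question. It has three capturing groups, so
# findall returns a tuple of group contents for each match.
GROUPED_RE = re.compile(r'''(?:(([A-Z][a-z'.]+)|(\(".*"\)))\s*){1,4}''')


def find_names(text):
    """Return every whole match of the answer's pattern."""
    return NAME_RE.findall(text)


def whole_matches(pattern, text):
    """Return whole matches even if the pattern has capturing groups."""
    # The grouped pattern consumes trailing whitespace, hence the strip().
    return [m.group(0).strip() for m in pattern.finditer(text)]


print(find_names(TEXT))
print(GROUPED_RE.findall(TEXT)[1])       # a tuple of groups, not the full name
print(whole_matches(GROUPED_RE, TEXT))
```

If the grouped pattern is preferred, another option is to turn its inner groups into non-capturing ones with `(?:...)`, so that `re.findall` returns whole matches directly.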