Regex to remove entity names

Time: 2016-01-04 16:31:51

Tags: python regex

Given the following tweets:

Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform

Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold

How can I write a regex that removes "by Cormark" and "by Zacks Investment Research"?

I tried:

"by ([A-Za-z ]+\w to)"

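in Python, but it also consumes the word "to". I want the regex to stop before it captures "to".

It would also be interesting if someone could show how to write a regex that captures camel-case names such as "Zacks Investment Research".

3 answers:

Answer 0 (score: 3)

You can use a positive look-ahead so that the match stops before the word to: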

>>> s1 = "Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform"
>>> 
>>> s2 = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
>>> 
>>> import re
>>> re.sub(r'by[\w\s]+(?=to)','',s1)
'Brick Brewing Co Limited (BRB) Downgraded to Market Perform'
>>> re.sub(r'by[\w\s]+(?=to)','',s2)
'Brinker International Inc (EAT) Upgraded to Hold'
>>> 

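Note that the regex [\w\s]+ matches any combination of word characters (letters, digits, and underscores) and whitespace. If you only want to match letters and whitespace, you can use [a-z\s] with the re.I flag (ignore case).

Answer 1 (score: 2)

To remove all capitalized words after by, you can use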
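As a minimal sketch of the letters-only variant described above: the helper name strip_rater is hypothetical, and the \b word boundaries and the lazy +? quantifier are added assumptions that are not part of the original answer. They keep "by" or "to" inside a longer word (for example "Stockton") from being treated as the delimiter.

```python
import re

# Letters and whitespace only, case-insensitive (re.I), as suggested above.
# Assumption: \b word boundaries and a lazy quantifier are added so the match
# stops at the first standalone word "to" rather than at "to" inside a word.
PATTERN = re.compile(r'\bby\s[a-z\s]+?(?=\bto\b)', re.I)

def strip_rater(text):
    """Remove 'by <name> ' when it is followed by the standalone word 'to'."""
    return PATTERN.sub('', text)

s1 = "Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform"
s2 = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
print(strip_rater(s1))
print(strip_rater(s2))
```

If the text contains no standalone "by ... to" sequence, the string is returned unchanged.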

by [A-Z][a-z]*(?: +[A-Z][a-z]*)*

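See the regex demo

Explanation

  • by - the literal sequence of 3 characters: b, y, and a space
  • [A-Z][a-z]* - a capitalized word (one uppercase letter followed by zero or more lowercase letters)
  • (?: +[A-Z][a-z]*)* - zero or more sequences of...
    • +[A-Z][a-z]* - one or more spaces followed by an uppercase letter, followed by zero or more lowercase letters.

You can replace the regular space in the pattern with \s to match any whitespace. Also, to match CaMeL words, you can replace every [a-z] with [a-zA-Z].

Answer 2 (score: 0)

You can also do this with the str method index, then slice and concatenate: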
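One way this pattern might be applied in Python is sketched below. The names CAP_WORDS and remove_capitalized_name are hypothetical, and the extra step that collapses repeated spaces is an assumption added here: the pattern removes "by <Name>" but leaves the spaces on both sides of it.

```python
import re

# Pattern from the answer above: "by " followed by one or more capitalized words.
CAP_WORDS = re.compile(r'by [A-Z][a-z]*(?: +[A-Z][a-z]*)*')

def remove_capitalized_name(text):
    """Drop 'by <Capitalized Words>' and collapse the leftover double space."""
    stripped = CAP_WORDS.sub('', text)
    # Assumption: runs of two or more spaces are squeezed into one.
    return re.sub(r' {2,}', ' ', stripped)

s1 = "Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform"
s2 = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
print(remove_capitalized_name(s1))
print(remove_capitalized_name(s2))
```

The match ends before "to" without a look-ahead because "to" starts with a lowercase letter and so cannot continue the sequence of capitalized words.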

>>> def remove_name(s):
        b = s.index(' by ')
        t = s.index(' to ')
        s = s[:b]+s[t:]
        return s
>>> 
>>> s = 'Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform'
>>> remove_name(s)
'Brick Brewing Co Limited (BRB) Downgraded to Market Perform'
>>> 
>>> s = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
>>> remove_name(s)
'Brinker International Inc (EAT) Upgraded to Hold'
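
Note that str.index raises ValueError when the substring is not found. A possible variant, assuming you would rather get the input back unchanged in that case, uses str.find instead (the name remove_name_safe is hypothetical):

```python
def remove_name_safe(s):
    """Like remove_name above, but return s unchanged if ' by ' or ' to ' is missing."""
    b = s.find(' by ')      # str.find returns -1 instead of raising
    if b == -1:
        return s
    t = s.find(' to ', b)   # only look for ' to ' from the ' by ' position onward
    if t == -1:
        return s
    return s[:b] + s[t:]

print(remove_name_safe('Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform'))
print(remove_name_safe('No rating in this tweet'))
```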