Extract the words/sentence that occur before a keyword in a string - Python

Asked: 2018-02-23 18:16:40

Tags: python regex keyword matching

I have a string like this:

my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'

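Now, I want to extract the current champion and the underdog using the keywords champion and underdog.

What makes this challenging is that each contender's name appears before its keyword, and the keyword itself is inside parentheses. I want to extract the information with a regular expression.

Here is what I tried: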

champion = re.findall(r'("champion"[^.]*.)', my_str)
print(champion)

>> ['"champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").']


underdog = re.findall(r'("underdog"[^.]*.)', my_str)
print(underdog)

>>['"underdog").']

However, I need the result for champion as

brooklyn centenniel, resident of detroit, michigan

and for underdog as

kamil kubaru, the challenger from alexandria, virginia

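How can I do this with a regular expression? (I have been searching for a way to go back a couple of words from the keyword to get the result I want, but no luck.) Any help or suggestions would be appreciated.

2 Answers:

Answer 0: (score: 1)

You can use named capture groups to capture the desired results: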

between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)
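  • between\s+(?P<champion>.*?)\s+\("champion"\) matches the chunk from between up to ("champion") and puts the desired part, matched lazily by .*?, in the named capture group champion

  • After that, \s+and\s+(?P<underdog>.*?)\s+\("underdog"\) matches the chunk up to ("underdog") and again captures the desired part, this time as the named capture group underdog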

Example:

In [26]: my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'

In [27]: out = re.search(r'between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)', my_str)

In [28]: out.groupdict()
Out[28]: 
{'champion': 'brooklyn centenniel, resident of detroit, michigan',
 'underdog': 'kamil kubaru, the challenger from alexandria, virginia'}
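
For use outside an interactive session, the same pattern can be wrapped in a small function. This is only a sketch: the helper name extract_contenders is made up for illustration, and it assumes the text always follows the "between ... ("champion") and ... ("underdog")" layout of the question's string. Since re.search returns None when nothing matches, the helper checks for that before calling groupdict().

```python
import re

my_str = ('·in this match, dated may 1, 2013 (the "the match") is between '
          'brooklyn centenniel, resident of detroit, michigan ("champion") '
          'and kamil kubaru, the challenger from alexandria, virginia ("underdog").')

# Same pattern as in this answer, compiled once so it can be reused.
PATTERN = re.compile(
    r'between\s+(?P<champion>.*?)\s+\("champion"\)'
    r'\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)'
)


def extract_contenders(text):
    """Hypothetical helper: return the two named groups as a dict, or None if there is no match."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None


result = extract_contenders(my_str)
print(result)
print(extract_contenders("no contenders here"))  # re.search finds nothing, so this is None
```

The None check matters because calling groupdict() directly on a failed search would raise AttributeError.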

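Answer 1: (score: 0)

There will be better answers than this, and I don't know regex at all, but I'm bored, so here are my 2 cents.

Here is how I would go about it: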

words = my_str.split()
index = words.index('("champion")')
champion = words[index - 6:index]
champion = " ".join(champion)

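For the underdog, you have to change the 6 to 7, and '("champion")' to '("underdog").'

I'm not sure this solves your problem in general, but for this particular string it worked when I tested it.

If the trailing period on the underdog keyword is a problem, you can also use str.strip() to remove the punctuation.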
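
Putting the pieces of this answer together, a minimal sketch could look like the following. The function name words_before is invented for illustration, and the word counts 6 and 7 are specific to this exact string, so this is not a general solution. It uses rstrip('.') (a one-sided variant of the str.strip() idea) on every token so that both keywords can be looked up in the same form.

```python
my_str = ('·in this match, dated may 1, 2013 (the "the match") is between '
          'brooklyn centenniel, resident of detroit, michigan ("champion") '
          'and kamil kubaru, the challenger from alexandria, virginia ("underdog").')


def words_before(text, keyword, count):
    """Hypothetical helper: join the `count` words that come right before `keyword`."""
    # Drop a trailing period from each token, so '("underdog").' becomes '("underdog")'.
    words = [word.rstrip('.') for word in text.split()]
    index = words.index(keyword)  # raises ValueError if the keyword is missing
    return " ".join(words[max(index - count, 0):index])


# The counts 6 and 7 are hard-coded for this particular string.
champion = words_before(my_str, '("champion")', 6)
underdog = words_before(my_str, '("underdog")', 7)
print(champion)
print(underdog)
```

Because the number of words has to be known in advance, this approach breaks as soon as a name or description has a different length; the regex answer does not have that limitation.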