I am trying to extract sentences from a paragraph formatted like this:
Current. time is six thirty at Scotland. Past. time was five thirty at India; Current. time is five thirty at Scotland. Past. time was five thirty at Scotland. Current. time is five ten at Scotland.
When I use the regex
/current\..*scotland\./i
it matches the whole string:
Current. time is six thirty at Scotland. Past. time was six thirty at India; Current. time is five thirty at Scotland. Past. time was five thirty at Scotland. Current. time is five ten at Scotland.
Instead, I want to stop at the first occurrence of "." for all capture groups, like:
Current. time is six thirty at Scotland.
Current. time is five ten at Scotland.
Similarly, for text like
Past. time was five thirty at India; Current. time is six thirty at Scotland. Past. time was five thirty at Scotland. Past. time was five ten at India;
when I use the regex
/past\..*india\;/i
it matches the whole string:
Past. time was five thirty at India; Current. time is six thirty at Scotland. Past. time was five thirty at Scotland. Past. time was five ten at India;
Here, I want to capture all groups (or just the first group) as shown below. How do I stop at the first occurrence of ";"?
Past. time was five thirty at India;
Past. time was five ten at India;
How do I make the regex stop at "." or ";" in the examples above?
Answer 0 (score: 12)
There are a few things you shouldn't do with your regex. First, as Amal Murali pointed out, you shouldn't use a greedy quantifier; use the lazy version instead:
/current\..*?scotland\./i
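To make the greedy versus lazy difference concrete, here is a minimal Ruby sketch run against the first sample paragraph from the question. Ruby is an assumption (the /.../i literals are valid in it), and the constant names PARAGRAPH, GREEDY and LAZY are illustrative only.

```ruby
# First sample paragraph, copied from the question.
PARAGRAPH = "Current. time is six thirty at Scotland. " \
            "Past. time was five thirty at India; " \
            "Current. time is five thirty at Scotland. " \
            "Past. time was five thirty at Scotland. " \
            "Current. time is five ten at Scotland."

GREEDY = /current\..*scotland\./i   # .* consumes as much as it can
LAZY   = /current\..*?scotland\./i  # .*? stops at the first "scotland."

# Greedy: a single match that spans the whole paragraph.
puts PARAGRAPH.scan(GREEDY).length

# Lazy: one match per "Current. ... Scotland." sentence.
PARAGRAPH.scan(LAZY).each { |sentence| puts sentence }
```

On this input the greedy pattern returns the entire paragraph as one match, while the lazy pattern returns the three "Current. ... Scotland." sentences separately.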
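I think reaching for the lazy version first is a good general rule, since it is usually what you want. Second, you really don't want to use "." to match everything, because you don't want that part of the regex to match "." or ";". You can use a negated character class to match anything except those characters: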
/current\.[^.]*?scotland\./i
and
/past\.[^;]*?india;/i
or, to cover both:
/(current|past)\.[^.;]*?(india|scotland)[.;]/i
(Obviously this may not be what you want to do; it is included only to demonstrate how to extend the pattern.)
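The effect of the negated character class can be checked with a short Ruby sketch on the second sample text from the question. With a lazy quantifier alone, a match that starts at a "Past." sentence not ending in "India;" runs on into the next sentence; excluding "." and ";" prevents that. The constant names are illustrative, and the combined pattern is adapted to use non-capturing groups (?:...), because String#scan returns only the captured groups when a pattern contains capture groups.

```ruby
# Second sample text, copied from the question.
PAST_TEXT = "Past. time was five thirty at India; " \
            "Current. time is six thirty at Scotland. " \
            "Past. time was five thirty at Scotland. " \
            "Past. time was five ten at India;"

LAZY_ONLY = /past\..*?india;/i      # lazy, but "." can still cross sentences
NEGATED   = /past\.[^.;]*?india;/i  # the middle part cannot cross "." or ";"

# Adapted combined pattern: non-capturing groups so scan yields full matches.
COMBINED  = /(?:current|past)\.[^.;]*?(?:india|scotland)[.;]/i

p PAST_TEXT.scan(LAZY_ONLY)  # second match spans two sentences
p PAST_TEXT.scan(NEGATED)    # only the two "Past. ... India;" sentences
p PAST_TEXT.scan(COMBINED)   # every sentence in the text
```

If the capture groups are kept as in the original combined pattern, scan returns pairs such as ["Past", "India"] rather than whole sentences, which may or may not be what is wanted.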
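It's also a good rule of thumb: if you're having trouble with a regex, make any wildcards more specific (in this case, changing from ".", which matches everything, to "[^.;]", which matches everything except "." and ";").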
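Answer 1 (score: 3)
As Amal said, your pattern is greedy; append a ? to the quantifier to make it lazy. I would use the following to get the first string you asked for: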
/^.*?current\..*?scotland\./i
This one gets every group that follows the pattern, taking ';' into account as well as '.':
/current\..*?scotland[.;]/i
This last one basically means: find any occurrence of 'current' and stop when you reach the first 'scotland' followed by a '.' or a ';'.
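A small Ruby sketch of the two patterns above, assuming the second sample text from the question as input. It uses String#[] with a regex to fetch the first match; the constant names are illustrative. One thing it shows is that the leading ^.*? pulls any text that precedes "Current." into the match, which matters when the string does not start with it.

```ruby
# Second sample text from the question; it starts with "Past.", not "Current.".
SAMPLE = "Past. time was five thirty at India; " \
         "Current. time is six thirty at Scotland. " \
         "Past. time was five thirty at Scotland. " \
         "Past. time was five ten at India;"

ANCHORED   = /^.*?current\..*?scotland\./i
UNANCHORED = /current\..*?scotland[.;]/i

# String#[] returns the first match, or nil when nothing matches.
puts SAMPLE[ANCHORED]    # includes the leading "Past. ... India;" sentence
puts SAMPLE[UNANCHORED]  # only the "Current. ... Scotland." sentence
```

When the input begins with "Current.", as the first sample in the question does, both patterns return the same first sentence.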
Answer 2 (score: 3)
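s = "Current. time is six thirty at Scotland. Past. time..."
# (?:Current|Past) is an alternation; [Current|Past] would be a character class.
# The group is non-capturing so that scan returns the whole matched sentence.
s.scan(/(?:Current|Past)\..*?[.;]/i)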
#=> ["Current. time is six thirty at Scotland.", "Past. time was five thirty at India;",...]