Regex lookahead/lookbehind pattern for punctuation

Asked: 2015-01-31 18:16:36

Tags: ruby regex

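Using Ruby, I would like to find a regex that correctly identifies sentence boundaries, which I define as any string ending in [.!?], except when those punctuation marks occur inside quotation marks, as in

My friend said "John isn't here!" and then he left.

My current code is: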

text = para.text.scan(/[^\.!?]+[(?<!(.?!)\"|.!?] /).map(&:strip)

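I have pored over the regex documentation, but I still can't work out the correct lookbehind/lookahead.

2 answers:

Answer 0 (score: 2)

How about something like this?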

/(?:"(?>[^"]|\\.)+"|[a-z]\.[a-z]\.|[^.?!])+[!.?]/gi

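Demo: https://regex101.com/r/bJ8hM5/2

How it works: at each position in the string, the regex checks for the following:

  1. A quoted string of the form "quote", which can contain anything up to the closing quote. You can also escape quotes, e.g. "hell\"o".
  2. Any letter, followed by a dot, then another letter, and finally a dot. This is to match the special case of U.S.
  3. Anything else that is not one of the punctuation characters .?!
  4. Repeat until we reach a punctuation character.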
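
The pattern above is written in regex101 notation, and Ruby regex literals have no g flag. Below is a minimal sketch of how it might be used from Ruby with String#scan. The names SENTENCE_RE and scan_sentences are made up for illustration, and the escape branch is tried before the plain-character branch, which is an adjustment to the pattern as posted rather than part of it.

```ruby
# Sketch: Ruby has no /g flag, so String#scan collects every match instead.
# Assumption: the escape branch (\\.) is tried before [^"\\], which differs
# from the pattern as posted.
SENTENCE_RE = /(?:"(?>\\.|[^"\\])+"|[a-z]\.[a-z]\.|[^.?!])+[!.?]/i

# Hypothetical helper name, not from the original answer.
def scan_sentences(text)
  # All groups are non-capturing, so scan returns whole matches as strings.
  text.scan(SENTENCE_RE).map(&:strip)
end

sample = 'My friend said "John isn\'t here!" and then he left. He lives in the U.S. now.'
scan_sentences(sample).each { |sentence| puts sentence }
```

On this sample the sketch should print two sentences: the "!" inside the quotes and the dots in "U.S." do not end a sentence. Note that a sentence that actually ends with "U.S." would run on into the next one.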

Answer 1 (score: 1)

Here is a partial-regex solution that ignores sentence terminators contained between double quotes.

Code

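def extract_sentences(str, da_terminators)
  # Does the text open with a double-quoted block?
  start_with_quote = (str[0] == '"')
  # 1. Split on quoted blocks; the capture group keeps them in the result.
  # 2. Leave quoted blocks whole; split the rest on terminators (kept too).
  # 3. Close a group after each lone terminator, then join and strip.
  str.split(/(\".*?\")/)
     .flat_map.with_index { |b,i|
       (start_with_quote == i.even?) ? b : b.split(/([#{da_terminators}])/) }
     .slice_after(/^[#{da_terminators}]$/)
     .map { |sb| sb.join.strip }
end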

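Example

str = %q{My friend said "John isn't here!", then "I'm outta' here" and then he left. Let's go! Later, he said "Aren't you coming?"}
puts extract_sentences(str, '!?.')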
  # My friend said "John isn't here!", then "I'm outta' here" and then he left.
  # Let's go!
  # Later, he said "Aren't you coming?"

Explanation

For the str above:

da_terminators = '!?.'

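We will need the following later:

start_with_quote = (str[0] == '"')
  #=> false

Split the string on "...". We need to make \".*?\" a capture group so that split keeps it. The result is an array, blocks, that alternates between strings enclosed in double quotes and other strings. start_with_quote tells us which is which.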

blocks = str.split(/(\".*?\")/)
  #=> ["My friend said ",
  #    "\"John isn't here!\"",
  #    ", then ",
  #    "\"I'm outta' here\"",
  #    " and then he left. Let's go! Later, he said ",
  #    "\"Aren't you coming?\""]
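
To see why the capture group matters, here is a small sketch on a made-up string (the variable names are only for illustration). Without the group, split discards the quoted text; with it, the quoted text is kept as its own element.

```ruby
s = 'He said "Stop!" and left.'

# No capture group: the matched (quoted) text is thrown away.
without_group = s.split(/".*?"/)
p without_group  #=> ["He said ", " and left."]

# Capture group: the matched text is returned between the other pieces.
with_group = s.split(/(".*?")/)
p with_group     #=> ["He said ", "\"Stop!\"", " and left."]
```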

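Split the string elements that are not enclosed in double quotes, on any sentence-terminating character. Again, the terminator must be in a capture group so that the separator is kept.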

new_blocks = blocks.flat_map.with_index { |b,i|
  (start_with_quote == i.even?) ? b : b.split(/([#{da_terminators}])/) }
  #=> ["My friend said ",
  #    "\"John isn't here!\"",
  #    ", then ",
  #    "\"I'm outta' here\"",
  #    " and then he left",
  #    ".",
  #    " Let's go",
  #    "!",
  #    " Later, he said ",
  #    "\"Aren't you coming?\""]

sentence_blocks_enum = new_blocks.slice_after(/^[#{da_terminators}]$/)
  # #<Enumerator:0x007f9a3b853478>

Convert this enumerator to an array to see what it will pass to its block:

sentence_blocks_enum.to_a
  #=> [["My friend said ",
  #     "\"John isn't here!\"",
  #     ", then ",
  #     "\"I'm outta' here\"",
  #     " and then he left", "."],
  #    [" Let's go", "!"],
  #    [" Later, he said ", "\"Aren't you coming?\""]] 
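
For anyone who has not used slice_after (Ruby 2.2+), a smaller sketch with made-up pieces may help. A group is closed after every element that matches the pattern, and any leftover elements form a final group.

```ruby
pieces = ["One", ".", " Two", "!", " three"]

# The pattern is tested with ===, so only elements that are exactly one
# terminator character close a group.
grouped = pieces.slice_after(/\A[.!?]\z/).to_a
p grouped  #=> [["One", "."], [" Two", "!"], [" three"]]
```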

Join the blocks of each sentence, strip the whitespace, and return the array:

sentence_blocks_enum.map { |sb| sb.join.strip }
  #=> ["My friend said \"John isn't here!\", then \"I'm outta' here\" and then he left.",
  #    "Let's go!",
  #    "Later, he said \"Aren't you coming?\""]
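
One case worth checking is text that begins with a double quote. As far as I can tell, split then returns a leading empty string, so the quoted blocks still sit at the odd indices. The sketch below is a variant under that assumption, not the answer's code: the method name extract_sentences_by_index is made up, it decides by i.odd? alone, and it escapes the terminators before interpolating them into the patterns.

```ruby
# Hypothetical variant: quoted blocks are assumed to land at odd indices,
# because split keeps a leading "" when the text opens with a quote.
def extract_sentences_by_index(str, terminators)
  escaped = Regexp.escape(terminators)
  str.split(/(".*?")/)
     .flat_map.with_index { |block, i|
       # Odd index: quoted block, keep whole. Even index: split on
       # terminators, keeping each terminator as its own element.
       i.odd? ? block : block.split(/([#{escaped}])/) }
     .slice_after(/\A[#{escaped}]\z/)
     .map { |pieces| pieces.join.strip }
end

leading_quote = '"Wait!" she said. Then she left.'
p extract_sentences_by_index(leading_quote, '!?.')
  #=> ["\"Wait!\" she said.", "Then she left."]
```

On the str used in the explanation, this variant should return the same three sentences as shown above.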