Question

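Using Ruby, I'd like to find a regular expression that correctly identifies sentence boundaries, which I define as any string ending in [.!?], except when those punctuation marks appear inside quotes, as in

My friend said "John isn't here!" and then he left.

My current code is: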

text = para.text.scan(/[^\.!?]+[(?<!(.?!)\"|.!?] /).map(&:strip)

I've pored over the regex documentation, but I still can't work out the correct lookbehind/lookahead.

Answer 1

How about something like this?

/(?:"(?>[^"]|\\.)+"|[a-z]\.[a-z]\.|[^.?!])+[!.?]/gi

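Demo: https://regex101.com/r/bJ8hM5/2

How it works: at each position in the string, the regex checks for the following:

A quoted string of the form "quote", which can contain anything up to the closing quote. The pattern is also intended to allow escaped quotes, e.g. "hell\"o".
Any letter followed by a dot, then another letter, and finally a dot. This is there to match abbreviations such as U.S.
Anything else that is not one of the punctuation characters .?!
Repeat until we reach a punctuation character.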
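
The /gi suffix above is the notation used on the demo site. Ruby regex literals have no g flag, so in Ruby the global matching would come from String#scan instead. Here is a minimal sketch of that, assuming a made-up sample text (the constant name SENTENCE and the text are illustrative, not from the original post):

```ruby
# Answer 1's pattern as a Ruby literal: the g flag is dropped, the i flag is kept.
SENTENCE = /(?:"(?>[^"]|\\.)+"|[a-z]\.[a-z]\.|[^.?!])+[!.?]/i

# Hypothetical sample text (an assumption for illustration).
text = 'My friend said "John isn\'t here!" and then he left. He lives in the U.S. now. Really?'

# scan returns every non-overlapping match; strip removes the leading space
# left over from the gap between sentences.
sentences = text.scan(SENTENCE).map(&:strip)
sentences.each { |s| puts s }
```

Because the pattern contains only non-capturing and atomic groups, scan returns whole matches as plain strings rather than arrays of captures.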

Answer 2

Here is a partial-regex solution that ignores sentence terminators contained between double quotes.

Code

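def extract_sentences(str, da_terminators)
  # Remember whether the string begins with a double quote.
  start_with_quote = (str[0] == '"')
  # Split on quoted blocks; the capture group keeps them in the result.
  str.split(/(\".*?\")/)
     .flat_map.with_index { |b,i|
       # Blocks selected by start_with_quote are kept whole; the others are
       # split on the terminators, which the capture group also keeps.
       (start_with_quote == i.even?) ? b : b.split(/([#{da_terminators}])/) }
     .slice_after(/^[#{da_terminators}]$/)
     .map { |sb| sb.join.strip }
end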

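Example

str = "My friend said \"John isn't here!\", then \"I'm outta' here\" and then he left. Let's go! Later, he said \"Aren't you coming?\""

puts extract_sentences(str, '!?.')
# My friend said "John isn't here!", then "I'm outta' here" and then he left.
# Let's go!
# Later, he said "Aren't you coming?"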

Explanation

For str above and

da_terminators = '!?.'

we will need the following later:

start_with_quote = (str[0] == '"') #=> false

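Split the string on "...". We need to make \".*?\" a capture group so that the quoted text is kept in the result of split. The result is an array, blocks, that alternates between strings surrounded by double quotes and other strings. start_with_quote tells us which is which.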

blocks = str.split(/(\".*?\")/)
  #=> ["My friend said ",
  #    "\"John isn't here!\"",
  #    ", then ",
  #    "\"I'm outta' here\"",
  #    " and then he left. Let's go! Later, he said ",
  #    "\"Aren't you coming?\""]
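
To see why the capture group matters, here is a small sketch on a hypothetical string (the string and variable names are assumptions for illustration). Without the group, split discards whatever the pattern matched. With it, the matched text is returned as its own element.

```ruby
# Hypothetical sample string (not from the original post).
s = 'He said "Stop!" and left.'

without_group = s.split(/".*?"/)    # the quoted text is discarded
with_group    = s.split(/(".*?")/)  # the quoted text is kept as its own element

p without_group
p with_group
```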

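Split the string elements that are not surrounded by double quotes. The split is on any of the sentence-terminating characters. Again, the pattern must be in a capture group in order to keep the separator.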

new_blocks = blocks.flat_map.with_index { |b,i|
  (start_with_quote == i.even?) ? b : b.split(/([#{da_terminators}])/) }
  #=> ["My friend said ",
  #    "\"John isn't here!\"",
  #    ", then ",
  #    "\"I'm outta' here\"",
  #    " and then he left",
  #    ".",
  #    " Let's go",
  #    "!",
  #    " Later, he said ",
  #    "\"Aren't you coming?\""]

sentence_blocks_enum = new_blocks.slice_after(/^[#{da_terminators}]$/)
  #=> #<Enumerator:0x007f9a3b853478>

Convert this enumerator to an array to see the elements it will pass to its block:

sentence_blocks_enum.to_a
  #=> [["My friend said ",
  #     "\"John isn't here!\"",
  #     ", then ",
  #     "\"I'm outta' here\"",
  #     " and then he left", "."],
  #    [" Let's go", "!"],
  #    [" Later, he said ", "\"Aren't you coming?\""]]

Join the blocks of each sentence, strip the surrounding whitespace, and return the array:

sentence_blocks_enum.map { |sb| sb.join.strip }
  #=> ["My friend said \"John isn't here!\", then \"I'm outta' here\" and then he left.",
  #    "Let's go!",
  #    "Later, he said \"Aren't you coming?\""]
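
One case worth checking is a string that begins with a quotation. String#split keeps a leading empty string, so the quoted blocks appear to land at odd indices whether or not str starts with a quote, and the start_with_quote test may then pick the wrong blocks for such input. The following is a sketch of a variant that relies on index parity alone. It is not the answer's own code: the method name extract_sentences_by_parity and the sample string are hypothetical, and the use of Regexp.escape is an added precaution.

```ruby
# Variant that decides by index parity only (hypothetical name, not the
# original answer's method).
def extract_sentences_by_parity(str, terminators)
  chars = Regexp.escape(terminators)
  str.split(/(".*?")/)
     .flat_map.with_index { |block, i|
       # Odd indices hold the quoted blocks captured by split; keep them whole.
       # Even indices hold the text in between; split it on the terminators.
       i.odd? ? [block] : block.split(/([#{chars}])/) }
     .slice_after(/\A[#{chars}]\z/)
     .map { |parts| parts.join.strip }
end

# Hypothetical input that starts with a quotation.
quoted_first = '"Wait!" she said. Then she left.'
result = extract_sentences_by_parity(quoted_first, '!?.')
p result
```

For input that does not start with a quote, this variant is intended to behave the same way as the walk-through above.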

Regex lookahead/lookbehind punctuation pattern