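Using Ruby, I want to find a regular expression that correctly identifies sentence boundaries, which I define as any string ending in one of [.!?], except when those punctuation marks appear inside quotation marks, as in
My friend said "John isn't here!" and then he left.
My current code is: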
text = para.text.scan(/[^\.!?]+[(?<!(.?!)\"|.!?] /).map(&:strip)
I have pored over the regex documentation, but I still can't work out the right lookbehind/lookahead.
Answer 0: (score: 2)
How about something like this?
/(?:"(?>[^"]|\\.)+"|[a-z]\.[a-z]\.|[^.?!])+[!.?]/gi
演示:https://regex101.com/r/bJ8hM5/2
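The /g flag in the pattern above is regex101 notation. Ruby has no such flag; String#scan returns every match instead. Here is a minimal sketch of trying the pattern from Ruby. The sample text and the variable names are made up for illustration and are not part of the answer.

```ruby
# Same pattern as above, with only the /i flag (Ruby has no /g).
SENTENCE = /(?:"(?>[^"]|\\.)+"|[a-z]\.[a-z]\.|[^.?!])+[!.?]/i

# Made-up sample text containing a quoted exclamation and an abbreviation.
text = %q{My friend said "John isn't here!" and then he left. He moved to the U.S. last year. Really?}

# scan returns every non-overlapping match; strip removes the space
# left at the start of each sentence after the first.
sentences = text.scan(SENTENCE).map(&:strip)
puts sentences
```

The "!" inside the quotation and the periods in "U.S." should not end a sentence here, so the text comes back as three sentences.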
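How it works: at each position in the string, the regex tries the following alternatives in order:
a quoted string such as "hell\"o"
an abbreviation such as U.S.
any other character that is not one of .?!
Answer 1: (score: 1)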
This is a partly regex-based solution that ignores sentence terminators contained between double quotes.
Code
def extract_sentences(str, da_terminators)
start_with_quote = (str[0] == '"')
str.split(/(\".*?\")/)
.flat_map.with_index { |b,i|
(start_with_quote == i.even?) ? b : b.split(/([#{da_terminators}])/) }
.slice_after(/^[#{da_terminators}]$/)
.map { |sb| sb.join.strip }
end
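Example
str = "My friend said \"John isn't here!\", then \"I'm outta' here\" and then he left. Let's go! Later, he said \"Aren't you coming?\""
puts extract_sentences(str, '!?.')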
# My friend said "John isn't here!", then "I'm outta' here" and then he left.
# Let's go!
# Later, he said "Aren't you coming?"
Explanation
For str above and
da_terminators = '!?.'
we will need the following later:
start_with_quote = (str[0] == '"')
#=> false
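Split the string on "...". We need to make \".*?\" a capture group so that the matched text is kept in the array that split returns. The result is an array, blocks, that alternates between strings surrounded by double quotes and other strings. start_with_quote tells us which is which.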
blocks = str.split(/(\".*?\")/)
#=> ["My friend said ",
# "\"John isn't here!\"",
# ", then ",
# "\"I'm outta' here\"",
# " and then he left. Let's go! Later, he said ",
# "\"Aren't you coming?\""]
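The effect of the capture group can be checked in isolation. Here is a small sketch with a made-up string (sample, without_group and with_group are illustrative names only):

```ruby
sample = 'He said "Go!" and left.'

# Without a capture group, the quoted text acts as a plain delimiter
# and is dropped from the result.
without_group = sample.split(/".*?"/)
#=> ["He said ", " and left."]

# With a capture group, the text matched by the group is kept as its
# own element between the surrounding pieces.
with_group = sample.split(/(".*?")/)
#=> ["He said ", "\"Go!\"", " and left."]
```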
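Split the string elements that are not surrounded by double quotes. The split is on any sentence-terminating character. Again, the pattern must be in a capture group so that the separator is kept.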
new_blocks = blocks.flat_map.with_index { |b,i|
(start_with_quote == i.even?) ? b : b.split(/([#{da_terminators}])/) }
#=> ["My friend said ",
# "\"John isn't here!\"",
# ", then ",
# "\"I'm outta' here\"",
# " and then he left",
# ".",
# " Let's go",
# "!",
# " Later, he said ",
# "\"Aren't you coming?\""]
sentence_blocks_enum = new_blocks.slice_after(/^[#{da_terminators}]$/)
#=> #<Enumerator:0x007f9a3b853478>
Convert this enumerator to an array to see the elements it will pass to its block:
sentence_blocks_enum.to_a
#=> [["My friend said ",
# "\"John isn't here!\"",
# ", then ",
# "\"I'm outta' here\"",
# " and then he left", "."],
# [" Let's go", "!"],
# [" Later, he said ", "\"Aren't you coming?\""]]
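Enumerable#slice_after closes a slice right after each element that matches the pattern, and any leftover elements form a final slice. Here is a standalone sketch with made-up tokens (tokens and slices are illustrative names):

```ruby
tokens = ["Hello", ".", " Bye", "!", " trailing"]

# A slice ends after every element that is exactly one terminator character.
slices = tokens.slice_after(/^[.!?]$/).to_a
#=> [["Hello", "."], [" Bye", "!"], [" trailing"]]
```

This is why a final sentence with no terminator after it, such as the one ending in a closing quote above, still comes out as its own slice.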
Join the blocks of each sentence, strip the surrounding whitespace, and return the array:
sentence_blocks_enum.map { |sb| sb.join.strip }
#=> ["My friend said \"John isn't here!\", then \"I'm outta' here\" and then he left.",
# "Let's go!",
# "Later, he said \"Aren't you coming?\""]
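One case worth checking is a string that begins with a double quote. String#split keeps a leading empty field, so the quoted pieces appear to stay at odd indices whether or not the string starts with a quote. The sketch below is a variant under that assumption, not the answer's own code; extract_sentences_odd and its parameter names are hypothetical.

```ruby
# Hypothetical variant of extract_sentences: treat every odd-indexed
# piece as quoted text, since split keeps a leading empty field when
# the string starts with a quote.
def extract_sentences_odd(str, terminators)
  str.split(/(".*?")/)
     .flat_map.with_index { |block, i|
       # Quoted pieces pass through untouched; other pieces are split
       # on the terminators, which are kept via the capture group.
       i.odd? ? block : block.split(/([#{terminators}])/)
     }
     .slice_after(/^[#{terminators}]$/)
     .map { |parts| parts.join.strip }
end

sentences = extract_sentences_odd('"Stop!" he said. Then he left.', '!?.')
puts sentences
```

Like the original, this only splits after a terminator that sits outside quotes, so a sentence that ends on a closing quote is merged with whatever follows it.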