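I'm trying to produce a human-readable, wiki-style diff between two texts loaded from HTML. I'm using diff-lcs, and the first step is to split a string (an array of characters) into an array of sentences while keeping their punctuation.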
"I am a lion. Hear me roar! Where is my cub? Never mind, found him.".magic_split(/[.?!]/)
# => "I am a lion." "Hear me roar!" "Where is my cub?" "Never mind, found him."
This should do the trick:
"I am a lion. Hear me roar! Where is my cub? Never mind, found him.".gsub(/[.?!]/, '\1|').split('|')
Except that gsub doesn't seem to be able to re-insert the matched character (., ? or !). Instead, it returns this:
"I am a lion| Hear me roar| Where is my cub| Never mind, found him|"
What is the simplest way to do a non-destructive split, that is, one that keeps the characters it splits on?
Answer 0: (score: 13)
scan should do it (throw a strip in there to get rid of the whitespace left between sentences).
s = "I am a lion. Hear me roar! Where is my cub? Never mind, found him."
s.scan(/[^\.!?]+[\.!?]/).map(&:strip) # => ["I am a lion.", "Hear me roar!", "Where is my cub?", "Never mind, found him."]
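As a possible alternative to scan, split can also keep the punctuation if the pattern only looks behind for it instead of consuming it. The snippet below is a minimal sketch, not part of the original answer; it assumes sentences are separated by at least one whitespace character, and the variable names are made up for illustration.

```ruby
text = "I am a lion. Hear me roar! Where is my cub? Never mind, found him."

# (?<=[.?!]) is a zero-width lookbehind: it requires the preceding character
# to be ., ? or ! but does not consume it, so only the whitespace is removed.
sentences_by_split = text.split(/(?<=[.?!])\s+/)

puts sentences_by_split.inspect
```

One difference worth noting: the scan pattern above only returns pieces that end in a terminator, so any text after the last ., ? or ! is dropped, whereas this split keeps it as a final element.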
Answer 1: (score: 3)
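I think it should be \0. In a replacement string, \1 refers to the first capture group, and the pattern has no parentheses, so it expands to nothing; \0 refers to the whole match.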
>> string = "I am a lion. Hear me roar! Where is my cub? Never mind, found him."
>> string.gsub(/[.?!]/, '\0|')
# "I am a lion.| Hear me roar!| Where is my cub?| Never mind, found him.|"
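To get from that string to the array the question asks for, the marked-up result still has to be split and trimmed. The following is a rough sketch of the complete round trip; it assumes the text never contains a literal | (the marker chosen in the question), and the variable names are only illustrative.

```ruby
text = "I am a lion. Hear me roar! Where is my cub? Never mind, found him."

# '\0' re-inserts the whole match before the marker.
# Using '\1' would need a capture group in the pattern, e.g. /([.?!])/.
marked = text.gsub(/[.?!]/, '\0|')

# split drops the trailing empty piece; strip removes the leading spaces.
sentences = marked.split('|').map(&:strip)

puts sentences.inspect
```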