Question

Following a tutorial in a book, I am using the following code to split text into sentences:

def sentences
    gsub(/\n|\r/, ' ').split(/\.\s*/)
end

It works, but if there is a new line that begins without a period preceding it, for example,

Hello. two line sentence
and heres the new line

it places a "\t" at the beginning of each sentence. So if I call the method on the text above, I get

["Hello." "two line sentence /tand heres the new line"]

Any help would be greatly appreciated! Thanks!

Answer 1

Splitting text into sentences is best done with Stanford CoreNLP. The example method in the question would also split on any acronym or name prefix, such as "Mr.".
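
To illustrate the abbreviation problem, here is a minimal sketch that wraps the question's one-liner in a standalone method and runs it on text containing "Mr.". The method name `naive_sentences` and the sample text are made up for this illustration; they are not from the book.

```ruby
# The question's approach as a standalone method.
# `naive_sentences` is a hypothetical name used only for this sketch.
def naive_sentences(text)
  # Replace line breaks with spaces, then split on a period plus any whitespace.
  text.gsub(/\n|\r/, ' ').split(/\.\s*/)
end

p naive_sentences("Mr. Smith arrived. He sat down.")
# => ["Mr", "Smith arrived", "He sat down"]
```

The period after "Mr" is treated as a sentence boundary, so the name prefix ends up as a "sentence" of its own.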

The stanford-core-nlp Ruby gem provides a Ruby interface. See the instructions for installing the gem and Stanford CoreNLP in this answer; after that you can write code like the following:

require "stanford-core-nlp"

StanfordCoreNLP.use :english
StanfordCoreNLP.model_files = {}
StanfordCoreNLP.default_jars = [
  'joda-time.jar',
  'xom.jar',
  'stanford-corenlp-3.5.0.jar',
  'stanford-corenlp-3.5.0-models.jar',
  'jollyday.jar',
  'bridge.jar'
]

pipeline = StanfordCoreNLP.load(:tokenize, :ssplit)

text = 'Hello. two line sentence
and heres the new line'
text = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(text)
text.get(:sentences).each{|s| puts "sentence: " + s.to_s}

#output:
#sentence: Hello.
#sentence: two line sentence
#and heres the new line
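
If pulling in CoreNLP is too heavy for a small script, a plain-Ruby workaround for the tab problem might look like the sketch below. It is not the approach recommended above and it will still split on abbreviations such as "Mr."; the method name `simple_sentences` is hypothetical, and the input assumes the second line of the question's text starts with a tab.

```ruby
# A rough plain-Ruby alternative (hypothetical helper, not part of any gem).
def simple_sentences(text)
  # Collapse every run of whitespace (newlines, carriage returns, tabs)
  # into a single space, and trim the ends.
  normalized = text.gsub(/\s+/, ' ').strip
  # Split on whitespace that follows '.', '!' or '?'. The lookbehind keeps
  # the punctuation attached to the sentence it ends.
  normalized.split(/(?<=[.!?])\s+/)
end

p simple_sentences("Hello. two line sentence\n\tand heres the new line")
# => ["Hello.", "two line sentence and heres the new line"]
```

Matching `\s+` instead of only `\n` and `\r` is what removes the stray tab, because `\s` also covers tab characters.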
