Parsing individual sentences out of a paragraph

Asked: 2015-09-20 23:46:54

Tags: scala parsing nlp

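I'm trying to build a parser that can turn a paragraph into a list of sentences, but I'm running into a major problem. I'm using the Stanford parser to pull out the sentences intelligently, but the parser only stores the list of tokens rather than the sentence itself. This becomes very problematic if my client wants the text exactly as it appeared originally (including any spacing that was there before).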

Does anyone have any suggestions on how to solve this?

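import java.io.StringReader
import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.Sentence
import edu.stanford.nlp.process.DocumentPreprocessor

def prepSentenceStrings(text: String): List[String] = {
  val mod = text.replace("Sr.", "Sr") // deals with an edge case
  val doc = new DocumentPreprocessor(new StringReader(mod))
  // Each element is one sentence as a list of tokens; join them, then tidy the spacing
  doc.asScala.map(x => reconfigureSentence(Sentence.listToString(x))).toList
}

// Remove the spaces that joining tokens inserts around punctuation
def reconfigureSentence(text: String): String = {
  text.replace(" .", ".").replace(" ,", ",").replace(" !", "!").replace("( ", "(").replace("< ", "<").replace(" )", ")")
}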

2 Answers:

Answer 0: (score: 1)

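The problem with using Stanford NLP for sentence splitting is that it first tokenizes the entire paragraph and removes all whitespace characters in the process. As far as I know there is no way to reconstruct them, so there is always a risk that you end up with a slightly altered sentence.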

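Do you have to use Scala for this task? There are already good sentence-splitting solutions, such as the Sentence Segmentation Tool, which is implemented in Perl. I have used this tool several times and was very happy with the output. Perhaps you could call it from your Scala program and then process the results?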

Here you can find an overview of different sentence segmenters and how they work.
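To make the whitespace problem concrete, here is a minimal sketch in plain Scala. It does not use Stanford NLP: the whitespace split below is only a stand-in assumption for a real tokenizer, and the names `WhitespaceLoss`, `tokenize` and `rejoin` are made up for illustration.

```scala
object WhitespaceLoss {
  // Stand-in for a real tokenizer (an assumption for illustration only):
  // split on runs of whitespace, discarding the whitespace itself.
  def tokenize(text: String): List[String] =
    text.trim.split("\\s+").toList

  // Rejoin the tokens with a single space, which is all a token list allows.
  def rejoin(tokens: List[String]): String = tokens.mkString(" ")

  def main(args: Array[String]): Unit = {
    val original = "turn a          paragraph into sentences"
    val rebuilt = rejoin(tokenize(original))
    println(rebuilt == original) // false: the run of spaces was collapsed
  }
}
```

Once only the tokens are kept, the information about how many spaces separated them is gone, so any post-hoc cleanup can only guess at the original spacing.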

Answer 1: (score: 0)

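You can use the Epic library's SentenceSegmenter, which also comes with a main method for convenience. Otherwise, it just takes a String and returns an IndexedSeq[String], one per sentence. Whitespace is preserved. If you want char offsets, you can look at Epic's Slab data structure, which can be used for that purpose.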

fukaeri:epic dlwh (master)$ java -Xmx8g -cp target/scala-2.11/epic-assembly-0.4-SNAPSHOT.jar epic.preprocess.SegmentSentences
fukaeri:epic dlwh (master)$ vi qq.txt
fukaeri:epic dlwh (master)$ cat qq.txt
I'm trying to build a parser that can turn a          paragraph into a list of sentences, but I'm running into a major problem. So I'm using the stanford parser to pull out the sentences intelligently, but the issue is that the parser only stores the list of tokens, rather than the sentence itself. This can become very problematic if my client wants the text EXACTLY as it showed up before (including any spacing that was there before.
fukaeri:epic dlwh (master)$ java -Xmx8g -cp target/scala-2.11/epic-assembly-0.4-SNAPSHOT.jar epic.preprocess.SegmentSentences < qq.txt
I'm trying to build a parser that can turn a          paragraph into a list of sentences, but I'm running into a major problem.
So I'm using the stanford parser to pull out the sentences intelligently, but the issue is that the parser only stores the list of tokens, rather than the sentence itself.
This can become very problematic if my client wants the text EXACTLY as it showed up before (including any spacing that was there before.
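The general idea behind char offsets can be sketched without Epic or Stanford NLP: find sentence boundaries as positions in the original string, then slice the original string rather than rebuilding sentences from tokens. The sketch below swaps in the JDK's java.text.BreakIterator as the boundary detector, so it is not how Epic does it, and its segmentation is cruder. The names `OffsetSentences`, `sentenceSpans` and `sentences` are hypothetical.

```scala
import java.text.BreakIterator
import java.util.Locale

object OffsetSentences {
  // (begin, end) character offsets of each sentence in `text`.
  // Uses the JDK sentence BreakIterator, an assumption made for this sketch.
  def sentenceSpans(text: String): List[(Int, Int)] = {
    val it = BreakIterator.getSentenceInstance(Locale.US)
    it.setText(text)
    val spans = scala.collection.mutable.ListBuffer.empty[(Int, Int)]
    var begin = it.first()
    var end = it.next()
    while (end != BreakIterator.DONE) {
      spans += ((begin, end))
      begin = end
      end = it.next()
    }
    spans.toList
  }

  // Slice the untouched original string, so internal spacing is preserved.
  def sentences(text: String): List[String] =
    sentenceSpans(text).map { case (b, e) => text.substring(b, e) }
}
```

Because the spans are contiguous and cover the whole input, each slice keeps its own trailing whitespace and concatenating the slices gives back the input unchanged. With a better segmenter, the same slicing step would apply as long as it reports offsets into the original text.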