Why is Java's BreakIterator adding extra commas to my text?

Asked: 2014-05-22 15:09:08

Tags: java text nlp locale

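I am using Java's BreakIterator class to break paragraphs of text in various languages into sentences. It works well, but for some reason it adds commas to the text where there were none before.

It looks like it adds:

, ,

to the text wherever there was a paragraph break in the original. It also adds commas before other commas for some reason.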

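Here is an example of the kind of result I'm getting:

First of all though, I've got to get up,, my train leaves at five
", , And he looked over at the alarm clock, ticking on the chest of, drawers.
"God in Heaven!"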

The text should look more like this:

First of all though, I've got to get up, my train leaves at five.
And he looked over at the alarm clock, ticking on the chest of drawers.
"God in Heaven!" he thought.

Here is the original paragraph:

First of all though, I've got to get up,
my train leaves at five."

And he looked over at the alarm clock, ticking on the chest of
drawers.  "God in Heaven!" he thought.

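I'm getting most of what I need done, but after breaking the text into sentences I still have to go back and manually edit out all the extra commas.

As you might imagine, searching for "java breakiterator extra commas" hasn't given me many useful results.

Here is the function I'm using to perform the sentence detection.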

public ArrayList<String> tokenize(String text, Locale locale)
{
    ArrayList<String> sentences = new ArrayList<String>();
    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(locale);
    sentenceIterator.setText(text);
    int boundary = sentenceIterator.first();
    int lastBoundary = 0;
    while (boundary != BreakIterator.DONE)
    {
        boundary = sentenceIterator.next();
        if(boundary != BreakIterator.DONE)
        {
            sentences.add(text.substring(lastBoundary, boundary));
        }
        lastBoundary = boundary;
    }
    return sentences;
}
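For reference, the same loop can be written more compactly by tracking a start/end pair, which is the pattern the BreakIterator documentation uses. This is only a sketch: the class name SentenceSplitter and the sample string are made up for illustration. Each sentence is a substring between two consecutive boundaries, so gluing the pieces back together gives exactly the input text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original program.
class SentenceSplitter {

    // Splits text into sentences using the JDK's locale-aware sentence iterator.
    static List<String> split(String text, Locale locale) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int start = it.first();
        // Walk consecutive boundaries; every [start, end) range is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "I've got to get up, my train leaves at five. "
                + "And he looked over at the alarm clock.";
        for (String sentence : split(text, Locale.US)) {
            // trim() only removes the trailing whitespace kept with each sentence.
            System.out.println(sentence.trim());
        }
    }
}
```

Since the pieces are plain substrings, any character that shows up in the output must already have been in the string passed to the splitter.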

Here is the part of the code that reads the file into memory and feeds it to my sentence splitter:

FileHelper fileHelper = new FileHelper();
TextTokenizer textTokenizer = new TextTokenizer();
Constants constants = new Constants();


ArrayList<String> enMetamorph = fileHelper.readFileToMemory(
        constants.books("metamorphosis_en.txt"));

ArrayList<String> enTokenMetamorph = textTokenizer.tokenize(
        enMetamorph.toString(),Locale.US);

fileHelper.writeFile(enTokenMetamorph,constants.tokenized(
        "metamorphosis_en.txt"));

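The text I'm using is Franz Kafka's The Metamorphosis. A free UTF-8 plain-text version is available from Project Gutenberg. The constants object is only used to create file paths. Inside the books function I use a function called makeFilePath, which locates the books directory regardless of which computer the program is running on. The function is as follows: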

public static String makeFilePath(String addition)
{
    String filePath = new File("").getAbsolutePath();
    filePath = filePath+addition;
    return filePath;
}
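As a side note, the string concatenation above only works if addition already starts with a path separator. A possible variant, assuming Java 7 or later, builds the path with java.nio.file instead. The class name FilePathDemo and the varargs signature are illustrative choices, not part of my program.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical variant of makeFilePath; names are illustrative only.
class FilePathDemo {

    // Resolves relative path segments against the current working directory.
    // Pass segments without a leading separator, e.g. ("books", "file.txt").
    static String makeFilePath(String... segments) {
        Path result = Paths.get("").toAbsolutePath();
        for (String segment : segments) {
            // resolve() inserts the platform's separator between segments.
            result = result.resolve(segment);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(makeFilePath("books", "metamorphosis_en.txt"));
    }
}
```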

Does anyone know why I'm getting all these extra commas in my text?

1 Answer:

Answer 0: (score: 0)

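The problem is not Java's BreakIterator class. The problem is how Java converts a list of strings to a single string: calling toString() on an ArrayList returns the elements wrapped in square brackets and separated by ", ". The file was evidently read one line per list element, so every line break became a comma and every blank line became ", ,".

Here is the line that caused the problem: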

    ArrayList<String> enTokenMetamorph = textTokenizer.tokenize(enMetamorph.toString(),Locale.US);
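To see the effect in isolation, here is a small sketch. It assumes readFileToMemory returns one list element per line of the file, which is a guess based on the symptoms; the class name ListToStringDemo and the hard-coded lines are stand-ins for the real file contents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; the lines mimic the paragraph from the question.
class ListToStringDemo {

    // Stand-in for what readFileToMemory presumably returns: one entry per line,
    // with an empty string for the blank line between paragraphs.
    static List<String> sampleLines() {
        return new ArrayList<>(Arrays.asList(
                "First of all though, I've got to get up,",
                "my train leaves at five.\"",
                "",
                "And he looked over at the alarm clock, ticking on the chest of",
                "drawers.  \"God in Heaven!\" he thought."));
    }

    public static void main(String[] args) {
        // The collection's toString() wraps the elements in [ ] and puts ", "
        // between them, so line breaks show up as commas in the result.
        System.out.println(sampleLines().toString());
    }
}
```

The printed string contains "get up,, my train", "five.", , And he looked" and "chest of, drawers.", which are the same stray commas that appear in the tokenized output.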

I ended up writing my own toString function, which I'm now using. It is posted below:

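public String toString(List<String> strings)
{
    StringBuilder sb = new StringBuilder();

    // Put a single space before each line instead of the ", "
    // that the list's built-in toString() inserts.
    for (String s : strings)
    {
        sb.append(' ').append(s);
    }

    return sb.toString();
}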

The line of code now looks like this:

ArrayList<String> enTokenMetamorph = textTokenizer.tokenize(textTokenizer.toString(enMetamorph),Locale.US);

This solved the problem. The output now looks like this:

First of all though, I've got to get up, my train leaves at five.
And he looked over at the alarm clock, ticking on the chest of drawers.
"God in Heaven!"
he thought.

instead of this:

First of all though, I've got to get up,, my train leaves at five
.
", , And he looked over at the alarm clock, ticking on the chest of, drawers.
"God in Heaven!"
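If you are on Java 8 or later, String.join should do the same job as the hand-written helper without the leading space. This is a sketch rather than the code I actually ran; the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical alternative to the custom toString helper (requires Java 8+).
class JoinLinesDemo {

    // Joins the lines with a single space and no surrounding brackets.
    static String joinLines(List<String> lines) {
        return String.join(" ", lines);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "First of all though, I've got to get up,",
                "my train leaves at five.\"");
        System.out.println(joinLines(lines));
    }
}
```

The call in the main program would then pass joinLines(enMetamorph) to tokenize instead of enMetamorph.toString().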