Question

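I am new to Java programming. I want to split the paragraphs in a file into sentences and write them to a different file. There should also be a mechanism to identify which sentence came from which paragraph. The code I have used so far is shown below. But this code breaks: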

Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.

into

Former Secretary of Finance Dr.
P.B.
Jayasundera is being questioned by the police Financial Crime Investigation Division.

How can I correct this? Thanks in advance.

import java.io.*;

class trial4 {
    public static void main(String args[]) throws IOException {
        FileReader fr = new FileReader("input.txt");
        BufferedReader br = new BufferedReader(fr);
        String s;
        OutputStream out = new FileOutputStream("output10.txt");
        String token[];

        while ((s = br.readLine()) != null) {
            token = s.split("(?<=[.!?])\\s* ");
            for (int i = 0; i < token.length; i++) {
                byte buf[] = token[i].getBytes();
                for (int j = 0; j < buf.length; j = j + 1) {
                    out.write(buf[j]);
                    if (j == buf.length - 1)
                        out.write('\n');
                }
            }
        }
        fr.close();
    }
}

I have gone through all the similar questions posted on Stack Overflow, but those answers did not help me solve this problem.

Answer 1

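How about using a negative lookbehind combined with a replace? Simply put: replace every sentence ending that does not have a "special case" in front of it with that sentence ending followed by a newline.

A list of "known abbreviations" will be needed. There is no guarantee about how long these can be, or about how short a word at the end of a sentence might be. (See? Even an ordinary word can be very short!)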

import java.io.*;

class trial4 {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes both files even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"));
             PrintStream out = new PrintStream(new FileOutputStream("output10.txt"))) {

            String s;
            while ((s = br.readLine()) != null) {
                out.print(        // print, not println: the replacement itself adds the newlines
                    s.replaceAll("(?i)"             // Make the match case insensitive
                          + "(?<!"                  // Negative lookbehind
                          +   "(\\W\\w)|"           // Single non-word followed by word character (P.B.)
                          +   "(\\W\\d{1,2})|"      // One or two digits (dates!)
                          +   "(\\W(dr|mr|mrs|ms))" // List of known abbreviations
                          + ")"                     // End of lookbehind
                          + "([!?\\.])"             // Match end of sentence (capture group 5)
                          , "$5"                    // Replace with the end-of-sentence character found
                          + System.lineSeparator())); // Add a newline after it
            }
        }
    }
}
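
The question also asks for a way to tell which sentence came from which paragraph, which the code above does not cover. Here is a rough sketch of one way to do it, reusing the same negative-lookbehind idea with split instead of replaceAll. It assumes each line of the input is one paragraph (as the question's readLine() loop does); the class name, method names, and the "paragraph.sentence" label format are made up for illustration, and the abbreviation list is just as incomplete as before.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ParagraphSentenceSketch {
    // Split at whitespace that follows '.', '!' or '?', unless the punctuation
    // closes a single character (P. B.), a 1-2 digit number, or a listed abbreviation.
    private static final Pattern SENTENCE_GAP = Pattern.compile(
            "(?i)"                                           // case-insensitive abbreviations
          + "(?<=[.!?])"                                     // directly after sentence punctuation
          + "(?<!\\b(?:\\w|\\d{1,2}|dr|mr|mrs|ms)[.!?])"     // but not after a special case
          + "\\s+");                                         // the whitespace itself is dropped

    // Returns the sentences of one paragraph, in order.
    static List<String> splitSentences(String paragraph) {
        List<String> sentences = new ArrayList<>();
        for (String part : SENTENCE_GAP.split(paragraph.trim())) {
            if (!part.isEmpty()) {
                sentences.add(part);
            }
        }
        return sentences;
    }

    // Prefixes every sentence with "paragraph.sentence" (both counted from 1) and a tab.
    static List<String> tagSentences(List<String> paragraphs) {
        List<String> tagged = new ArrayList<>();
        for (int p = 0; p < paragraphs.size(); p++) {
            List<String> sentences = splitSentences(paragraphs.get(p));
            for (int s = 0; s < sentences.size(); s++) {
                tagged.add((p + 1) + "." + (s + 1) + "\t" + sentences.get(s));
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList(
            "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.",
            "It rained. We stayed in.");
        for (String line : tagSentences(paragraphs)) {
            System.out.println(line);
        }
    }
}
```

To use it with files, feed each line returned by readLine() into the paragraph list and write the tagged lines to the output file. Note that a sentence that really does end in a single letter or a short number ("plan B.") will not be split, which is the price of treating initials as special cases.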

Answer 2

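As mentioned in the comments, splitting text into sentences without formalizing the requirements is rather difficult. Have a look at BreakIterator, especially the sentence instance (getSentenceInstance). You could roll your own BreakIterator, since it works much like a regexp, only more abstract. Or try to find a third-party solution, such as http://deeplearning4j.org/sentenceiterator.html, which can be trained to tokenize your input.

BreakIterator example: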

import java.text.BreakIterator;
import java.util.Locale;

class BreakIteratorExample {
    public static void main(String[] args) {
        String str = "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.";

        BreakIterator bilus = BreakIterator.getSentenceInstance(Locale.US);
        bilus.setText(str);

        int last  = bilus.first();
        int count = 0;

        // Walk the boundaries; each pair (first, last) delimits one sentence
        while (BreakIterator.DONE != last)
        {
            int first = last;
            last = bilus.next();

            if (BreakIterator.DONE != last)
            {
                String sentence = str.substring(first, last);
                System.out.println("Sentence:" + sentence);
                count++;
            }
        }
        System.out.println(count + " sentences found.");
    }
}
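
If the stock sentence instance still breaks after titles or initials on your input, one possible workaround (short of writing a full custom BreakIterator) is to post-filter its boundaries and ignore any break that directly follows a known abbreviation. The following is only a sketch under that assumption: the class name, the helper names, and the abbreviation list are hypothetical, and it only looks at the last space-delimited token before each proposed break.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class MergingSentenceSplitter {
    // Hypothetical abbreviation list; extend it for real input.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("dr.", "mr.", "mrs.", "ms."));

    // True when the last space-delimited token is a listed abbreviation
    // or a run of initials such as "P.B.".
    static boolean endsWithAbbreviation(String text) {
        String trimmed = text.trim();
        String lastToken = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastToken.toLowerCase(Locale.ROOT))
                || lastToken.matches("(?:\\p{L}\\.)+");
    }

    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            String candidate = text.substring(start, end);
            // Skip a proposed break that follows an abbreviation, unless it is
            // the end of the text; the candidate then keeps growing.
            if (end < text.length() && endsWithAbbreviation(candidate)) {
                continue;
            }
            if (!candidate.trim().isEmpty()) {
                sentences.add(candidate.trim());
            }
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        String str = "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned "
                   + "by the police Financial Crime Investigation Division. He arrived today.";
        for (String sentence : split(str)) {
            System.out.println("Sentence:" + sentence);
        }
    }
}
```

The trade-off is the same as with the regexp approach: a sentence that genuinely ends in "Dr." or in a single initial gets glued to the next one, so the list has to be tuned to the text at hand.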

Splitting paragraphs into sentences - a special case
