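I am new to Java programming. I want to split the paragraphs in a file into sentences and write them to a different file. There should also be a mechanism to determine which sentence came from which paragraph. The code I have used so far is shown below. However, this code breaks:
Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.
into
Former Secretary of Finance Dr.
P.B.
Jayasundera is being questioned by the police Financial Crime Investigation Division.
How can I fix this? Thanks in advance.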
import java.io.*;

class trial4 {
    public static void main(String args[]) throws IOException {
        FileReader fr = new FileReader("input.txt");
        BufferedReader br = new BufferedReader(fr);
        String s;
        OutputStream out = new FileOutputStream("output10.txt");
        String token[];
        // Each line of input.txt is read as one paragraph
        while ((s = br.readLine()) != null) {
            // Splits after every '.', '!' or '?' followed by a space,
            // which is why "Dr." and "P.B." end up on their own lines
            token = s.split("(?<=[.!?])\\s* ");
            for (int i = 0; i < token.length; i++) {
                byte buf[] = token[i].getBytes();
                for (int j = 0; j < buf.length; j = j + 1) {
                    out.write(buf[j]);
                    if (j == buf.length - 1)
                        out.write('\n');
                }
            }
        }
        fr.close();
        out.close();
    }
}
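I have looked at all the similar questions posted on Stack Overflow, but those answers did not help me solve this problem.
Answer 0 (score: 0)
How about using a negative lookbehind in combination with a replace? Put simply: replace every sentence ending that has no "special case" in front of it with that sentence ending followed by a newline.
A list of "known abbreviations" will be needed. There is no guarantee how long those can be, or how short a word at the end of a sentence might be. (See? That sentence already ended on a very short word!)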
import java.io.*;

class trial4 {
    public static void main(String args[]) throws IOException {
        // try-with-resources closes both files when the loop is done
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"));
             PrintStream out = new PrintStream(new FileOutputStream("output10.txt"))) {
            String s = br.readLine();
            while (s != null) {
                out.println(                        //Newline after each input line in any case
                    s.replaceAll("(?i)"             //Make the match case insensitive
                        + "(?<!"                    //Negative lookbehind
                        + "(\\W\\w)|"               //Single non-word followed by word character (P.B.)
                        + "(\\W\\d{1,2})|"          //One or two digits (dates!)
                        + "(\\W(dr|mr|mrs|ms))"     //List of known abbreviations
                        + ")"                       //End of lookbehind
                        + "([!?\\.])"               //Match end-of-sentence
                        , "$5"                      //Replace with end-of-sentence found (group 5)
                        + System.lineSeparator())); //Add newline if found
                s = br.readLine();
            }
        }
    }
}
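The question also asks for a way to tell which paragraph each sentence came from. The following is only a minimal sketch of one way to do that, assuming each input line is one paragraph (as in the question's readLine loop). The class name, the `tag` method, the `paragraph.sentence` numbering format, and the short abbreviation list are all hypothetical. It splits on the whitespace between sentences rather than replacing, and it uses `\b` instead of `\W` in the lookbehind so that an initial at the very start of a line is protected too; that is a variation on the regex above, not the same expression.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ParagraphSentenceSplitter {
    // Hypothetical, deliberately short list; real input will need more entries.
    private static final String ABBREVIATIONS = "dr|mr|mrs|ms";

    // Matches the whitespace after '.', '!' or '?', unless that mark directly
    // follows a single word character, one or two digits, or a known abbreviation.
    private static final Pattern SENTENCE_GAP = Pattern.compile(
            "(?i)(?<=[.!?])"
          + "(?<!\\b(?:\\w|\\d{1,2}|" + ABBREVIATIONS + ")[.!?])"
          + "\\s+");

    // Returns one entry per sentence, prefixed with "paragraphNo.sentenceNo" and a tab.
    public static List<String> tag(List<String> paragraphs) {
        List<String> tagged = new ArrayList<>();
        int paragraphNo = 0;
        for (String paragraph : paragraphs) {
            String text = paragraph.trim();
            if (text.isEmpty()) {
                continue; // blank lines are not counted as paragraphs
            }
            paragraphNo++;
            String[] sentences = SENTENCE_GAP.split(text);
            for (int i = 0; i < sentences.length; i++) {
                tagged.add(paragraphNo + "." + (i + 1) + "\t" + sentences[i]);
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList(
                "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned "
              + "by the police Financial Crime Investigation Division.",
                "It rained. We stayed home! Why not?");
        for (String line : tag(paragraphs)) {
            System.out.println(line);
        }
    }
}
```

To work on files instead of the in-memory sample, the paragraphs could be read with `Files.readAllLines(Paths.get("input.txt"))` and the result written with `Files.write(Paths.get("output10.txt"), tagged)`. Like any abbreviation-based heuristic, this will still mis-handle sentences that genuinely end in a single letter or a short number.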
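Answer 1 (score: 0)
As mentioned in the comments, splitting paragraphs into sentences is hard to do reliably without formalizing the requirements first. Take a look at BreakIterator, in particular its SentenceInstance. You could roll your own BreakIterator, since it breaks text much like a regexp does, except that it is more abstract. Or try to find a third-party solution such as http://deeplearning4j.org/sentenceiterator.html, which can be trained to tokenize your input.
BreakIterator example: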
import java.text.BreakIterator;
import java.util.Locale;

class BreakIteratorExample {
    public static void main(String[] args) {
        String str = "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.";
        BreakIterator bilus = BreakIterator.getSentenceInstance(Locale.US);
        bilus.setText(str);
        int last = bilus.first();
        int count = 0;
        // Walk the boundaries; each [first, last) range is one sentence
        while (BreakIterator.DONE != last) {
            int first = last;
            last = bilus.next();
            if (BreakIterator.DONE != last) {
                String sentence = str.substring(first, last);
                System.out.println("Sentence:" + sentence);
                count++;
            }
        }
        System.out.println("" + count + " sentences found.");
    }
}
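The stock sentence iterator knows nothing about titles or initials, so it may still place a boundary after "Dr." or "P.B.". One possible way to "roll your own" behaviour without writing a new BreakIterator is to post-process its output and glue a fragment onto the next one whenever it ends in a known abbreviation. The sketch below is an assumption-laden illustration of that idea: the class name, the `split` method, and the abbreviation list are hypothetical, and treating any single trailing letter as an initial is a heuristic that will wrongly merge sentences such as "We chose plan B."

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class MergingSentenceSplitter {
    // Hypothetical heuristic: text ending in a title or a single-letter initial
    // plus '.' is assumed to be an unfinished sentence.
    private static final Pattern ENDS_WITH_ABBREVIATION =
            Pattern.compile("(?i)\\b(?:dr|mr|mrs|ms|\\p{L})\\.\\s*$");

    public static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            pending.append(text, start, end);
            // Keep accumulating while the text so far ends in an abbreviation
            if (!ENDS_WITH_ABBREVIATION.matcher(pending).find()) {
                addIfNotBlank(sentences, pending.toString());
                pending.setLength(0);
            }
        }
        // The input may end right after an abbreviation; do not lose that tail
        addIfNotBlank(sentences, pending.toString());
        return sentences;
    }

    private static void addIfNotBlank(List<String> sentences, String candidate) {
        String trimmed = candidate.trim();
        if (!trimmed.isEmpty()) {
            sentences.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String str = "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned "
                   + "by the police Financial Crime Investigation Division.";
        List<String> sentences = split(str);
        for (String sentence : sentences) {
            System.out.println("Sentence:" + sentence);
        }
        System.out.println(sentences.size() + " sentences found.");
    }
}
```

Because the merge step only looks at how the accumulated text ends, it gives the same result whether or not the underlying iterator breaks after an abbreviation, which keeps it independent of the exact boundary rules of a given JDK.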