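Referring to this question: Score each sentence in a line based upon a tag and summarize the text. (Java), I am working on a summarizer in Java. With everything from that question now in place, one small problem remains. To restate what I am trying to do: from a given text file, I want to extract the individual sentences, score them on the basis of certain tags, and finally write the high-scoring sentences to a file containing the summary. Here is the relevant part of the code (many thanks to the answerer of the question above):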
ArrayList<Integer> scoreTracker = new ArrayList<Integer>();
Pattern tagFinder = Pattern.compile("/JJ|" + freqWords); // /JJ is the adjective tag; to add more tags, separate them with |, e.g. /JJ|/RB|/NN
while ((line = reader.readLine()) != null) { // scan the input file line by line
    BreakIterator bi = BreakIterator.getSentenceInstance(); // one line may contain many sentences, so use BreakIterator to split them
    bi.setText(line);
    int end, start = bi.first();
    while ((end = bi.next()) != BreakIterator.DONE) { // for every sentence in the line
        String sentence = line.substring(start, end); // store one sentence
        String tagged = tagger.tagString(sentence); // tag this sentence
        int score = 0; // score for the sentence
        Matcher tag = tagFinder.matcher(tagged); // use a Matcher to find the tags in the tagged sentence
        while (tag.find()) // for every match of a tag
            score++; // increment the score by one
        scoreTracker.add(score);
        if (score > 5) // for a score greater than 5, write the sentence, not the line, into the summary
            writerForTempFile.write(sentence);
        start = end; // set start = end to move on to the next sentence in the line
    }
}
System.out.println(scoreTracker);
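To make the scoring step easier to inspect on its own, here is a minimal sketch that runs the same kind of pattern over a hand-written tagged string, without the tagger. The class name `TagScoreSketch`, the value of `freqWords`, and the tagged sentence are all assumptions made for illustration; the real tagger output may use different tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagScoreSketch {
    // Counts how many times the pattern matches inside an already-tagged sentence.
    static int score(Pattern tagFinder, String tagged) {
        Matcher m = tagFinder.matcher(tagged);
        int score = 0;
        while (m.find()) {
            score++;
        }
        return score;
    }

    public static void main(String[] args) {
        // Assumption: freqWords is a list of frequent words joined by '|'.
        String freqWords = "sample|test";
        Pattern tagFinder = Pattern.compile("/JJ|" + freqWords);
        // Hand-written example in word/TAG form, not real tagger output.
        String tagged = "This/DT is/VBZ a/DT new/JJ sample/NN text/NN ./.";
        // One match for "/JJ" and one for "sample".
        System.out.println(score(tagFinder, tagged)); // prints 2
    }
}
```

Note that the pattern matches substrings, so a word from `freqWords` is also counted when it occurs inside a longer word.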
Now the problem: I have the following text file for testing:
This is a sample text.
This is a new line in the sample document.This next line is just to test adjacent sentences in the document.Because test runs suggest that immediate sentences are included in the final result due to new line delimiter usage and not sentence terminator usage.
Then we have a paragraph space.
Then there is this long line that has many words ,so it should be important. Should it?
That should be enough for testing.
test test test test test test.
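It contains 9 sentences, but the program finds only 7. It does score those. After that, my condition is to print the sentences with score > 5, but it also prints some sentences with score < 5. I added an ArrayList to keep track of each sentence's score. That is how I know that only 7 sentences are scored and that sentences scoring less than 5 are also printed.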
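To check the sentence count separately from the tagging, a minimal sketch along the following lines prints every piece that `BreakIterator` returns for each test line, so the boundaries can be compared with the sentences counted by eye. The class name `SentenceSplitCheck` and the helper `split` are hypothetical, and only three lines of the test file are included.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class SentenceSplitCheck {
    // Splits one line into the pieces that BreakIterator treats as sentences.
    static List<String> split(String line) {
        BreakIterator bi = BreakIterator.getSentenceInstance();
        bi.setText(line);
        List<String> sentences = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            sentences.add(line.substring(start, end).trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Lines copied from the test file, including the missing spaces after periods.
        String[] lines = {
            "This is a sample text.",
            "This is a new line in the sample document.This next line is just to test adjacent sentences in the document.Because test runs suggest that immediate sentences are included in the final result due to new line delimiter usage and not sentence terminator usage.",
            "Then there is this long line that has many words ,so it should be important. Should it?"
        };
        for (String line : lines) {
            List<String> sentences = split(line);
            System.out.println(sentences.size() + " piece(s):");
            for (String s : sentences) {
                System.out.println("  [" + s + "]");
            }
        }
    }
}
```

Printing the pieces in brackets makes it visible when several sentences from the file come back as a single piece.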
Please bear with me; after a lot of experimentation, I have not been able to find where I am going wrong.