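How to get a complete sentence using a regular expression in Java

Time: 2015-08-19 05:20:30

Tags: java regex

So far I am using PDFBox to parse PDFs; later I will parse other documents (.docx / .doc) as well. With PDFBox I get the entire file content into a single string. Now I want to get the complete sentence in which a user-defined word matches.

For example: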

... some text here..
Raman took more than 12 year to complete his schooling and now he
is pursuing higher study.
Relational Database. 
... some text here ..

If the user enters year, it should return the whole sentence.

Expected output:

Raman took more than 12 year to complete his schooling and now he
    is pursuing higher study.

I am trying the code below, but it does not display anything. Can anyone correct it?

 Pattern pattern = Pattern.compile("[\\w|\\W]*+[YEAR]+[\\w]*+.");

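Also, if I need to match any one of several words (an OR condition), what changes should I make to the regular expression?

Please note that all the words are in uppercase.

3 answers:

Answer 0 (score: 1)

Don't try to put everything into a single regular expression. The standard Java class java.text.BreakIterator can be used to find sentence boundaries.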

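import java.text.BreakIterator;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String getSentence(String input, String word) {
    // Treat the word as a literal string and ignore case.
    Matcher matcher = Pattern.compile(word, Pattern.LITERAL | Pattern.CASE_INSENSITIVE)
                             .matcher(input);
    if (matcher.find()) {
        BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        br.setText(input);
        // Nearest sentence boundaries before and after the first match.
        // Either call can return BreakIterator.DONE (-1) when the match
        // touches the very start or end of the input.
        int start = br.preceding(matcher.start());
        int end = br.following(matcher.end());
        return input.substring(start, end);
    }
    return null; // the word does not occur in the input
}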

Usage:

public static void main(String[] args) {
    String input = "... some text...\n Raman took more than 12 year to complete his schooling and now he\nis pursuing higher study. Relational Database. \n... some text...";
    System.out.println(getSentence(input, "YEAR"));
}
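For the second part of the question (matching any of several words), one possible approach is to quote each word and join them with `|`, then walk the text sentence by sentence. The following is only a sketch under some assumptions: the class name `SentenceFinder` and the method `findSentences` are hypothetical, whole-word matching via `\b` is a choice made here (drop the `\b` anchors to match substrings, as the `LITERAL` pattern above does), and every matching sentence is returned rather than only the first.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SentenceFinder {
    // Hypothetical helper: returns every sentence that contains at least one of the words.
    static List<String> findSentences(String input, List<String> words) {
        // Quote each word so regex metacharacters are literal, then OR them together.
        String alternation = words.stream()
                                  .map(Pattern::quote)
                                  .collect(Collectors.joining("|"));
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b",
                                          Pattern.CASE_INSENSITIVE);

        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(input);

        List<String> result = new ArrayList<>();
        int start = it.first();
        // Visit each [start, end) sentence span in order.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = input.substring(start, end);
            if (pattern.matcher(sentence).find()) {
                result.add(sentence.trim());
            }
        }
        return result;
    }
}
```

Calling `SentenceFinder.findSentences(input, Arrays.asList("YEAR", "DATABASE"))` would then collect the sentences containing either word. Because the loop only uses boundaries handed out by `first()` and `next()`, no offset arithmetic around the match position is needed.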

Answer 1 (score: 0)

// `result` is assumed to hold the text extracted from the document.
Pattern re = Pattern.compile(
        "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
        Pattern.MULTILINE | Pattern.COMMENTS);
Matcher reMatcher = re.matcher(result);

// Print every sentence found in the text.
while (reMatcher.find()) {
    System.out.println(reMatcher.group());
}
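The loop above prints every sentence, so a filtering step is still needed to keep only those containing the user's word. A minimal sketch of that step follows, assuming a plain case-insensitive substring check is good enough; `RegexSentenceFilter` and `sentencesContaining` are hypothetical names, and the sentence pattern is the one from this answer with the whitespace removed so that the `COMMENTS` flag is not required.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSentenceFilter {
    // Sentence pattern taken from the answer above, compacted onto one line.
    private static final Pattern SENTENCE = Pattern.compile(
            "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
            Pattern.MULTILINE);

    // Hypothetical helper: keeps only the sentences that contain `word`, ignoring case.
    static List<String> sentencesContaining(String text, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            String sentence = m.group();
            if (sentence.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(sentence);
            }
        }
        return hits;
    }
}
```

This regex treats any `.`, `!` or `?` followed by whitespace as the end of a sentence, so abbreviations such as "Dr. Smith" will likely be split in two; that is a limitation of the pattern, not of the filtering step.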

Answer 2 (score: 0)

A small fix to @Tagir Valeev's answer, to prevent an index-out-of-bounds error.

 private String getSentence(String input, String word) {
        Matcher matcher = Pattern.compile(word , Pattern.LITERAL | Pattern.CASE_INSENSITIVE)
                .matcher(input);
        if(matcher.find()) {
            BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
            br.setText(input);
            int start = br.preceding(matcher.start());
            int end = br.following(matcher.end());

            if(start == BreakIterator.DONE) {
                start = 0;
            }

            if(end == BreakIterator.DONE) {
                end = input.length();
            }

            return input.substring(start, end);
        }

        return null;
    }
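To see why the two `DONE` checks matter, the unguarded calls can be tried on an input where the matched word sits at the very start or end of the text. The sketch below is only an illustration; `BoundaryDemo` and `rawBoundaries` are hypothetical names. According to the `BreakIterator` Javadoc, `preceding` returns `DONE` when the offset equals the first boundary, and `following` returns `DONE` when the offset equals the last boundary.

```java
import java.text.BreakIterator;
import java.util.Locale;

class BoundaryDemo {
    // Returns { preceding(from), following(to) } with no guarding at all.
    static int[] rawBoundaries(String input, int from, int to) {
        BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        br.setText(input);
        return new int[] { br.preceding(from), br.following(to) };
    }

    public static void main(String[] args) {
        String input = "Year one was hard.";
        // Offset 0 is where a match on "Year" starts; input.length() is the end of the text.
        int[] b = rawBoundaries(input, 0, input.length());
        System.out.println("preceding is DONE: " + (b[0] == BreakIterator.DONE));
        System.out.println("following is DONE: " + (b[1] == BreakIterator.DONE));
    }
}
```

Since `BreakIterator.DONE` is -1, passing either value straight to `substring` throws `StringIndexOutOfBoundsException`, which is what the guarded version above avoids.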