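So far I have been using PDFBox to parse PDFs, and later I will parse other documents (.docx / .doc) as well. With PDFBox I get the entire file content in a single string. Now I want to get the complete sentence in which a user-defined word matches.
For example: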
... some text here..
Raman took more than 12 year to complete his schooling and now he
is pursuing higher study.
Relational Database.
... some text here ..
If the user provides the input year, it should return the whole sentence.
Expected output:
Raman took more than 12 year to complete his schooling and now he
is pursuing higher study.
I am trying the code below, but it does not match anything. Can anyone correct it?
Pattern pattern = Pattern.compile("[\\w|\\W]*+[YEAR]+[\\w]*+.");
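Also, if I have to include multiple words to match with an OR condition, what should I change in the regular expression?
Please note that all the words are in upper case.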
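Answer 0 (score: 1)
Don't try to put everything into a single regular expression. There is a standard Java class, java.text.BreakIterator, that can be used to find sentence boundaries.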
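import java.text.BreakIterator;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String getSentence(String input, String word) {
    // Treat the word as a literal and ignore case, so "YEAR" matches "year".
    Matcher matcher = Pattern.compile(word, Pattern.LITERAL | Pattern.CASE_INSENSITIVE)
            .matcher(input);
    if (matcher.find()) {
        BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        br.setText(input);
        // Note: both calls can return BreakIterator.DONE when the match
        // touches the very start or end of the input.
        int start = br.preceding(matcher.start());
        int end = br.following(matcher.end());
        return input.substring(start, end);
    }
    return null;
}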
Usage:
public static void main(String[] args) {
    String input = "... some text...\n Raman took more than 12 year to complete his schooling and now he\nis pursuing higher study. Relational Database. \n... some text...";
    System.out.println(getSentence(input, "YEAR"));
}
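The question also asks about matching several words with an OR condition, which the method above does not cover. A minimal sketch of one way to do it, assuming the words are non-empty and should be matched literally: quote each word with Pattern.quote and join them with |. The class name MultiWordSentenceFinder and the List-based signature are made up for illustration. The sketch also locates the sentence with following(...) and then previous(), so a match at the very start of a sentence does not pull in the preceding sentence.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class MultiWordSentenceFinder {

    // Hypothetical helper: returns the sentence containing the first match
    // of any of the given words, or null if none of them occurs.
    static String getSentence(String input, List<String> words) {
        if (words.isEmpty()) {
            return null;
        }
        // Quote every word so regex metacharacters are matched literally,
        // then join with '|' to express the OR condition.
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Matcher matcher = Pattern.compile(alternation, Pattern.CASE_INSENSITIVE)
                .matcher(input);
        if (!matcher.find()) {
            return null;
        }
        BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        br.setText(input);
        // First boundary strictly after the match start = end of its sentence.
        int end = br.following(matcher.start());
        if (end == BreakIterator.DONE) {
            return input; // defensive: only reachable for degenerate input
        }
        // Boundary just before that = start of the same sentence.
        int start = br.previous();
        if (start == BreakIterator.DONE) {
            start = 0;
        }
        return input.substring(start, end);
    }

    public static void main(String[] args) {
        String input = "First sentence here. Raman took more than 12 year "
                + "to complete his schooling. Relational Database.";
        System.out.println(getSentence(input, Arrays.asList("STUDY", "YEAR")));
    }
}
```

Where exactly the sentence boundaries fall depends on the BreakIterator rules for the chosen locale, so the result may include trailing whitespace that you may want to trim.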
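Answer 1 (score: 0)
// Splits the text into sentences; result is assumed to hold the extracted document text.
// Pattern.COMMENTS makes the regex engine ignore the whitespace inside the pattern.
Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$) [^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher reMatcher = re.matcher(result);
while (reMatcher.find()) {
    System.out.println(reMatcher.group());
}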
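For anyone who wants to stay with a regex, it may help to see why the pattern from the question never matches: [\w|\W]*+ is possessive and consumes the whole input without giving anything back, and [YEAR]+ is a character class (any of Y, E, A, R), not the word YEAR. Below is a rough sketch of a simpler pattern, under the assumption that a sentence is just a run of characters without ., ! or ?. The class and method names are invented for the example, and the approach will split incorrectly on abbreviations or decimal numbers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSentenceDemo {

    // The pattern from the question, unchanged, kept here for comparison.
    static final Pattern ORIGINAL =
            Pattern.compile("[\\w|\\W]*+[YEAR]+[\\w]*+.");

    // Hypothetical helper: returns the first run of text without '.', '!'
    // or '?' that contains the word, plus its closing punctuation, or null.
    static String findSentence(String text, String word) {
        Pattern p = Pattern.compile(
                "[^.!?]*\\b" + Pattern.quote(word) + "\\b[^.!?]*[.!?]?",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        return m.find() ? m.group().trim() : null;
    }

    public static void main(String[] args) {
        String text = "Intro text. Raman took more than 12 year to complete "
                + "his schooling and now he\nis pursuing higher study. "
                + "Relational Database.";
        // The possessive quantifier swallows everything, so find() fails.
        System.out.println(ORIGINAL.matcher(text).find());
        System.out.println(findSentence(text, "YEAR"));
    }
}
```

Because the negated character class also matches line breaks, a sentence wrapped across two lines is still returned as one piece.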
Answer 2 (score: 0)
A small fix to @Tagir Valeev's answer to prevent an index-out-of-bounds error.
private String getSentence(String input, String word) {
    Matcher matcher = Pattern.compile(word, Pattern.LITERAL | Pattern.CASE_INSENSITIVE)
            .matcher(input);
    if (matcher.find()) {
        BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        br.setText(input);
        int start = br.preceding(matcher.start());
        int end = br.following(matcher.end());
        if (start == BreakIterator.DONE) {
            start = 0;
        }
        if (end == BreakIterator.DONE) {
            end = input.length();
        }
        return input.substring(start, end);
    }
    return null;
}
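Both BreakIterator versions above return only the sentence around the first match. If every matching sentence is needed, one possible variation is to walk the sentence boundaries with first() and next() and keep each sentence that contains the word. This is only a sketch: the class name AllSentencesFinder is hypothetical, and the check is a plain case-insensitive substring test, so "year" would also match "years".

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AllSentencesFinder {

    // Hypothetical helper: returns every sentence that contains the word
    // (case-insensitive substring match), trimmed, in document order.
    static List<String> findSentences(String input, String word) {
        List<String> matches = new ArrayList<>();
        String needle = word.toLowerCase(Locale.ENGLISH);
        BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        br.setText(input);
        // Standard BreakIterator loop: each (start, end) pair is one sentence.
        int start = br.first();
        for (int end = br.next(); end != BreakIterator.DONE;
                start = end, end = br.next()) {
            String sentence = input.substring(start, end);
            if (sentence.toLowerCase(Locale.ENGLISH).contains(needle)) {
                matches.add(sentence.trim());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "I waited a year. Nothing happened. "
                + "Next YEAR will be better.";
        for (String sentence : findSentences(input, "year")) {
            System.out.println(sentence);
        }
    }
}
```

Since the loop never calls preceding() or following() with a match offset, the BreakIterator.DONE edge case from the fix above does not come up here.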