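I want to extract the sentences in a text file that contain particular keywords. I have tried a lot, but I cannot get the correct sentences containing a keyword. I have several keywords, and if any one of them matches within a paragraph, that sentence should be taken. For example, if my text file contains words such as robbery, robbed, etc., then that sentence should be extracted. Below is the code I have tried. Is there any way to solve this with a regex? Any help would be appreciated.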
BufferedReader br1 = new BufferedReader(new FileReader("/home/pgrms/Documents/test/one.txt"));
String str = "";
while (br1.ready())
{
    str += br1.readLine() + "\n";
}
Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher match = re.matcher(str);
String sentenceString = "";
while (match.find())
{
    sentenceString = match.group(0);
    System.out.println(sentenceString);
}
Answer 0 (score: 2)
Here is an example for the case where you have a predefined list of keywords:
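import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.*;
public class Tester {
    public static void main(String[] args) {
        // Keywords to look for; extend this list as needed.
        String[] words = {"robbery", "robbed", "robbers"};
        // Join the keywords into one alternation: robbery|robbed|robbers
        String word_re = words[0];
        for (int i = 1; i < words.length; i++)
            word_re += "|" + words[i];
        // A sentence is treated as a run of non-period characters that
        // contains a keyword as a whole word and ends with a period.
        word_re = "[^.]*\\b(" + word_re + ")\\b[^.]*[.]";
        StringBuilder str = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader br1 = new BufferedReader(new FileReader("input"))) {
            String line;
            // Keep a space between lines so that words at line ends do not run together.
            while ((line = br1.readLine()) != null) { str.append(line).append(' '); }
        } catch (IOException e) {
            e.printStackTrace();
            return;
        }
        // Pattern.COMMENTS is left out: it would make the regex ignore spaces inside keywords.
        Pattern re = Pattern.compile(word_re, Pattern.CASE_INSENSITIVE);
        Matcher match = re.matcher(str);
        while (match.find()) {
            System.out.println(match.group(0).trim());
        }
    }
}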
This builds a regex of the form:
[^.]*\b(robbery|robbed|robbers)\b[^.]*[.]
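As a rough sketch of the same idea in a form that is easy to test, the pattern can be built from a list and applied to an in-memory string. The class and method names (KeywordSentenceFinder, buildPattern, findSentences) are made up for illustration. Two choices here are assumptions rather than part of the answer above: the keywords are escaped with Pattern.quote, and '!' and '?' are also treated as sentence terminators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original answer.
final class KeywordSentenceFinder {

    // Builds a pattern that matches one sentence containing any keyword as a whole word.
    static Pattern buildPattern(List<String> keywords) {
        // Pattern.quote keeps regex metacharacters in a keyword literal.
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Assumption: a sentence ends at '.', '!' or '?'.
        return Pattern.compile(
                "[^.!?]*\\b(?:" + alternation + ")\\b[^.!?]*[.!?]",
                Pattern.CASE_INSENSITIVE);
    }

    // Returns every matching sentence, trimmed of surrounding whitespace.
    static List<String> findSentences(String text, List<String> keywords) {
        List<String> result = new ArrayList<>();
        Matcher m = buildPattern(keywords).matcher(text);
        while (m.find()) {
            result.add(m.group().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The bank was robbed last night. Nobody was hurt. "
                + "Police say the robbers escaped!";
        // Prints the first and third sentence, one per line.
        for (String s : findSentences(text, List.of("robbery", "robbed", "robbers"))) {
            System.out.println(s);
        }
    }
}
```

Because of the \b word boundaries, a keyword such as rob would not match inside robot. Abbreviations like "Mr." will still split a sentence early, since the pattern only looks at punctuation.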
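Answer 1 (score: 1)
In general, to check whether a sentence contains rob, robbery, or robbed, you can add a lookahead right after the start-of-string anchor, before the rest of your regex pattern: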
(?=.*(?:rob|robbery|robbed))
In this case, it is more efficient to group rob and then check for the possible suffixes:
(?=.*(?:rob(?:bery|bed)?))
In your Java code, we could (for example) modify your loop like this:
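while (match.find())
{
    sentenceString = match.group(0);
    // String.matches() must consume the whole string, so follow the
    // lookahead with .* and use (?s) so that . also matches newlines.
    if (sentenceString.matches("(?s)(?=.*(?:rob(?:bery|bed)?)).*")) {
        System.out.println(sentenceString);
    }
}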
Regex explanation
(?= # look ahead to see if there is:
.* # any character except \n (0 or more times
# (matching the most amount possible))
(?: # group, but do not capture:
rob # 'rob'
(?: # group, but do not capture (optional
# (matching the most amount possible)):
bery # 'bery'
| # OR
bed # 'bed'
)? # end of grouping
) # end of grouping
) # end of look-ahead
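One detail worth checking when using a lookahead with String.matches(): matches() only succeeds if the pattern consumes the entire string, and a lookahead on its own consumes nothing. The following is a minimal sketch of the difference. The class name LookaheadCheck and its constants are made up for illustration, and the (?s) flag is an added assumption so that a sentence spanning several lines is still handled.

```java
// Hypothetical demo class, not part of the original answer.
final class LookaheadCheck {

    // Lookahead alone: zero-width, so matches() fails on any non-empty string.
    static final String LOOKAHEAD_ONLY = "(?=.*rob(?:bery|bed)?)";

    // Lookahead followed by .* so that matches() can consume the whole sentence.
    // (?s) lets '.' match line terminators as well.
    static final String WHOLE = "(?s)(?=.*rob(?:bery|bed)?).*";

    // True if the sentence contains "rob" anywhere (optionally followed by a suffix).
    static boolean mentionsRob(String sentence) {
        return sentence.matches(WHOLE);
    }

    public static void main(String[] args) {
        String s = "Two men robbed the store.";
        System.out.println(s.matches(LOOKAHEAD_ONLY));        // false
        System.out.println(mentionsRob(s));                   // true
        System.out.println(mentionsRob("Nothing happened.")); // false
    }
}
```

Note that, without word boundaries, this check also accepts any word that merely contains rob, such as "problem". Adding \b around the group would tighten it if that matters for your input.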
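Answer 2 (score: 0)
Take a look at the ICU Project and icu4j. It performs boundary analysis, so it will split text into sentences and words for you, and it does so for different languages.
For the rest, you can either match the words against a pattern (as others have suggested) or check them against the set of words you are interested in.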
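A minimal sketch of the boundary-analysis approach follows. To stay dependency-free it uses the JDK's own java.text.BreakIterator rather than icu4j; that substitution is an assumption for illustration, though icu4j offers a similarly named class with more complete rules. The class and method names (BoundarySentences, sentencesWith, containsKeyword) are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; uses the JDK BreakIterator, not icu4j.
final class BoundarySentences {

    // Splits text into sentences and keeps those containing a keyword.
    // Keywords are expected in lower case.
    static List<String> sentencesWith(String text, Set<String> keywords) {
        List<String> hits = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            String sentence = text.substring(start, end).trim();
            if (containsKeyword(sentence, keywords)) {
                hits.add(sentence);
            }
        }
        return hits;
    }

    // Splits a sentence into word tokens and checks each against the keyword set.
    static boolean containsKeyword(String sentence, Set<String> keywords) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(sentence);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String token = sentence.substring(start, end).toLowerCase(Locale.ENGLISH);
            if (keywords.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "The bank was robbed last night. Nobody was hurt. "
                + "Police say the robbers escaped!";
        for (String s : sentencesWith(text, Set.of("robbery", "robbed", "robbers"))) {
            System.out.println(s);
        }
    }
}
```

Checking tokens against a set avoids building a regex at all, at the cost of requiring exact word forms in the keyword set.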