I have a text file containing URLs of varying complexity. Here is a sample:
https://www.google.com/?gws_rd=ssl
http://www.cs.jhu.edu/news-events/news-articles/
maps.google.com
http://www.cnn.com/WORLD/?hpt=sitenav
http://www.cnn.com/JUSTICE/?hpt=sitenav
http://www.cs.jhu.edu/course-info/
http://e-catalog.jhu.edu/departments-program-requirements-and-courses/engineering/computer-science/
http://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html
http://mexico.cnn.com/?hpt=ed_Mexico
cnn.com
From these lines, I only want to get the "X.Y" part. In other words, from the first 4 lines, I want to get:
google.com
jhu.edu
google.com
cnn.com
To do this, I wrote a regular expression and tried to match against it:
public static void main(String[] args) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader("C:\\Users\\Me\\Desktop\\homework4file.txt"));
    String line = null;
    Pattern pattern = Pattern.compile("^[a-zA-Z0-9\\-\\.]+\\.(com)$");
    Matcher matcher;
    while ((line = reader.readLine()) != null) {
        matcher = pattern.matcher(line);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}
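My regular expression just returns "com" for every line. I don't see what is wrong with what I wrote. Can someone explain the logical error in my expression?
Answer 0 (score: 1)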
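You don't need the anchors. ^ asserts that the match starts at the beginning of the line, but in most of your lines the part before .com is not at the beginning. Likewise, $ requires the line to end right after .com, so any URL with a path does not match at all. The "com" you see comes from matcher.group(1), which returns only the capture group (com); matcher.group(0) returns the whole match.
Even without the anchors, [a-zA-Z0-9\\-\\.]+ includes the dot, so it greedily matches everything after the last / up to .com. In the string http://mexico.cnn.com/?hpt=ed_Mexico, the regex [a-zA-Z0-9\\-\\.]+\\.(com) would match mexico.cnn.com rather than cnn.com. Excluding the dot from the character class fixes that, and putting com and edu in a non-capturing group separated by | also matches the names that end in .edu.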
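To see the effect of the anchors and of group(1) versus group(0), here is a small sketch that runs the question's pattern with and without anchors on three of the sample lines. The class name GroupDemo and the helper describe are made up for illustration; they are not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: returns "group(0) | group(1)" for the first match,
    // or "no match" if the pattern is not found in the line.
    static String describe(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(0) + " | " + m.group(1) : "no match";
    }

    public static void main(String[] args) {
        String anchored = "^[a-zA-Z0-9\\-\\.]+\\.(com)$";   // pattern from the question
        String unanchored = "[a-zA-Z0-9\\-\\.]+\\.(com)";   // same pattern, anchors removed

        // The whole line is a bare host name, so the anchored pattern matches;
        // group(1) is only the capture group, i.e. "com".
        System.out.println(describe(anchored, "maps.google.com"));

        // A scheme and a path are present, so ^ and $ prevent any match.
        System.out.println(describe(anchored, "http://www.cnn.com/WORLD/?hpt=sitenav"));

        // Without anchors there is a match, but the dot inside the character
        // class pulls the subdomain into group(0).
        System.out.println(describe(unanchored, "http://mexico.cnn.com/?hpt=ed_Mexico"));
    }
}
```

The first call reports maps.google.com as the whole match and com as group 1, the second reports no match, and the third reports mexico.cnn.com as the whole match.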
[^.\\n]+\\.(?:com|edu)
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Main {
    public static void main(String[] args) {
        String input = "https://www.google.com/?gws_rd=ssl\n" +
                "http://www.cs.jhu.edu/news-events/news-articles/\n" +
                "maps.google.com\n" +
                "http://www.cnn.com/WORLD/?hpt=sitenav\n" +
                "http://www.cnn.com/JUSTICE/?hpt=sitenav\n" +
                "http://www.cs.jhu.edu/course-info/\n" +
                "http://e-catalog.jhu.edu/departments-program-requirements-and-courses/engineering/computer-science/\n" +
                "http://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html\n" +
                "http://mexico.cnn.com/?hpt=ed_Mexico\n" +
                "cnn.com";
        // One or more characters that are neither a dot nor a newline,
        // followed by ".com" or ".edu" (non-capturing alternation).
        Pattern regex = Pattern.compile("[^.\\n]+\\.(?:com|edu)");
        Matcher matcher = regex.matcher(input);
        while (matcher.find()) {
            // group(0) is the whole match, e.g. "google.com"
            System.out.println(matcher.group(0));
        }
    }
}
Output:
google.com
jhu.edu
google.com
cnn.com
cnn.com
jhu.edu
jhu.edu
oracle.com
cnn.com
cnn.com
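One caveat with [^.\\n]+ is that it also accepts characters such as : and /, so a line like http://cnn.com/ (no subdomain) would come back as http://cnn.com. The sample data has no such line, but if yours might, a stricter variant could look like the sketch below. It assumes that only .com and .edu matter and that a label consists of letters, digits and hyphens; the class name DomainExtractor and the method domainOf are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DomainExtractor {
    // Assumption: a label is letters, digits and hyphens only, so the scheme
    // ("http://") can never be part of the match. The negative lookahead
    // rejects cases where com/edu is only a prefix (e.g. ".community") or
    // is followed by further labels.
    private static final Pattern DOMAIN =
            Pattern.compile("[A-Za-z0-9-]+\\.(?:com|edu)(?![A-Za-z0-9.-])");

    // Returns the first "X.Y" match in the line, or null if there is none.
    static String domainOf(String line) {
        Matcher m = DOMAIN.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "https://www.google.com/?gws_rd=ssl",
            "http://mexico.cnn.com/?hpt=ed_Mexico",
            "http://cnn.com/",
            "maps.google.com"
        };
        for (String line : lines) {
            System.out.println(domainOf(line));
        }
    }
}
```

Taking only the first match per line also avoids picking up something further along in the path or query string that happens to look like a host name.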