Consider this regular expression:
static String AdrPattern="(?:http://www\\.([^/&]+)\\.com/|(?!^)\\G)/?([^/]+)";
I have two small problems:
https://stackoverflow.com
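P.S.: The regex was taken from here and works fine, but these two shortcomings should be fixed.
Edit
With the code below, the regex from this post's answer skips the further path segments and prints only the domain name: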
static String AdrPattern = "(?:(?!\\A)\\G(?:/([^\\s/]+))|http://www\\.([^\\s/&]+)\\.(?:com|net|gov|org)(?:/([^\\s/]+))?)";
static Pattern WebUrlPattern = Pattern.compile(AdrPattern);

WebUrlMatcher = WebUrlPattern.matcher(line);
int cnt = 0;
while (WebUrlMatcher.find()) {
    // This block only runs while cnt == 0; cnt is incremented below once a domain has been written.
    if (cnt == 0) {
        // Group 1: follow-up path segment (continuation match via \G)
        String extractedPath = WebUrlMatcher.group(1);
        if (extractedPath != null) {
            fop.write(prefix.toLowerCase().getBytes());
            fop.write(System.getProperty("line.separator").getBytes());
        }
        if (extractedPath != null) {
            fop.write(extractedPath.toLowerCase().getBytes());
            fop.write(System.getProperty("line.separator").getBytes());
        }
        // Group 2: domain; group 3: optional first path segment
        String extractedPart = WebUrlMatcher.group(2);
        String extracted2 = WebUrlMatcher.group(3);
        if (extractedPart != null) {
            fop.write(extractedPart.toLowerCase().getBytes());
            fop.write(System.getProperty("line.separator").getBytes());
            if (extracted2 != null) {
                fop.write(extracted2.toLowerCase().getBytes());
                fop.write(System.getProperty("line.separator").getBytes());
            }
            cnt = cnt + 1;
        }
    }
}
}
Answer 0 (score: 1)
Here is one way: a slight modification of the current regex.
Then just test the capture groups.
"(?:(?!\\A)\\G(?:/([^\\s/]+))|http://www\\.([^\\s/&]+)\\.(?:com|net)(?:/([^\\s/]+))?)"
(?:
     (?! \A )                  # Not BOS
     \G                        # Start from last match
     (?:
          /
          ( [^\s/]+ )          # (1), Required Next Segment path (or fail)
     )
  |                            # or,
     http://www\.              # New match
     ( [^\s/&]+ )              # (2), Domain
     \.
     (?: com | net )           # Extension
     (?:
          /
          ( [^\s/]+ )          # (3), Optional First Segment path
     )?
)
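As a side note, the commented layout above can probably be kept in Java source by compiling with Pattern.COMMENTS, which tells the engine to skip whitespace and "#" comments inside the pattern. The following is only a sketch: the class name ExpandedUrlPattern and the constant WEB_URL are made up for illustration, and the comments inside the pattern are paraphrased.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder class; only the regex itself comes from the answer.
final class ExpandedUrlPattern {
    // Pattern.COMMENTS: whitespace is ignored and "#" starts a comment
    // that runs to the end of the line, so each line ends with "\n".
    static final Pattern WEB_URL = Pattern.compile(
          "(?:                                   \n"
        + "    (?! \\A ) \\G          # continue where the last match ended\n"
        + "    / ( [^\\s/]+ )         # (1) next path segment\n"
        + "  |                                   \n"
        + "    http://www\\.          # a new URL starts here\n"
        + "    ( [^\\s/&]+ )          # (2) domain\n"
        + "    \\. (?: com | net )    # extension\n"
        + "    (?: / ( [^\\s/]+ ) )?  # (3) optional first path segment\n"
        + ")",
        Pattern.COMMENTS);

    public static void main(String[] args) {
        Matcher m = WEB_URL.matcher("http://www.asfdasdf.net/first/second");
        while (m.find()) {
            // Print the three groups of each match; unused groups are null.
            System.out.println(m.group(1) + " | " + m.group(2) + " | " + m.group(3));
        }
    }
}
```

The group numbers stay the same as in the one-line version, so the rest of the code does not need to change.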
Testing the captures:
Input:
http://www.asfdasdf.net/
http://www.asfdasdf.net/first
http://www.asfdasdf.net/first/second
Output:
** Grp 0 - ( pos 0 , len 23 )
http://www.asfdasdf.net
** Grp 1 - NULL
** Grp 2 - ( pos 11 , len 8 )
asfdasdf
** Grp 3 - NULL
-------------
** Grp 0 - ( pos 28 , len 29 )
http://www.asfdasdf.net/first
** Grp 1 - NULL
** Grp 2 - ( pos 39 , len 8 )
asfdasdf
** Grp 3 - ( pos 52 , len 5 )
first
-------------
** Grp 0 - ( pos 61 , len 29 )
http://www.asfdasdf.net/first
** Grp 1 - NULL
** Grp 2 - ( pos 72 , len 8 )
asfdasdf
** Grp 3 - ( pos 85 , len 5 )
first
-------------
** Grp 0 - ( pos 90 , len 7 )
/second
** Grp 1 - ( pos 91 , len 6 )
second
** Grp 2 - NULL
** Grp 3 - NULL
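Going by the groups shown in the output, a loop along these lines should pick up the domain and every path segment. It is a minimal sketch rather than a drop-in replacement: the class UrlSegments and the method extract are hypothetical names, and the parts are collected into a list instead of being written to a file. The point is that every match is handled. A non-null group 2 means a new URL was found (with group 3 as the optional first segment), otherwise group 1 holds the next segment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; only the regex is taken from the answer.
final class UrlSegments {
    // Group 1 = follow-up segment, group 2 = domain, group 3 = optional first segment.
    private static final Pattern WEB_URL = Pattern.compile(
        "(?:(?!\\A)\\G(?:/([^\\s/]+))|http://www\\.([^\\s/&]+)\\.(?:com|net)(?:/([^\\s/]+))?)");

    // Returns the domain followed by each path segment, lower-cased, in order of appearance.
    static List<String> extract(String line) {
        List<String> parts = new ArrayList<>();
        Matcher m = WEB_URL.matcher(line);
        while (m.find()) {
            if (m.group(2) != null) {
                // A new URL: domain, plus the first segment if there is one.
                parts.add(m.group(2).toLowerCase());
                if (m.group(3) != null) {
                    parts.add(m.group(3).toLowerCase());
                }
            } else if (m.group(1) != null) {
                // A continuation match anchored by \G: the next path segment.
                parts.add(m.group(1).toLowerCase());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(extract("http://www.asfdasdf.net/first/second"));
    }
}
```

Note that the loop has no counter guarding the body, so the continuation matches (the ones where only group 1 is set) are not skipped.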