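I'm trying to create a regular expression so that I can use LucidWorks to crawl and index certain URLs on my website.
Example URL: http://www.example.com/reviews/assassins-creed-revelations/24475/reviews/
Example URL: http://www.example.com/reviews/super-mario-3d-land/64303/reviews/
Basically, I want LucidWorks to crawl my entire site and index only the URLs that have /reviews/ at the end.
Can anyone help me build an expression? :)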
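Update
URL: http://www.example.com/
Include paths: / / * / reviews / *
This partly works, but it only crawls the first page; it does not go on to the following pages that hold more reviews (1, 2, 3, etc.).
If I also add: / / / reviews /.*
I get a whole bunch of pages I don't want, such as http://www.example.com/?page=2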
Answer 0 (score: 0)
Try checking each URL with this function:
import java.util.regex.Pattern;

public boolean canAcceptURL(String url, String endsWith) {
    // Fall back to the default suffix when none is supplied (null or empty).
    if (endsWith == null || endsWith.isEmpty()) {
        endsWith = "/reviews/";
    }
    // Any run of printable ASCII characters followed by the literal suffix.
    // Pattern.quote keeps regex metacharacters in the suffix from being
    // interpreted, and String.matches already requires the whole string to
    // match, so no trailing "$" is needed.
    String regex = "[\\x20-\\x7E]*" + Pattern.quote(endsWith);
    boolean canAccept = url != null && url.matches(regex);
    System.out.println("String matches : " + canAccept);
    return canAccept;
}
Sample output:
Calling the function: canAcceptURL("http://www.example.com/reviews/super-mario-3d-land/64303/reviews/", "/reviews/");
String matches : true
If you want to accept any URL that contains '/reviews/', rather than only URLs that end with it, change the regex string to:
String regex = "[\\x20-\\x7E]*/reviews/[\\x20-\\x7E]*"; // accepts any printable ASCII characters, including spaces and punctuation, before and after /reviews/
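To see how the two variants differ on the URLs from the question, here is a minimal sketch that runs both patterns side by side. The class name ReviewUrlFilter and its method names are made up for illustration, and the third URL is a hypothetical paginated review page, since the question does not show what the real pagination URLs look like. It only tests strings in plain Java; how the resulting pattern is plugged into LucidWorks is not covered here.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of LucidWorks or of the original answer.
public class ReviewUrlFilter {
    // Any run of printable ASCII characters (space through tilde).
    private static final String PRINTABLE = "[\\x20-\\x7E]*";

    // Matches URLs that end with the literal text "/reviews/".
    private static final Pattern ENDS_WITH_REVIEWS =
            Pattern.compile(PRINTABLE + Pattern.quote("/reviews/"));

    // Matches URLs that contain "/reviews/" anywhere.
    private static final Pattern CONTAINS_REVIEWS =
            Pattern.compile(PRINTABLE + Pattern.quote("/reviews/") + PRINTABLE);

    public static boolean endsWithReviews(String url) {
        return ENDS_WITH_REVIEWS.matcher(url).matches();
    }

    public static boolean containsReviews(String url) {
        return CONTAINS_REVIEWS.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.example.com/reviews/super-mario-3d-land/64303/reviews/",
            "http://www.example.com/?page=2",
            // Assumed pagination format, for illustration only.
            "http://www.example.com/reviews/super-mario-3d-land/64303/reviews/?page=2"
        };
        for (String url : urls) {
            System.out.println(url
                    + " endsWith=" + endsWithReviews(url)
                    + " contains=" + containsReviews(url));
        }
    }
}
```

Under that assumed pagination format, the "contains" pattern would keep the paginated review pages while still rejecting http://www.example.com/?page=2, whereas the "ends with" pattern would drop both.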