Question

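I'm trying to create a regular expression so that I can use LucidWorks to crawl and index certain URLs on my website.

Example URL: http://www.example.com/reviews/assassins-creed-revelations/24475/reviews/
Example URL: http://www.example.com/reviews/super-mario-3d-land/64303/reviews/

Basically, I want LucidWorks to crawl my whole site and index only the URLs that have /reviews/ at the end.

Can someone help me build an expression? :)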

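Update

URL: http://www.example.com/

Include paths: / / * / reviews / *

This sort of works, but it only crawls the first page; it does not go on to the following pages that have more reviews (1, 2, 3, etc.).

If I also add: / / / reviews /.*

I get a whole bunch of pages I don't want, such as http://www.example.com/?page=2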

Answer 1

Check with this function:
import java.util.regex.PatternSyntaxException;

public boolean canAcceptURL(String url, String endsWith) {
    boolean canAccept = false;
    try {
        // Fall back to the default suffix when none is supplied.
        if (endsWith == null || endsWith.isEmpty()) {
            endsWith = "/reviews/";
        }
        // Any run of printable ASCII characters followed by the suffix.
        // String.matches() must match the whole input, so the trailing "$" is redundant but harmless.
        String regex = "[\\x20-\\x7E]*" + endsWith + "$";
        canAccept = url.matches(regex);
    } catch (PatternSyntaxException pe) {
        // endsWith is used as a regex fragment, so it may be syntactically invalid.
        pe.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
    System.out.println("String matches : " + canAccept);
    return canAccept;
}

Sample output:
calling function : canAcceptURL("http://www.example.com/reviews/super-mario-3d-land/64303/reviews/","/reviews/");
String matches : true
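
The ends-with check above treats the suffix as a regex fragment. If the suffix should be taken literally instead, a minimal sketch could look like the following. The class name ReviewUrlFilter and the method endsWithSuffix are hypothetical names chosen for illustration, not part of LucidWorks or of the function above.

```java
import java.util.regex.Pattern;

final class ReviewUrlFilter {
    // Hypothetical helper: true when the URL ends with the given literal suffix.
    static boolean endsWithSuffix(String url, String suffix) {
        // Pattern.quote makes the suffix literal, so characters such as '.' or '?'
        // are not interpreted as regex syntax.
        Pattern pattern = Pattern.compile(".*" + Pattern.quote(suffix));
        // matches() requires the whole input to match, so no trailing '$' is needed.
        return pattern.matcher(url).matches();
    }

    public static void main(String[] args) {
        // true
        System.out.println(endsWithSuffix(
                "http://www.example.com/reviews/super-mario-3d-land/64303/reviews/", "/reviews/"));
        // false
        System.out.println(endsWithSuffix("http://www.example.com/?page=2", "/reviews/"));
    }
}
```

Quoting the suffix trades flexibility for safety: the caller can no longer pass a regex as the suffix, but a suffix containing '.' or '?' will not match more than intended.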

If you want to accept any URL that contains '/reviews/', just change the regex string to:

String regex = "[\\x20-\\x7E]*/reviews/[\\x20-\\x7E]*"; // accepts spaces and other printable ASCII characters on either side of /reviews/
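
To see how the "contains" expression behaves on the URLs from the question, here is a small sketch. The class name ReviewsContainsDemo is hypothetical, and the paginated review URL is an assumed shape used only for illustration; the real site's pagination URLs may look different.

```java
import java.util.regex.Pattern;

final class ReviewsContainsDemo {
    // Same expression as above: printable ASCII, then "/reviews/", then printable ASCII.
    static final Pattern CONTAINS_REVIEWS =
            Pattern.compile("[\\x20-\\x7E]*/reviews/[\\x20-\\x7E]*");

    static boolean containsReviews(String url) {
        return CONTAINS_REVIEWS.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.example.com/reviews/assassins-creed-revelations/24475/reviews/",
            // Assumed pagination URL, for illustration only.
            "http://www.example.com/reviews/assassins-creed-revelations/24475/reviews/?page=2",
            "http://www.example.com/?page=2"
        };
        for (String url : urls) {
            System.out.println(containsReviews(url) + "  " + url);
        }
    }
}
```

Under these assumptions the first two URLs are accepted and the site-wide http://www.example.com/?page=2 is rejected, because it has no /reviews/ segment.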
