Recognize an absolute web URL even without a scheme

Time: 2016-04-04 16:28:31

Tags: java regex url url-scheme absolute-path

I am working on a Java method that differentiates between absolute and relative URLs the way a browser address bar would, rather than the way a strict URL parser would. That is, I want it to recognize a URL as absolute if it starts with a host, whether or not the scheme is present. That way, it correctly recognizes scheme-relative URLs (like //example.com) and URLs with the scheme completely omitted (like example.com, wikipedia.org, lots.and-lots.of.domains.com.ng). The method I'm currently using looks something like this:

public String checkPossiblyAbsolute(String url) {
    if (url.matches("^(\\/\\/)?([-_A-Za-z0-9]+\\.)+\\w{2,3}(\\/.*)?$")) {
        if (url.startsWith("//")) url = "http:" + url;
        else url = "http://" + url;
    }
    return url;
}

Basically, it checks for dot-separated sequences of the characters A-Z, a-z, 0-9, -, and _ where the last sequence (the TLD) is 2 or 3 word characters (\w, so letters, digits, or _). The string may also start with an optional //. My tests work the way I expected, but I really want to find an easier (or at least more readable) way to do this. Any thoughts?

1 answer:

Answer 0 (score: 0)

Unfortunately, Java does not let you avoid double-escaping backslashes in a regex string literal. (Some languages allow verbatim strings such as @"une\scapedRegex".)

There are some modifications you can make to the regex syntax, however.

  • \\. can become [.]. Not shorter, but IMHO more readable.
  • Same with \\/: make it [/].
  • You can get rid of A-Z if you use case-insensitive mode (Pattern.CASE_INSENSITIVE). May not be worth it when you have only one A-Z.
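As a rough sketch of what those three changes look like together (the class name UrlCheck and the constant HOST_FIRST are made up for illustration, and the pattern is otherwise the one from the question):

```java
import java.util.regex.Pattern;

// Hypothetical holder class; only checkPossiblyAbsolute comes from the question.
final class UrlCheck {
    // [.] and [/] replace \\. and \\/ ; A-Z is dropped because the
    // CASE_INSENSITIVE flag makes a-z match upper case as well.
    // \\w still needs the doubled backslash.
    private static final Pattern HOST_FIRST = Pattern.compile(
            "^([/][/])?([-_a-z0-9]+[.])+\\w{2,3}([/].*)?$",
            Pattern.CASE_INSENSITIVE);

    static String checkPossiblyAbsolute(String url) {
        if (HOST_FIRST.matcher(url).matches()) {
            // Scheme-relative URLs only need the scheme name prepended.
            return url.startsWith("//") ? "http:" + url : "http://" + url;
        }
        return url;
    }
}
```

Note that / is not a special character in Java regex, so a bare / would work just as well as [/]. Compiling the Pattern once into a constant also avoids recompiling it on every call, which String.matches does.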

There's not much more you can do, except put things in variables. Again, may not be worth it if you have only a few redundancies, but it could improve readability. You're using Java, so you're not winning code-golf, anyways.
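If you do go the variables route, it might look something like this. This is only a sketch: the class and constant names (UrlPatternParts, LABEL, TLD, and so on) and the helper startsWithHost are invented here, and the pieces are the question's regex cut up, not a more correct host grammar.

```java
import java.util.regex.Pattern;

// Hypothetical names throughout; the regex pieces are taken from the question.
final class UrlPatternParts {
    static final String SCHEME_RELATIVE_PREFIX = "(//)?";    // optional leading //
    static final String LABEL = "[-_a-z0-9]+";                // one dot-separated label
    static final String TLD = "\\w{2,3}";                     // last label, 2 or 3 word characters
    static final String PATH = "(/.*)?";                      // optional path after the host

    static final Pattern HOST_FIRST = Pattern.compile(
            "^" + SCHEME_RELATIVE_PREFIX + "(" + LABEL + "[.])+" + TLD + PATH + "$",
            Pattern.CASE_INSENSITIVE);

    // True when the string starts with something that looks like a host.
    static boolean startsWithHost(String url) {
        return HOST_FIRST.matcher(url).matches();
    }
}
```

The regex itself is no shorter, but each piece now has a name you can read and change on its own, for example widening TLD without touching the rest.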