Question

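I want to find a URL in a string. I have found many threads about doing this with a regular expression, but I have run into a problem. I am using this pattern: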

String regex = "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" + 
            "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" + 
            "|mil|biz|info|mobi|name|aero|jobs|museum" + 
            "|travel|[a-z]{2}))(:[\\d]{1,5})?" + 
            "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" + 
            "((\\?([-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" + 
            "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" + 
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" + 
            "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b";

It works well for most pages, but I have a problem with some others. For example:

http://hello.com/hello world

returns

http://hello.com/hello

The problem is the space.

Does anyone have a good pattern that solves this?

Thanks.

Edit: this is my code

private ArrayList<String> pullLinks(String text) {
    ArrayList<String> links = new ArrayList<String>();

    String regex = "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" + 
            "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" + 
            "|mil|biz|info|mobi|name|aero|jobs|museum" + 
            "|travel|[a-z]{2}))(:[\\d]{1,5})?" + 
            "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" + 
            "((\\?([-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" + 
            "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" + 
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" + 
            "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b";

    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(text);
    while (m.find()) {
        String urlStr = m.group();
        if (urlStr.startsWith("(") && urlStr.endsWith(")")) {
            urlStr = urlStr.substring(1, urlStr.length() - 1);
        }
        links.add(urlStr);
    }
    return links;
}

Answer 1

Spaces are not allowed in URLs (they have to be encoded as %20). See, for example, the answers to this question:

Spaces in URLs?

If you allowed URLs to contain spaces, how would you handle a case such as http://www.google.com/ig is a nice webpage? Clearly, the part after /ig should not be included!
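
To illustrate the %20 point, here is a minimal sketch using the standard java.net.URI class. The class name SpaceEncoding and its two helper methods are hypothetical names chosen for this example, and it assumes the host and path are already known separately rather than extracted from free text.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class, for illustration only.
final class SpaceEncoding {

    // Builds an http URI from a host and a raw path. The multi-argument
    // URI constructor percent-encodes characters that are illegal in a
    // path, so a space becomes %20.
    static String encode(String host, String path) {
        try {
            return new URI("http", host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns true if the string parses as a URI as-is. The single-argument
    // constructor does no encoding, so a raw space is rejected.
    static boolean isValidUri(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("hello.com", "/hello world"));
        System.out.println(isValidUri("http://hello.com/hello world"));
        System.out.println(isValidUri("http://hello.com/hello%20world"));
    }
}
```

With the example from the question, the raw string http://hello.com/hello world is rejected, while the encoded form http://hello.com/hello%20world is accepted.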

Answer 2

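A space is not a valid URL character.

Also, if you don't use whitespace as the terminator, how would you find the end of the URL?

Your regex also fails to account for other top-level domains (such as .int). I'm really not sure why you are looking for specific TLDs at all, since they are not required to form a valid URL.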
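
As a rough sketch of that idea, the pattern below treats whitespace as the terminator and does not list any TLDs. It is not the asker's pattern: the class name LinkExtractor, the restriction to http, https and ftp schemes, and the trimming of trailing punctuation are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class LinkExtractor {

    // A scheme, "://", then everything up to the next whitespace,
    // double quote, or angle bracket. No TLD list is involved.
    private static final Pattern URL = Pattern.compile(
            "\\b(?:https?|ftp)://[^\\s<>\"]+", Pattern.CASE_INSENSITIVE);

    static List<String> pullLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            // Assumption: trailing punctuation belongs to the surrounding
            // sentence rather than to the URL, so strip it.
            links.add(m.group().replaceAll("[.,;:!?)]+$", ""));
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(pullLinks(
                "See http://hello.com/hello world and https://example.int/a?b=c."));
    }
}
```

For the input in main, this yields http://hello.com/hello and https://example.int/a?b=c. The first result matches what the question reports, which is the expected behavior once whitespace is accepted as the end of a URL.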

Find a URL in a String
