How to get the URL from src, url() and href in HTML

Date: 2013-11-16 13:28:34

Tags: java html regex url src

I have this HTML code. I want to pull out the links that appear in three separate places: href, src and url(). This is what I have tried so far:

    String texto2 = "url(\"primeiro url\")\n" +
    "url('2 url')\n" +
    "href=\"1 href\"\n" +
    "src=\"1 src\"\n" +
    "src='2 src'\n" +
    "url('3 url')\n" +
    "\n" +
    ".camera_target_content .camera_link {\n" +
    "   background: url(../images/blank.gif);\n" +
    "   display: block;\n" +
    "   height: 100%;\n" +
    "   text-decoration: none;\n" +
    "}";

    // requires: import java.util.regex.Matcher; import java.util.regex.Pattern;

    // expression that picks up the links in src, href and url(...)
    String exp = "(?:href|src)=[\"'](.+)[\"']+|(?:url)\\([\"']*(.*)[\"']*\\)";
    Pattern pattern = Pattern.compile(exp);

    // prepare the matcher for the input text
    Matcher matcher = pattern.matcher(texto2);

    // walk over the matches and print the full matched text of each one
    while (matcher.find()) {
        System.out.println(texto2.substring(matcher.start(), matcher.end()));
    }

So far, so good. What I still need is to get just the clean link, like this:

    img/image.gif

instead of:

    href="img/image.gif"
    src="img/image.gif"
    url(img/image.gif)

To get only the link, I tried reading a capture group instead of the whole match; this is what I have tried so far:

    String texto2 = "url(\"primeiro url\")\n" +
    "url('2 url')\n" +
    "href=\"1 href\"\n" +
    "src=\"1 src\"\n" +
    "src='2 src'\n" +
    "url('3 url')\n" +
    "\n" +
    ".camera_target_content .camera_link {\n" +
    "   background: url(../images/blank.gif);\n" +
    "   display: block;\n" +
    "   height: 100%;\n" +
    "   text-decoration: none;\n" +
    "}";

    // requires: import java.util.regex.Matcher; import java.util.regex.Pattern;

    // expression that picks up the links in src, href and url(...)
    String exp = "(?:href|src)=[\"'](.+)[\"']+|(?:url)\\([\"']*(.*)[\"']*\\)";
    Pattern pattern = Pattern.compile(exp);

    // prepare the matcher for the input text
    Matcher matcher = pattern.matcher(texto2);

    // walk over the matches and print only capture group 2 of each one
    while (matcher.find()) {
        String s = matcher.group(2);
        System.out.println(s);
    }

It turns out this version does not work properly. It picks up the url(...) values fine; can someone help me spot the problem?

1 Answer:

Answer 0: (score: 0)

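Use jsoup. Parse the HTML string into a DOM, and then you can use CSS selectors to extract the values, much as you would with jQuery in JavaScript. Note that this only applies if you are actually working with HTML; the string at the top of your example is not HTML.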
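
For input that is not real HTML, such as the sample string in the question, a regex can still work. What follows is only a minimal sketch, not a definitive fix. It assumes the trouble comes from two things in the original pattern: the href/src value lands in group 1 while the url(...) value lands in group 2, so printing only group 2 misses href and src, and the greedy `.+` / `.*` can swallow a closing quote. The class name `LinkExtractor` and the method `extractLinks` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Alternative 1: href/src = quote (group 1), value (group 2), same quote again (\1).
    // Alternative 2: url( optional quote (group 3), value (group 4), same quote again (\3) ).
    // The reluctant quantifier .*? stops at the first place the rest of the pattern fits.
    private static final Pattern LINK = Pattern.compile(
            "(?:href|src)\\s*=\\s*([\"'])(.*?)\\1|url\\(\\s*([\"']?)(.*?)\\3\\s*\\)");

    // Hypothetical helper: returns the bare link values in order of appearance.
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            // Only one alternative matches per hit, so exactly one value group is non-null.
            String value = m.group(2) != null ? m.group(2) : m.group(4);
            links.add(value);
        }
        return links;
    }

    public static void main(String[] args) {
        String sample = "href=\"1 href\"\n"
                + "src='2 src'\n"
                + "url(\"primeiro url\")\n"
                + "background: url(../images/blank.gif);\n";
        for (String link : extractLinks(sample)) {
            System.out.println(link);
        }
    }
}
```

The backreferences `\1` and `\3` require the closing quote to be the same character as the opening one, which is what lets the unquoted `url(../images/blank.gif)` form and both quoted forms share one pattern. A regex like this will still misread edge cases such as values that contain parentheses or escaped quotes, so for genuine HTML a parser such as jsoup remains the safer choice.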