Java regex to retrieve links from text

Date: 2018-11-22 02:40:00

Tags: java regex string url text

My input String is:

String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";

I want to convert this text to:

Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it

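So here:

1) I want to replace the link tag with the plain link. If the tag has a label, it should follow the URL in parentheses.

2) If the URL is relative, I want to prefix it with the base URL (http://www.google.com).

3) I want to append a parameter to the URL (&myParam=pqr).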
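Requirement 2 can be sketched with java.net.URI.resolve, which leaves absolute URLs alone and resolves relative ones against a base. This is only an illustration: the class name BaseUrlSketch and the method toAbsolute are made up here, and the base URL is the one from the expected output.

```java
import java.net.URI;

class BaseUrlSketch {
    // Assumed base URL. The trailing slash matters: it gives the base a
    // non-empty path, so hrefs without a leading slash resolve cleanly.
    static final URI BASE = URI.create("http://www.google.com/");

    // Resolves href against BASE; an href that is already absolute is returned unchanged.
    // Throws IllegalArgumentException if href is not a syntactically valid URI.
    static String toAbsolute(String href) {
        return BASE.resolve(href).toString();
    }

    public static void main(String[] args) {
        // prints http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz
        System.out.println(toAbsolute("/relative-path/fruit.cgi?param1=abc&param2=xyz"));
        // prints https://example.org/a?b=c
        System.out.println(toAbsolute("https://example.org/a?b=c"));
    }
}
```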

I am having trouble retrieving the tag together with its URL and label, and then replacing it.

I wrote something like this:

public static void main(String[] args) {
    String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";
    text = text.replaceAll("&lt;", "<");
    text = text.replaceAll("&gt;", ">");
    text = text.replaceAll("&amp;", "&");

    // this is not working
    Pattern p = Pattern.compile("href=\"(.*?)\"");
    Matcher m = p.matcher(text);
    String url = null;
    if (m.find()) {
        url = m.group(1);

    }
}

// helper method to append new query params once I have the url
public static URI appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
    URI oldUri = new URI(uriToUpdate);
    String newQueryParams = oldUri.getQuery();
    if (newQueryParams == null) {
        newQueryParams = queryParamsToAppend;
    } else {
        newQueryParams += "&" + queryParamsToAppend;  
    }
    URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
            oldUri.getPath(), newQueryParams, oldUri.getFragment());
    return newUri;
}
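To see what the helper does with relative and absolute inputs, here is a small check. It is a sketch under assumptions: the class name AppendParamsSketch and the method appendParam are invented for this example, the method returns a String, and the checked URISyntaxException is wrapped so the calls stay short. Note that getQuery() returns the decoded query, so inputs with percent-encoded characters in the query may need different handling, which this sketch does not cover.

```java
import java.net.URI;
import java.net.URISyntaxException;

class AppendParamsSketch {
    // Same steps as the helper above, but returning a String and
    // rethrowing URISyntaxException as an unchecked exception.
    static String appendParam(String uriToUpdate, String queryParamsToAppend) {
        try {
            URI oldUri = new URI(uriToUpdate);
            String query = oldUri.getQuery();
            // No existing query: the new parameter becomes the whole query.
            query = (query == null) ? queryParamsToAppend : query + "&" + queryParamsToAppend;
            return new URI(oldUri.getScheme(), oldUri.getAuthority(),
                    oldUri.getPath(), query, oldUri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints /relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr
        System.out.println(appendParam("/relative-path/fruit.cgi?param1=abc&param2=xyz", "myParam=pqr"));
        // prints http://www.google.com/x?myParam=pqr
        System.out.println(appendParam("http://www.google.com/x", "myParam=pqr"));
    }
}
```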

Edit 1:

Pattern p = Pattern.compile("HREF=\"(.*?)\"");

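This works. However, I want the match to be case-insensitive: Href, HRef, href, hrEF and so on should all work.

Also, how do I handle text that contains multiple URLs?

Edit 2:

Some progress.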

Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
  url = m.group(1);
  System.out.println(url);
}

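This handles the case of multiple URLs.

The last unsolved problem is how to get the label and replace the link tag in the original text with the URL and label.

Edit 3:

By multiple URLs, I mean that the given text contains more than one link.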

String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
 url = m.group(1); // this variable should contain the link URL
 url = appendBaseURI(url);
 url = appendQueryParams(url, "license=ABCXYZ").toString(); // the helper above returns a URI
 System.out.println(url);
}

4 Answers:

Answer 0: (score: 1)

You can use StringEscapeUtils from Apache Commons Text to decode the HTML entities, and then use replaceAll to do the substitution, i.e.:

import org.apache.commons.text.StringEscapeUtils;

String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it";
String output = StringEscapeUtils.unescapeHtml4(text).replaceAll("([^<]+).+\"(.*?)\">(.*?)<[^>]+>(.*)", "$1https://google.com$2&your_param ($3)$4");
System.out.print(output);
// Some content which contains link as https://google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&your_param (URL Label) and some text after it

Demo:

  1. jdoodle
  2. Regex Explanation

Answer 1: (score: 1)

public static void main(String args[]) throws URISyntaxException {
    String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";
    text = StringEscapeUtils.unescapeHtml4(text);
    Pattern p = Pattern.compile("<a href=\"(.*?)\">(.*?)</a>", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(text);
    while (m.find()) {
        text = text.replace(m.group(0), cleanUrlPart(m.group(1), m.group(2)));
    }
    System.out.println(text);
}

// Declares URISyntaxException because appendQueryParams (from the question) can throw it.
private static String cleanUrlPart(String url, String label) throws URISyntaxException {
    if (!url.startsWith("http") && !url.startsWith("www")) {
        if (url.startsWith("/")) {
            url = "http://www.google.com" + url;
        } else {
            url = "http://www.google.com/" + url;
        }
    }
    url = appendQueryParams(url, "myParam=pqr").toString();
    if (label != null && !label.isEmpty()) url += " (" + label + ")";
    return url;
}

Output:

Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it and another link http://www.google.com/relative-path/vegetables.cgi?param1=abc&param2=xyz&myParam=pqr (URL2 Label) and some more text

Answer 2: (score: 0)

// this is not working

That is because the regex is case-sensitive.

Try:

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);

Edit 1
To get the label, use Pattern.compile("(?<=>).*?(?=</a>)", Pattern.CASE_INSENSITIVE) and read m.group(0).

Edit 2
To replace the whole tag (including its label) with your final string, use:

text.replaceAll("(?i)<a href=\"(.*?)</a>", "new substring here")
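The replacement string passed to replaceAll can refer to capture groups as $1, $2 and so on, which is enough when the URL and label only need to be rearranged. A minimal sketch, assuming the entities are already decoded and that href is the only attribute on the tag; the class name ReplaceAllGroupsSketch and the method rewrite are made up for this example:

```java
class ReplaceAllGroupsSketch {
    // Rewrites <a href="URL">LABEL</a> to "URL (LABEL)".
    static String rewrite(String text) {
        // (?i) makes the match case-insensitive; group 1 is the href value, group 2 the label.
        return text.replaceAll("(?i)<a\\s+href=\"(.*?)\"\\s*>(.*?)</a>", "$1 ($2)");
    }

    public static void main(String[] args) {
        String text = "link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and more";
        // prints link as /relative-path/fruit.cgi?param1=abc&param2=xyz (URL Label) and more
        System.out.println(rewrite(text));
    }
}
```

This form cannot run per-link logic such as prefixing the base URL only for relative links or appending a query parameter; that needs a Matcher loop.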

Answer 3: (score: 0)

Almost there:

public static void main(String[] args) throws URISyntaxException {
        String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";
        text = StringEscapeUtils.unescapeHtml4(text);
        System.out.println(text);
        System.out.println("**************************************");
        Pattern patternTag = Pattern.compile("<a([^>]+)>(.+?)</a>", Pattern.CASE_INSENSITIVE);
        Pattern patternLink = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
        Matcher matcherTag = patternTag.matcher(text);

        while (matcherTag.find()) {
            String href = matcherTag.group(1); // href
            String linkText = matcherTag.group(2); // link text
            System.out.println("Href: " + href);
            System.out.println("Label: " + linkText);
            Matcher matcherLink = patternLink.matcher(href);
            String finalText = null;
            while (matcherLink.find()) {
                String link = matcherLink.group(1);
                System.out.println("Link: " + link);
                finalText = getFinalText(link, linkText);
                break;
            }
            System.out.println("***************************************");
            // replacing logic goes here
        }
        System.out.println(text);
    }

    public static String getFinalText(String link, String label) throws URISyntaxException {
        link = appendBaseURI(link);
        link = appendQueryParams(link, "myParam=ABCXYZ");
        return link + " (" + label + ")";
    }

    public static String appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
        URI oldUri = new URI(uriToUpdate);
        String newQueryParams = oldUri.getQuery();
        if (newQueryParams == null) {
            newQueryParams = queryParamsToAppend;
        } else {
            newQueryParams += "&" + queryParamsToAppend;  
        }
        URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
                oldUri.getPath(), newQueryParams, oldUri.getFragment());
        return newUri.toString();
    }

    public static String appendBaseURI(String url) {
        String baseURI = "http://www.google.com/";
        if (url.startsWith("/")) {
            url = url.substring(1, url.length());
        }
        if (url.startsWith(baseURI)) {
            return url;
        } else {
            return baseURI + url;
        }
    }
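The loop above leaves the replacement step open ("replacing logic goes here"). One way to fill it in is Matcher.appendReplacement together with Matcher.appendTail, which rebuild the text while substituting each match. The following is a sketch, not the answer's own code: the class name LinkRewriterSketch and the method rewrite are invented here, it assumes the HTML entities have already been decoded, it uses URI.resolve instead of appendBaseURI, and it wraps URISyntaxException in an unchecked exception.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkRewriterSketch {
    // Assumed base URL (with trailing slash so relative hrefs resolve cleanly).
    private static final URI BASE = URI.create("http://www.google.com/");

    // Group 1: href value, group 2: link label. Other attributes on the tag are skipped.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*?\\bhref=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String rewrite(String text, String extraParam) {
        Matcher m = ANCHOR.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // resolve() throws IllegalArgumentException for hrefs that are not valid URIs.
            URI abs = BASE.resolve(m.group(1));
            String query = abs.getQuery() == null ? extraParam : abs.getQuery() + "&" + extraParam;
            String url;
            try {
                url = new URI(abs.getScheme(), abs.getAuthority(), abs.getPath(),
                        query, abs.getFragment()).toString();
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException(e);
            }
            String label = m.group(2).trim();
            String replacement = label.isEmpty() ? url : url + " (" + label + ")";
            // quoteReplacement keeps any '$' or '\' in the URL or label literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "first <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A>"
                + " second <a href=\"/relative-path/vegetables.cgi\">URL2 Label</a> end";
        System.out.println(rewrite(text, "myParam=pqr"));
    }
}
```

Because each match is processed in Java code rather than in a replacement template, the base URL and the extra parameter can be applied per link, and text outside the tags is copied through untouched.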