Question

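I have a string containing the value given below. I want to replace the HTML img tag that contains a specific customerId with some new text. I tried a small Java program, but it does not give me the expected output. Here are the details of the program.

My input string is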

String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/>" + "someText<img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

The regular expression is

String regex = "(?s)\\<img.*?customerId=3340.*?>";

The new text I want to put into the input string

Edit start:

String newText = "<img src=\"getCustomerNew.do\">";

Edit end:

Now I am doing

String outputText = inputText.replaceAll(regex, newText);

The output is

Starting here.. Replacing Text ..Ending here

But my expected output is

Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/>someTextReplacing Text ..Ending here

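Note that in my expected output only the img tag containing customerId=3340 is replaced with the replacement text. I don't understand why both img tags get replaced in the actual output.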

Answer 1

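You have "wildcard"/"any" patterns (.*) in there, which extend the match to the longest possible matching string, and the last fixed text in the pattern is a > character, which therefore matches the last > character in the input text, i.e. the very last one!

You should be able to fix this by changing the .* parts to something like [^>]+ so that the match does not run past the first > character.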
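The negated-class idea can be sketched as below. This is a minimal illustration rather than a definitive fix: the class name NegatedClassDemo and the helper replaceTag are made up for the example, the input is copied from the question, and [^>]* is used instead of [^>]+ so that the attribute may sit anywhere inside the tag. Note that the question's pattern already uses the reluctant form .*?, so the over-match comes from the match starting at the first <img; a class that cannot step over > prevents that in either case.

```java
import java.util.regex.Matcher;

public class NegatedClassDemo {
    // Input string copied from the question.
    static final String INPUT =
        "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/>"
        + "someText<img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

    // [^>] cannot step over the '>' that closes a tag, so a match that
    // starts at one <img can never reach into the following tag.
    static final String REGEX = "<img[^>]*customerId=3340[^>]*>";

    // Hypothetical helper: replaces every matching img tag with the given text.
    static String replaceTag(String input, String replacement) {
        // quoteReplacement keeps any '$' or '\' in the replacement literal.
        return input.replaceAll(REGEX, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Only the tag carrying customerId=3340 is replaced.
        System.out.println(replaceTag(INPUT, "Replacing Text"));
    }
}
```

The pattern would still match a longer id such as customerId=33401; appending (?!\d) after the id is one way to rule that out if it matters.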

Parsing HTML with regular expressions is bound to cause pain.

Answer 2

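As others have told you in the comments, HTML is not a regular language, so manipulating it with regular expressions is usually painful. Your best option is to use an HTML parser. I have not used Jsoup before, but after a bit of searching it seems you need something like this: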

import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class MyJsoupExample {
    public static void main(String[] args) {
        String inputText = "<html><head></head><body><p><img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123\"/></p>"
            + "<p>someText <img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456\"/></p></body></html>";
        Document doc = Jsoup.parse(inputText);
        // Select every img whose src attribute contains the given substring.
        Elements myImgs = doc.select("img[src*=customerId=3340]");
        for (Element element : myImgs) {
            // Swap the whole img node for a plain text node.
            // Newer jsoup releases deprecate this two-argument constructor
            // in favour of new TextNode(String).
            element.replaceWith(new TextNode("my replaced text", ""));
        }
        System.out.println(doc.toString());
    }
}

Basically, the code gets a list of img nodes whose src attribute contains the given string

Elements myImgs = doc.select("img[src*=customerId=3340]");

and then iterates over the list, replacing each of those nodes with some text.

Update

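If you don't want to replace the whole img node with text, but instead need to give its src attribute a new value, you can replace the body of the for loop with:

element.attr("src", "my new value");

Or, if you only want to change part of the src value, read the current value with element.attr("src"), modify that string, and write it back with the same two-argument call.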

This is very similar to what I posted in this thread.

Answer 3

Your regex starts matching at the first img tag and then consumes everything (greedy or not) until it finds customerId=3340, and then keeps consuming everything until it finds a >.
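To see this span directly, a small sketch along the following lines prints what the question's pattern actually matches. The class name MatchSpanDemo and the helper firstMatch are hypothetical names for the illustration; the input and the pattern are copied from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchSpanDemo {
    // Input string copied from the question.
    static final String INPUT =
        "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/>"
        + "someText<img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

    // Hypothetical helper: returns the first substring matched by the
    // question's pattern, or null when nothing matches.
    static String firstMatch(String input) {
        Matcher m = Pattern.compile("(?s)\\<img.*?customerId=3340.*?>").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The printed match begins at the first <img (customerId=3334)
        // and runs through the end of the second tag.
        System.out.println(firstMatch(INPUT));
    }
}
```

The reluctant quantifier only makes the match end as early as possible; it does not move the starting point, which the engine fixes at the leftmost position where a match can succeed.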

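If you want it to consume only the img with customerId=3340, think about what makes this tag different from the other tags that might match.

In this particular case, one possible solution is to use a lookbehind operator (which does not consume what it matches) to check what comes right before the img tag. This regex will work: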

String regex = "(?<=</p>)<img src=\".*?customerId=3340.*?>";
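A lookbehind only helps when it names text that really sits immediately before the target tag. The input string as printed in the question has no </p> before the second img, so the sketch below assumes that the literal someText is the distinguishing text instead; that choice, the class name LookbehindDemo, and the helper replaceTag are assumptions made for illustration.

```java
import java.util.regex.Matcher;

public class LookbehindDemo {
    // Input string copied from the question.
    static final String INPUT =
        "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/>"
        + "someText<img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

    // (?<=someText) asserts that "someText" directly precedes the tag
    // without making it part of the match, so it survives the replacement.
    static final String REGEX = "(?<=someText)<img src=\".*?customerId=3340.*?>";

    // Hypothetical helper: replaces the img tag that follows "someText".
    static String replaceTag(String input, String replacement) {
        return input.replaceAll(REGEX, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceTag(INPUT, "Replacing Text"));
    }
}
```

This approach is tied to the exact surrounding text, so it breaks as soon as the markup before the tag changes.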

Why doesn't this regex give the expected output?
