Question

I have a bunch of HTML files. In these files I need to correct the src attribute of the IMG tags. The IMG tags typically look like this:

<img alt="" src="./Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />

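where the attributes are not in any particular order. I need to remove the dot and the forward slash at the beginning of the src attribute of the IMG tags so that they look like this: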

<img alt="" src="Suitbert%20%E2%80%93%20Wikipedia_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />

So far I have the following class:

import java.util.regex.*;


public class Replacer {

    // this PATTERN should find all img tags with 0 or more attributes before the src-attribute
    private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
    private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";
    private static final Pattern COMPILED_PATTERN = Pattern.compile(PATTERN,  Pattern.CASE_INSENSITIVE);


    public static void findMatches(String html){
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        // Check all occurrences
        System.out.println("------------------------");
        System.out.println("Following Matches found:");
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }
        System.out.println("------------------------");
    }

    public static String replaceMatches(String html){
        //Pattern replace = Pattern.compile("\\s+");
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        html = matcher.replaceAll(REPLACEMENT);
        return html;
    }
}

So my method findMatches(String html) seems to correctly find all IMG tags whose src attribute starts with ./.

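Now my method replaceMatches(String html) does not replace the matches correctly. I am new to regex, but I assume that the REPLACEMENT regex is incorrect, or my use of the replaceAll method, or both. As you can see, the replacement string contains two parts that are identical in all IMG tags: <img and src="./. Between these two parts there should be the 0 or more HTML attributes from the original string. How do I formulate such a REPLACEMENT string? Can somebody please enlighten me?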

Answer 1

Don't use regex for HTML. Use a parser, get the src attribute, and replace it.

Answer 2

Try these:

PATTERN = "(<img[^>]*\\ssrc=\")\\./"
REPLACEMENT = "$1"

Basically, you capture everything except the ./ in group #1, then plug it back in using the $1 placeholder, effectively stripping off the ./.
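
As a minimal sketch of the capture-group approach just described, the following self-contained class applies that pattern and the $1 replacement to a tag like the one in the question. The class name SrcPrefixStripper and the method name stripDotSlash are made up for illustration and do not come from the original post.

```java
import java.util.regex.Pattern;

public class SrcPrefixStripper {

    // Group 1 captures "<img", any attributes before src, and the opening src=" ;
    // the trailing \./ is matched but deliberately left outside the group.
    private static final Pattern IMG_SRC_DOT_SLASH =
            Pattern.compile("(<img[^>]*\\ssrc=\")\\./", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: re-inserts group 1 via $1, which drops the "./".
    public static String stripDotSlash(String html) {
        return IMG_SRC_DOT_SLASH.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String html = "<img alt=\"\" src=\"./Suitbert_files/233px-Suitbertus.jpg\" class=\"thumbimage\" />";
        System.out.println(stripDotSlash(html));
    }
}
```

Tags whose src does not start with ./ are not matched at all, so they pass through unchanged.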

Notice how I changed your .* to [^>]*. If there happened to be two IMG tags on the same line, like this:

<img src="good" /><img src="./bad" />

...your regex would match this:

<img src="good" /><img src="./

It would do that even if you used a non-greedy .*?. Using [^>]* makes sure the match is always contained within one tag.
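
The difference can be checked with a small sketch that runs three variants of the pattern against the two-tag line. The class TagBoundaryDemo and the helper firstMatch are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagBoundaryDemo {

    // Hypothetical helper: returns the first match of regex in html, or null if there is none.
    public static String firstMatch(String regex, String html) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<img src=\"good\" /><img src=\"./bad\" />";

        // Greedy and non-greedy dot may both run past the first tag's '>'.
        System.out.println(firstMatch("<img.*\\ssrc=\"\\./", html));
        System.out.println(firstMatch("<img.*?\\ssrc=\"\\./", html));

        // [^>]* cannot cross a '>', so the match stays inside a single tag.
        System.out.println(firstMatch("<img[^>]*\\ssrc=\"\\./", html));
    }
}
```

The first two calls return a span that begins at the first tag and ends inside the second one; only the [^>]* variant returns a match confined to the second tag.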

Answer 3

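Your replacement is incorrect. It replaces the matched string with the replacement string as-is (it is not interpreted as a regex). To achieve what you want, you need to use groups. A group is delimited by parentheses in the regex; each opening parenthesis starts a new group. You can use $i in the replacement string to reproduce whatever group i matched, where i is the group number. See the documentation of appendReplacement for details.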

// Here is an example (it looks a bit like your case but not exactly)
String input = "<img name=\"foobar\" src=\"img.png\">";
String regexp = "<img(.+)src=\"[^\"]+\"(.*)>";
Matcher m = Pattern.compile(regexp).matcher(input);
StringBuffer sb = new StringBuffer();
while(m.find()) {
    // Found a match!
    // Append all chars before the match, then replace the match with the
    // replacement (the replacement refers to groups 1 & 2 with $1 & $2,
    // which match respectively everything between '<img' and 'src' and
    // everything after the src value up to the closing '>')
    m.appendReplacement(sb, "<img$1src=\"something else\"$2>");
}
m.appendTail(sb);// No more match, we append the end of input
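
To see why the question's REPLACEMENT does not behave like a regex, here is a small sketch that feeds the question's own PATTERN and REPLACEMENT to replaceAll, followed by a grouped variant. In a replacement string a backslash only escapes the next character and $ refers to a group; everything else is copied literally. The class ReplacementStringDemo, the helper replace, and the grouped pattern are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

public class ReplacementStringDemo {

    // Pattern and replacement copied from the question.
    public static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
    public static final String REPLACEMENT = "<img\\.*\\ssrc=\"";

    // Hypothetical helper wrapping Matcher.replaceAll.
    public static String replace(String regex, String replacement, String html) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String html = "<img src=\"./a.jpg\" />";

        // The replacement is emitted as text: "\." becomes ".", "\s" becomes "s".
        System.out.println(replace(PATTERN, REPLACEMENT, html));

        // Same pattern with the part to keep wrapped in a group and referenced as $1.
        System.out.println(replace("(<img\\.*\\ssrc=\")\\./", "$1", html));
    }
}
```

The first call yields a mangled tag starting with <img.*ssrc=", while the grouped variant keeps the original tag and only drops the ./.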

Hope this helps.

Answer 4

If the src attribute only occurs inside img tags in your HTML, you can simply do:

input = input.replace("src=\"./", "src=\"");
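
A brief sketch of why that condition matters: String.replace does a literal, non-regex substitution on every occurrence, so any other tag with a src attribute is rewritten too. The class LiteralSrcFix and the method stripDotSlash are hypothetical names used only for this illustration.

```java
public class LiteralSrcFix {

    // Hypothetical helper: literal replacement of every src="./ in the input.
    // Strings are immutable, so the result is returned rather than modified in place.
    public static String stripDotSlash(String html) {
        return html.replace("src=\"./", "src=\"");
    }

    public static void main(String[] args) {
        // The img tag is fixed as intended...
        System.out.println(stripDotSlash("<img src=\"./a.jpg\" />"));
        // ...but a script tag with the same prefix is changed as well.
        System.out.println(stripDotSlash("<script src=\"./x.js\"></script>"));
    }
}
```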

If you are on a *nix OS, you can also use sed to do this without Java.
