Question

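I want to replace one class name with another in an HTML string: class="abc" should become class="xyz". I tried using a regular expression (I'm using C#) but without success: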

const string input = @"abc class=""abcd abc zabc ab c"" abc";

Regex regex = new Regex(string.Format(@"class="".*(?({0})).*""", "abc")); // change this line ?!!

string output = regex.Replace(input, "xyz");

Assert.AreEqual(@"abc class=""abcd xyz zabc ab c"" abc", output);

PS: in case it matters, this is not homework :p

Answer 1

No wonder it didn't work. Parsing HTML can't be done using regexes.

You should use a proper HTML parser, such as the HTML Agility Pack.

Answer 2

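Parsing HTML with regular expressions tends to be a futile effort: because most browsers give malformed HTML a great deal of leeway, there is no guarantee that the HTML will be formed consistently enough for a regular expression to parse it easily (as svick also commented).

That said, you're better off using a formal HTML parser (I recommend the HTML Agility Pack), changing the attribute's value once the document has been parsed, and then writing out the modified document when needed.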
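The parse, modify, and write-out steps described above could look roughly like the following. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name ClassRenamer and the method RenameClass are hypothetical names chosen for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

public static class ClassRenamer
{
    // Replaces the class token oldName with newName on every element that has a class attribute.
    public static string RenameClass(string html, string oldName, string newName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//*[@class]");
        if (nodes == null) return html;

        foreach (var node in nodes)
        {
            // Split the attribute value into class tokens and swap exact matches only.
            var classes = node.GetAttributeValue("class", "")
                .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(c => c == oldName ? newName : c);
            node.SetAttributeValue("class", string.Join(" ", classes));
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Because the comparison is done per class token, abcd and zabc are left alone, and text between tags is never touched. Note that the input has to contain actual markup: the test string in the question has no tag around the class attribute, so a parser would treat it as plain text.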

Answer 3

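Is this a real HTML string? I mean, are you sure you are dealing with well-formed HTML? Could your string contain errors?

Depending on your answer to the question above, you can choose how to tackle the problem.

Yes: use the HTML Agility Pack or something similar, so that your string is parsed correctly;
No: consider using an XML parser (such as those built into the .NET assemblies). But make sure it suits your case (remember that XML is not HTML).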
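For the XML parser option, a sketch using System.Xml.Linq from the .NET base class library might look like this. It assumes the input is well-formed XML (for example XHTML); XmlClassRenamer and RenameClass are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlClassRenamer
{
    // Replaces the class token oldName with newName in every class attribute.
    public static string RenameClass(string xml, string oldName, string newName)
    {
        // XDocument.Parse throws an XmlException if the input is not well-formed XML.
        var doc = XDocument.Parse(xml, LoadOptions.PreserveWhitespace);

        // ToList() snapshots the attributes before their values are modified.
        foreach (var attr in doc.Descendants().Attributes("class").ToList())
        {
            var classes = attr.Value
                .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(c => c == oldName ? newName : c);
            attr.Value = string.Join(" ", classes);
        }

        // DisableFormatting keeps the serializer from re-indenting the markup.
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Typical HTML with unclosed tags or unquoted attributes will make the parse step throw, which is the practical meaning of "XML is not HTML" here.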

Whatever you choose, please: never use regular expressions to parse HTML.

Answer 4

I've done my best to answer this... A regex like the following could be used:

@"(?<=<[\w-]+\s+([\w-]+=""[^""]*""\s*)*class=""[^""]*)(?<![\w-])abc(?![\w-])(?=[^""]*""\s*([\w-]+=""[^""]*""\s*)*/?>)"

Broken down a bit:

(?<=<[\w-]+\s+([\w-]+=""[^""]*""\s*)*class=""[^""]*)  # make sure it's inside a tag's class attribute
(?<![\w-])abc(?![\w-])                                # just the class abc (not abcd, etc.)
(?=[^""]*""\s*([\w-]+=""[^""]*""\s*)*/?>)             # make sure it's really INSIDE a tag

Broken down further:

(?<=                           #lookbehind
   <[\w-]+\s+                  # match tag name and whitespace
   ([\w-]+=""[^""]*""\s*)*     # match any attributes coming before the class attribute
   class=""[^""]*              # match the class attribute and any other classes before
)                              #end lookbehind
(?<![\w-])abc(?![\w-])         #"abc" at appropriate boundaries
(?=                            #lookahead
   [^""]*""                    # match any remaining classes in the declaration
   \s*([\w-]+=""[^""]*""\s*)*  # match any remaining attributes in the tag
   /?>                         # match the end of the tag
)                              #end lookahead

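This matches the string abc inside any class attribute value within a tag (not in the text between tags), where the tag may have other attributes before or after the class attribute.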
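As a rough illustration of how this pattern might be applied from C#, here is a small sketch. The pattern is copied unchanged from this answer; TagClassRegex and ReplaceAbc are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class TagClassRegex
{
    // The pattern from this answer; in a verbatim string each double quote is written twice.
    private const string Pattern =
        @"(?<=<[\w-]+\s+([\w-]+=""[^""]*""\s*)*class=""[^""]*)(?<![\w-])abc(?![\w-])(?=[^""]*""\s*([\w-]+=""[^""]*""\s*)*/?>)";

    // Replaces the class token abc inside opening tags with the given replacement.
    public static string ReplaceAbc(string html, string replacement)
    {
        // A MatchEvaluator is used so that '$' in the replacement is not treated as a group reference.
        return Regex.Replace(html, Pattern, m => replacement);
    }
}
```

Because the lookbehind requires an opening tag, the test string from the question (which has no `<tag` in front of the class attribute) is returned unchanged by this sketch.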

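Caveats!

It ONLY HANDLES attribute values enclosed in double quotes (").

It only allows underscores, letters, digits, and dashes in tag and attribute names; you will need to add colons and periods if you need them (and, if you want, restrict it to names STARTING with a letter).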

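EDIT: as noted in a comment somewhere here, in addition to abc it would also match the abc in abc-1 or not-abc, thereby turning <p class="abc-1 abc not-abc">text</p> into <p class="xyz-1 xyz not-xyz">text</p>, because \b matches next to the dash character... this would be very hard to account for! FOLLOW-UP: I added an extra lookahead and lookbehind to hopefully account for the dash, but who knows... END EDITS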

Besides, there are other cases that can break this...

In short: it is best not to use it, and to use something like the HTML Agility Pack instead. Good luck!

Answer 5

I'm not sure about the C# version of this regex, but here is how it can be done in Ruby:

regex = / class="[^"]*"/i

input.gsub( regex, ' class="abc"' )

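This replaces each class specifier in the input with class="abc" (gsub replaces every match; use sub to replace only the first one). It assumes there is no whitespace around the equals sign, but it accepts upper or lower case.

I think C# describes regexes in much the same way; you may have to escape the double quotes.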
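A possible C# counterpart of the Ruby snippet is sketched below. Like the Ruby version, it overwrites the whole class attribute value rather than a single class name. ClassAttributeOverwriter and its methods are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class ClassAttributeOverwriter
{
    // Same pattern as the Ruby regex; the verbatim string doubles each quote,
    // and IgnoreCase plays the role of Ruby's /i flag.
    private static readonly Regex ClassAttr =
        new Regex(@" class=""[^""]*""", RegexOptions.IgnoreCase);

    // Overwrites every class attribute, like Ruby's gsub.
    public static string OverwriteAll(string input, string newValue)
    {
        return ClassAttr.Replace(input, m => " class=\"" + newValue + "\"");
    }

    // Overwrites only the first class attribute, like Ruby's sub.
    public static string OverwriteFirst(string input, string newValue)
    {
        return ClassAttr.Replace(input, m => " class=\"" + newValue + "\"", 1);
    }
}
```

The replacement text is a literal, so an attribute written as CLASS="..." comes back as lowercase class="...".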

Are you looking for something more specific? For example, a method that takes two inputs (s1 and s2) and replaces class "s1" with class "s2"?

Answer 6

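Clearly, Regex is unlikely to be your best option when working with XML. You will probably get more consistent results if you try what the others suggest. Meanwhile, if you really want some Regex, here it is: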

const string input = @"abc class=""abcd abc zabc ab c"" abc"; 

Regex regex = new Regex(string.Format(@"(?<=class\=""[^""]*\b){0}\b", "abc")); // I changed this line ?!! 

string output = regex.Replace(input, "xyz");

Assert.AreEqual(@"abc class=""abcd xyz zabc ab c"" abc", output);

To break it down:

(               #Start a group
    ?<=         #Positive lookbehind
    class\="    #Some characters to match against (without consuming)
    [^"]*       #Any other characters which are not "
                #This stops us from accidentally leaving the class attribute
    \b          #A word boundary (such as just after whitespace or a ")
)               #Close the lookbehind group
abc             #Your target
\b              #Another word boundary

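Note that the positive lookbehind means we check for class=" without making it part of the match. That is what we mean by "not consuming".

Note the use of word boundaries, \b, so that we don't accidentally match abcd.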
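One limitation of \b is that a dash counts as a word boundary, so the pattern above would also match the abc in abc-1. The sketch below is a variant, not the pattern from this answer: it swaps \b for explicit lookarounds and escapes the class name. ClassTokenReplacer and Replace are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class ClassTokenReplacer
{
    // Replaces oldName with newName inside a double-quoted class="..." value.
    public static string Replace(string input, string oldName, string newName)
    {
        string pattern =
            @"(?<=class=""[^""]*)"    // still inside a class="..." value (not consumed)
            + @"(?<![\w-])"           // not preceded by a word character or a dash
            + Regex.Escape(oldName)   // the class name, with regex metacharacters escaped
            + @"(?![\w-])";           // not followed by a word character or a dash

        // A MatchEvaluator avoids '$' in newName being read as a group reference.
        return Regex.Replace(input, pattern, m => newName);
    }
}
```

With the question's test string this yields the expected `abc class="abcd xyz zabc ab c" abc`. Like the original, it does not check that the attribute is actually inside a tag, and it only handles double-quoted values.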

Answer 7

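Disclaimer:

As others have pointed out, using regular expressions to parse a non-regular language is fraught with peril! It is best to use a dedicated parser designed for the job, especially when parsing HTML tag soup.

That said...

If you insist on using a regex, here is a regex solution that does a pretty good job: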

text = Regex.Replace(text, @"
    # Change HTML element class attribute value: 'abc' to: 'xyz'.
    (                   # $1: Everything up to 'abc'.
      <\w+              # Begin (X)HTML element open tag.
      (?:               # Match any attribute(s) preceding 'class'.
        \s+             # Whitespace required before each attribute.
        (?!class\b)     # Assert this attribute name is not 'class'.
        [\w\-.:]+       # Required attribute name.
        (?:             # Begin optional attribute value.
          \s*=\s*       # Attribute value separated by =.
          (?:           # Group for attrib value alternatives.
            ""[^""]*""  # Either a double quoted value,
          | '[^']*'     # or a single quoted value,
          | [\w\-.:]+   # or an unquoted value.
          )             # End group for attrib value alternatives.
        )?              # End optional attribute value.
      )*                # Zero or more attributes may precede class.
      \s+               # Whitespace required before class attribute.
      class             # Literal class attribute name.
      \s*=\s*           # Attribute value separated by =.
      (?:               # Group for attrib value alternatives.
        ""              # Either a double quoted value.
        [^""]*?         # Zero or more classes may precede 'abc'.
      | '               # Or a single quoted value.
        [^']*?          # Zero or more classes may precede 'abc'.
      )?                # Or 'abc' class attrib value is unquoted.
    )                   # End $1: Everything up to 'abc'.
    (?<=['""\s=])       # Assert 'abc' not part of '123-abc'.
    abc                 # Match the 'abc' in class attribute value.
    (?=['""\s>])        # Assert 'abc' not part of 'abc-123'.",
    "$1xyz", RegexOptions.IgnorePatternWhitespace);

Sample input:

class=abc ... class="abc" ... class='abc'
class = abc ... class = "abc" ... class = 'abc'
class="123 abc 456" ... class='123 abc 456'
class="123-abc abc 456-abc" ... class='123-abc abc 456-abc'
class="abc-123 abc abc-456" ... class='abc-123 abc abc-456'

Sample output:

class=xyz ... class="xyz" ... class='xyz'
class = xyz ... class = "xyz" ... class = 'xyz'
class="123 xyz 456" ... class='123 xyz 456'
class="123-abc xyz 456-abc" ... class='123-abc xyz 456-abc'
class="abc-123 xyz abc-456" ... class='abc-123 xyz abc-456'

Note: the pattern begins with <\w+, so each attribute above has to sit inside an opening tag (for example <div class=abc>) for the replacement to apply.

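Note that there will always be edge cases for this solution. For example, CDATA sections, comments, scripts, styles, and evil strings inside tag attribute values can trip it up. (See the disclaimer above.) That said, this solution will do a pretty good job in many cases (but will never be 100% reliable!).

Edit: 2011-10-10 14:00 MDT. Simplified the overall answer. Removed the first regex solution. Revised to correctly ignore classes with similar names, such as abc-123 and 123-abc.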
