Regular expression for XML tags that contain angle brackets inside

Time: 2016-04-07 12:33:08

Tags: java regex xml

I need a regular expression that gives me an XML tag, e.g. <ABC/>, <ABC>, </ABC>.

If I use <(.)+?>, it gives me <ABC/>, <ABC>, </ABC>. That works fine.

Now, the problem:

I have this XML:

<VALUE ABC="10000" PQR="12422700" ADJ="" PROD_TYPE="COCOG EFI LWL P&amp;C >1Y-5Y" SRC="BASE" DATA="data" ACTION="INSERT" ID="100000" GRC_PROD=""/>

Here, as you can see, PROD_TYPE="COCOG EFI LWL P&amp;C >1Y-5Y" has a greater-than sign inside the attribute value.

So the regex returns

<VALUE ABC="10000" PQR="12422700" ADJ="" PROD_TYPE="COCOG EFI LWL P&amp;C >

instead of the complete

<VALUE ABC="10000" PQR="12422700" ADJ="" PROD_TYPE="COCOG EFI LWL P&amp;C >1Y-5Y" SRC="BASE" DATA="data" ACTION="INSERT" ID="100000" GRC_PROD=""/>

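I need a regex that does not treat less-than and greater-than signs as tag delimiters when they are part of a value, i.e. enclosed in double quotes.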
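
The truncation described above can be reproduced with a minimal Java sketch. The class name, the helper method, and the shortened input are illustrative assumptions, not part of the original post:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyTagMatchDemo {
    // The pattern from the question: lazily consumes characters up to the first '>'.
    static final Pattern NAIVE_TAG = Pattern.compile("<(.)+?>");

    // Returns the first match in the input, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = NAIVE_TAG.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Shortened version of the sample element (hypothetical input).
        String xml = "<VALUE ABC=\"10000\" PROD_TYPE=\"COCOG EFI LWL P&amp;C >1Y-5Y\" SRC=\"BASE\"/>";
        // The match ends at the '>' inside the PROD_TYPE value, not at the end of the tag.
        System.out.println(firstMatch(xml));
    }
}
```

The lazy quantifier stops at the first '>' it can reach, and nothing in the pattern knows that this '>' sits inside a quoted value.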

3 answers:

Answer 0: (score: 1)

You can try this:

(?i)<[a-z][\w:-]+(?: [a-z][\w:-]+="[^"]*")*/?>

Explanation:

(?i)         # Match the remainder of the regex with the options: case insensitive (i)
<            # Match the character “<” literally
[a-z]        # Match a single character in the range between “a” and “z”
[\w:-]       # Match a single character present in the list below
                # A word character (letters, digits, and underscores)
                # The character “:”
                # The character “-”
   +            # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?:          # Match the regular expression below (non-capturing group)
   \            # Match the character “ ” (a single space) literally
   [a-z]        # Match a single character in the range between “a” and “z”
   [\w:-]       # Match a single character present in the list below
                   # A word character (letters, digits, and underscores)
                   # The character “:”
                   # The character “-”
      +            # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   ="           # Match the characters “="” literally
   [^"]         # Match any character that is NOT a “"”
      *            # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   "            # Match the character “"” literally
)*           # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
/            # Match the character “/” literally
   ?            # Between zero and one times, as many times as possible, giving back as needed (greedy)
>            # Match the character “>” literally
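
As a rough illustration of how this pattern could be applied, here is a minimal Java sketch that collects every tag it matches. The class name, the helper method, and the shortened sample input are assumptions made for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedAttributeTagFinder {
    // The pattern explained above, written as a Java string literal
    // (backslashes and double quotes are escaped for Java only).
    static final Pattern TAG = Pattern.compile(
        "(?i)<[a-z][\\w:-]+(?: [a-z][\\w:-]+=\"[^\"]*\")*/?>");

    // Returns every opening or self-closing tag found in the input, in order.
    static List<String> findTags(String input) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Shortened version of the sample element (hypothetical input).
        String xml = "<VALUE ABC=\"10000\" PROD_TYPE=\"COCOG EFI LWL P&amp;C >1Y-5Y\" SRC=\"BASE\"/>";
        // The '>' inside PROD_TYPE is consumed by [^"]*, so the whole tag is returned.
        System.out.println(findTags(xml));
    }
}
```

Because each attribute value is matched as [^"]*, a '>' inside double quotes can no longer end the tag. Note that [a-z][\w:-]+ needs at least two characters, so single-letter tag or attribute names are not matched, and closing tags such as </ABC> are skipped by this pattern.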

If you want to match open/close tag pairs as well as self-closed tags, try the following regex:

(?i)(?:<([a-z][\w:-]+)(?: [a-z][\w:-]+="[^"]*")*>.+?</\1>|<([a-z][\w:-]+)(?: [a-z][\w:-]+="[^"]*")*/>)

A Java code snippet implementing the same:

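// Requires: import java.util.regex.PatternSyntaxException;
// Note: String.matches() returns true only if the ENTIRE string matches the pattern.
boolean foundMatch = false;
try {
    foundMatch = subjectString.matches("(?i)(?:<([a-z][\\w:-]+)(?: [a-z][\\w:-]+=\"[^\"]*\")*>.+?</\\1>|<([a-z][\\w:-]+)(?: [a-z][\\w:-]+=\"[^\"]*\")*/>)");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}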
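
Since matches() tests the whole string, searching inside a larger text would need Matcher.find() instead. A minimal sketch of that variant follows; the class name, helper method, and sample input are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ElementRegexDemo {
    // Same regex as above: either <name ...>content</name> or <name .../>.
    static final Pattern ELEMENT = Pattern.compile(
        "(?i)(?:<([a-z][\\w:-]+)(?: [a-z][\\w:-]+=\"[^\"]*\")*>.+?</\\1>"
        + "|<([a-z][\\w:-]+)(?: [a-z][\\w:-]+=\"[^\"]*\")*/>)");

    // Collects every element the pattern finds anywhere in the input.
    static List<String> findElements(String input) {
        List<String> elements = new ArrayList<>();
        Matcher m = ELEMENT.matcher(input);
        while (m.find()) {
            elements.add(m.group());
        }
        return elements;
    }

    public static void main(String[] args) {
        // Hypothetical input: a self-closed element, some text, then an open/close pair.
        String text = "<VALUE PROD_TYPE=\"P&amp;C >1Y-5Y\" ID=\"100000\"/> and <NAME LANG=\"en\">text</NAME>";
        System.out.println(findElements(text));              // the two elements, one per entry
        System.out.println(ELEMENT.matcher(text).matches()); // false: the text is not a single element
    }
}
```

The design choice here is only find() versus matches(); the pattern itself is unchanged.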

Hope this helps...

Answer 1: (score: 1)

To expand on the point G_H linked to: Don't use regex to parse XML. Use XPath to return the node, and pass that node to an identity Transformer:

Node valueElement = (Node)
    XPathFactory.newInstance().newXPath().evaluate("//VALUE",
        new InputSource(new StringReader(xmlDocument)),
        XPathConstants.NODE);

StringWriter result = new StringWriter();
TransformerFactory.newInstance().newTransformer().transform(
    new DOMSource(valueElement), new StreamResult(result));

String valueElementMarkup = result.toString();
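
For completeness, a self-contained sketch of the same approach with the imports it needs. The class name, the helper method, the sample document, and the choice to omit the XML declaration are assumptions added for illustration and are not part of the snippet above:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ValueElementExtractor {
    // Returns the markup of the first node selected by the XPath expression, or null if none.
    static String extractFirst(String xmlDocument, String xpathExpression) {
        try {
            Node node = (Node) XPathFactory.newInstance().newXPath().evaluate(
                xpathExpression,
                new InputSource(new StringReader(xmlDocument)),
                XPathConstants.NODE);
            if (node == null) {
                return null;
            }
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            // Assumption: we want only the element, without a leading <?xml ...?> line.
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter result = new StringWriter();
            identity.transform(new DOMSource(node), new StreamResult(result));
            return result.toString();
        } catch (XPathExpressionException | TransformerException e) {
            throw new IllegalStateException("Could not extract node", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical wrapper document around a shortened VALUE element.
        String xml = "<ROOT><VALUE PROD_TYPE=\"COCOG EFI LWL P&amp;C >1Y-5Y\" ID=\"100000\"/></ROOT>";
        System.out.println(extractFirst(xml, "//VALUE"));
    }
}
```

The serializer re-emits the element, so attribute order and character escaping in the output may differ from the input text even though the content is the same.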

Answer 2: (score: 0)

Also try this:

<.*?(".*?".*?)*?>

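It grabs everything between < and > only when there is an even number of " double quotes in between. Paired double quotes mark enclosed content. Otherwise it skips the > symbol and continues searching for the next > (which should occur after the closing " quote).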
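
A minimal Java sketch of this pattern applied to a shortened version of the sample element. The class name, helper method, and input are illustrative assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotePairTagDemo {
    // The pattern from this answer, with double quotes escaped for a Java string literal.
    static final Pattern TAG = Pattern.compile("<.*?(\".*?\".*?)*?>");

    // Returns the first match in the input, or null if there is none.
    static String firstTag(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Shortened version of the sample element (hypothetical input).
        String xml = "<VALUE ABC=\"10000\" PROD_TYPE=\"COCOG EFI LWL P&amp;C >1Y-5Y\" SRC=\"BASE\"/>";
        // Each quoted value is consumed as a pair, so the '>' inside PROD_TYPE is passed over.
        System.out.println(firstTag(xml));
    }
}
```

This relies on quotes being balanced within the tag, and since . does not match line terminators by default, a tag that spans several lines would need Pattern.DOTALL.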