Regular expression for matching XML nodes

Date: 2019-05-21 15:17:05

Tags: java regex regex-lookarounds regex-group regex-greedy

I have a series of repeated XML tags held in a String:

<Field name="foo" date="20170501">
   <Value type="foo">someVal</Value>
</Field>
<Field name="foo" date="20170501">
   <Value type="foo">someVal</Value>
</Field>

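I am trying to use a regular expression (Java) to extract the name attribute from each Field element and the actual value inside its Value node. Is this possible with a regex?

I have the following regex, which is close, but it does not stop at the first closing </Field> tag: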

\\<Field([^\\>]*)\\>(.+)\\</Field\\>

3 answers:

Answer 0 (score: 2)

As already mentioned, regex is not well suited to this task: it is less readable and less efficient. But anyway...

field.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Fields>
    <Field name="foo 1" date="20170501">
        <Value type="foo">someVal 1</Value>
    </Field>
    <Field name="foo 2" date="20170501">
        <Value type="foo">someVal 2</Value>
    </Field>
</Fields>

Solution 1: regex (the ugly but fun way...)

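import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

try {
    byte[] encoded = Files.readAllBytes(Paths.get("path/to/fields/xml/file.xml"));
    String content = new String(encoded, StandardCharsets.UTF_8);

    // [\s\S] matches any character, line breaks included; the lazy *? keeps each match inside one <Field>.
    // MULTILINE only changes ^ and $, which this pattern does not use, so CASE_INSENSITIVE is enough.
    Pattern pattern = Pattern.compile("<field[\\s\\S]*?name=\"(?<gName>[\\s\\S]*?)\"[\\s\\S]*?>[\\s\\S]*?<value\\b[\\s\\S]*?>(?<gVal>[\\s\\S]*?)</value>[\\s\\S]*?</field>", Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(content);

    // one iteration per <Field> entry
    while (matcher.find()) {
        String name = matcher.group("gName"); // named group 'gName' holds the value of the name attribute
        String value = matcher.group("gVal"); // named group 'gVal' holds the text content of the Value tag
        System.out.println(name + ": " + value);
    }
} catch (IOException e) {
    e.printStackTrace();
}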
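
For anyone who wants to try the regex approach without a file on disk, here is a minimal self-contained sketch. It is not the exact pattern from above: it swaps the `[\s\S]*?` runs for negated character classes (`[^>]*`, `[^"]*`) so a match cannot wander out of the opening tag. The class name `FieldRegexSketch` and the `extract` helper are made up for illustration, and it assumes each Field has a single Value child and no nested Field elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldRegexSketch {
    // [^>]* stays inside one tag, [^"]* stays inside one attribute value,
    // and the lazy .*? stops at the first closing </Value>.
    private static final Pattern FIELD = Pattern.compile(
        "<field\\b[^>]*?\\bname=\"(?<gName>[^\"]*)\"[^>]*>\\s*"
        + "<value\\b[^>]*>(?<gVal>.*?)</value>\\s*</field>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one "name=value" entry per matched <Field> element.
    static List<String> extract(String xml) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(xml);
        while (m.find()) {
            out.add(m.group("gName") + "=" + m.group("gVal"));
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<Field name=\"foo 1\" date=\"20170501\">\n"
            + "    <Value type=\"foo\">someVal 1</Value>\n"
            + "</Field>\n"
            + "<Field name=\"foo 2\" date=\"20170501\">\n"
            + "    <Value type=\"foo\">someVal 2</Value>\n"
            + "</Field>\n";
        System.out.println(extract(xml)); // [foo 1=someVal 1, foo 2=someVal 2]
    }
}
```

It still breaks on comments, CDATA sections, single-quoted attributes or a `>` inside an attribute value, which is why the XPath route is the safer one.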

Solution 2: XPath (the correct but boring way...)

The Field class:

public class Field {
    private String name;
    private String value;

    // ... getter & setters ...

    @Override
    public String toString() {
        return String.format("Field { name: %s, value: %s }", this.name, this.value);
    }
}

The Boring class:

import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class Boring {
  public static void main(String[] args) {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder;
      Document doc = null;

      try {
          builder = factory.newDocumentBuilder();
          doc = builder.parse("path/to/fields/xml/file.xml");

          XPathFactory xpathFactory = XPathFactory.newInstance();

          // Create XPath object
          XPath xpath = xpathFactory.newXPath();

          List<Field> fields = getFields(doc, xpath);

          for (Field f : fields) {
            System.out.println(f);
          }

      } catch (Exception e) {
          e.printStackTrace();
      }
  }

  private static List<Field> getFields(Document doc, XPath xpath) {
      List<Field> list = new ArrayList<>();
      try {
          XPathExpression expr = xpath.compile("/Fields/*");

          NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
          for (int i = 0; i < nodes.getLength(); i++) {
              Node fieldNode = nodes.item(i);
              NodeList fieldNodeChildNodes = fieldNode.getChildNodes();

              Field field = new Field();
              // set name
              field.setName(fieldNode.getAttributes().getNamedItem("name").getNodeValue());

              for (int j = 0; j < fieldNodeChildNodes.getLength(); j++) {
                  if ("Value".equals(fieldNodeChildNodes.item(j).getNodeName())) {
                      // set value
                      field.setValue(fieldNodeChildNodes.item(j).getTextContent());
                      break;
                  }
              }
              list.add(field);
          }
      } catch (XPathExpressionException e) {
          e.printStackTrace();
      }
      return list;
  }
}

Output:

Field { name: foo 1, value: someVal 1 }
Field { name: foo 2, value: someVal 2 }
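
If the inner loop over child nodes feels heavy, the same result can probably be had by letting XPath do the lookup relative to each Field node. The sketch below assumes the same field.xml layout as above, passed in as a String to keep it self-contained. The class name `CompactXPathSketch` and the `readFields` helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class CompactXPathSketch {
    // Returns one "name=value" entry per /Fields/Field element.
    static List<String> readFields(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        NodeList fields = (NodeList) xpath.evaluate("/Fields/Field", doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < fields.getLength(); i++) {
            Node field = fields.item(i);
            // Relative expressions are evaluated against the current Field node.
            String name = xpath.evaluate("@name", field);
            String value = xpath.evaluate("Value", field);
            out.add(name + "=" + value);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<Fields>\n"
            + "    <Field name=\"foo 1\" date=\"20170501\">\n"
            + "        <Value type=\"foo\">someVal 1</Value>\n"
            + "    </Field>\n"
            + "    <Field name=\"foo 2\" date=\"20170501\">\n"
            + "        <Value type=\"foo\">someVal 2</Value>\n"
            + "    </Field>\n"
            + "</Fields>\n";
        System.out.println(readFields(xml)); // [foo 1=someVal 1, foo 2=someVal 2]
    }
}
```

The two-argument `xpath.evaluate(expression, node)` returns the string value of the first matching node, so a Field without a Value child yields an empty string rather than an exception.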

Answer 1 (score: 1)

Using a regex here is probably not the best idea. However, if you want to, we can try adding optional capture groups and collect the data we need:

<field name="(.+?)"(.+\s*)?<value.+?>(.+?)<\/value>(\s*)?<\/field>

We can use the i (case-insensitive) flag here.
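
The lazy quantifiers (`.+?`) in the expression above are what keep each match inside a single Field. The pattern in the question uses a greedy `(.+)`, which backtracks from the end of the input and so stops at the last `</Field>` instead of the first. Here is a small sketch of the difference. The class name `GreedyVsLazySketch` and the `countMatches` helper are made up for illustration, and DOTALL is switched on so that `.` can cross line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazySketch {
    // Counts how many separate matches the regex finds in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "<Field name=\"a\"><Value>1</Value></Field>"
                     + "<Field name=\"b\"><Value>2</Value></Field>";

        // Greedy (.+) swallows both elements in a single match.
        System.out.println(countMatches("<Field([^>]*)>(.+)</Field>", input));  // 1

        // Lazy (.+?) stops at the first closing tag, giving one match per element.
        System.out.println(countMatches("<Field([^>]*)>(.+?)</Field>", input)); // 2
    }
}
```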

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = "<field name=\"(.+?)\"(.+\\s*)?<value.+?>(.+?)<\\/value>(\\s*)?<\\/field>";
final String string = "<Field name=\"foo\" date=\"20170501\">\n"
     + "   <Value type=\"foo\">someVal</Value>\n"
     + "</Field>\n"
     + "<Field name=\"foo\" date=\"20170501\">\n"
     + "   <Value type=\"foo\">someVal</Value>\n"
     + "</Field>\n"
     + "<Field name=\"foo\" date=\"20170501\"><Value type=\"foo\">someVal</Value></Field>\n";
final String subst = "$1: $3"; // Java replacement strings reference groups with $n, not \n

final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
final Matcher matcher = pattern.matcher(string);

// The substituted value will be contained in the result variable
final String result = matcher.replaceAll(subst);

System.out.println("Substitution result: " + result);

Demo

This snippet is only meant to show how the capture groups work:

const regex = /<field name="(.+?)"(.+\s*)?<value.+?>(.+?)<\/value>(\s*)?<\/field>/gmi;
const str = `<Field name="foo" date="20170501">
   <Value type="foo">someVal</Value>
</Field>
<Field name="foo" date="20170501">
   <Value type="foo">someVal</Value>
</Field>
<Field name="foo" date="20170501"><Value type="foo">someVal</Value></Field>
`;
const subst = `$1: $3`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);

RegEx

If this expression isn't what you need, it can be modified or changed at regex101.com.

RegEx Circuit

jex.im also helps to visualize the expression.

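Answer 2 (score: 0)

When working with XML, regex is not the right tool for searching it; XPath exists for exactly this problem. A regex can be made to work, but I would not recommend it.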
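
One catch with the String in the question: it holds several top-level Field elements, so it is not a well-formed document on its own and a parser will reject it. A possible workaround, sketched below, is to wrap the fragment in a synthetic root element before parsing. The `<root>` wrapper, the class name `WrappedFragmentSketch` and the `select` helper are assumptions made for illustration, and the fragment must not start with an XML declaration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WrappedFragmentSketch {
    // Wraps the fragment in <root>, then returns the text of every node the XPath expression selects.
    static List<String> select(String fragment, String expression) throws Exception {
        String wrapped = "<root>" + fragment + "</root>";
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(wrapped)));

        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);

        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String fragment = "<Field name=\"foo\" date=\"20170501\">\n"
            + "   <Value type=\"foo\">someVal</Value>\n"
            + "</Field>\n"
            + "<Field name=\"foo\" date=\"20170501\">\n"
            + "   <Value type=\"foo\">someVal</Value>\n"
            + "</Field>\n";
        System.out.println(select(fragment, "/root/Field/@name")); // [foo, foo]
        System.out.println(select(fragment, "/root/Field/Value")); // [someVal, someVal]
    }
}
```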

You can learn XPath in a few hours; here is a link for learning it.

Good luck.