Question

I have a series of repeated XML tags held in a String:

<Field name="foo" date="20170501">
   <Value type="foo">someVal</Value>
</Field>
<Field name="foo" date="20170501">
   <Value type="foo">someVal</Value>
</Field>

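I am trying to use a regular expression (Java) to extract the name attribute from each Field and the actual value inside its Value node. Is this possible with regex?

I have the following regex, which comes close, but it does not stop at the first closing </Field> tag: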

\\<Field([^\\>]*)\\>(.+)\\</Field\\>

Answer 1

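As already mentioned, regex is a poor fit for this task because it is less readable and less efficient than a real XML parser. But anyway...

field.xml: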

<?xml version="1.0" encoding="UTF-8"?>
<Fields>
    <Field name="foo 1" date="20170501">
        <Value type="foo">someVal 1</Value>
    </Field>
    <Field name="foo 2" date="20170501">
        <Value type="foo">someVal 2</Value>
    </Field>
</Fields>

Solution 1: Regex (the ugly but fun way...)

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

try {
    byte[] encoded = Files.readAllBytes(Paths.get("path/to/fields/xml/file.xml"));
    String content = new String(encoded, StandardCharsets.UTF_8);

    // [\s\S]*? is reluctant and also spans line breaks; \b after "field" keeps the
    // pattern from starting at the <Fields> root. MULTILINE is not needed (no ^ or $).
    Pattern pattern = Pattern.compile("<field\\b[\\s\\S]*?name=\"(?<gName>[\\s\\S]*?)\"[\\s\\S]*?>[\\s\\S]*?<value\\b[\\s\\S]*?>(?<gVal>[\\s\\S]*?)</value>[\\s\\S]*?</field>", Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(content);

    // one iteration per <Field> entry
    while (matcher.find()) {
        String name = matcher.group("gName");  // value of the name attribute
        String value = matcher.group("gVal");  // text content of the Value tag
        System.out.println(name + ": " + value);
    }
} catch (IOException e) {
    e.printStackTrace();
}
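
The reason the pattern in the question runs past the first </Field> is that `.+` is greedy: once `.` is allowed to cross line breaks (or the input is on one line), it consumes as much as it can and backtracks only to the last closing tag. A minimal sketch of the difference, assuming `Pattern.DOTALL` and a hypothetical helper `countMatches` that is not part of the original answer:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Counts how many times the regex matches; DOTALL lets '.' match line breaks.
    public static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<Field name=\"foo\" date=\"20170501\">\n"
                   + "   <Value type=\"foo\">someVal</Value>\n"
                   + "</Field>\n"
                   + "<Field name=\"foo\" date=\"20170501\">\n"
                   + "   <Value type=\"foo\">someVal</Value>\n"
                   + "</Field>";

        // Greedy: one match that swallows both Field elements.
        System.out.println(countMatches("<Field([^>]*)>(.+)</Field>", xml));   // prints 1
        // Reluctant: stops at the first </Field>, so each element matches separately.
        System.out.println(countMatches("<Field([^>]*)>(.+?)</Field>", xml));  // prints 2
    }
}
```

Switching `.+` to `.+?` is the smallest change to the question's pattern; it still assumes Field elements are never nested.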

Solution 2: XPath (the correct but boring way...)

The Field class:

public class Field {
    private String name;
    private String value;

    // ... getter & setters ...

    @Override
    public String toString() {
        return String.format("Field { name: %s, value: %s }", this.name, this.value);
    }
}

The Boring class:

import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class Boring {
  public static void main(String[] args) {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder;
      Document doc = null;

      try {
          builder = factory.newDocumentBuilder();
          doc = builder.parse("path/to/fields/xml/file.xml");

          XPathFactory xpathFactory = XPathFactory.newInstance();

          // Create XPath object
          XPath xpath = xpathFactory.newXPath();

          List<Field> fields = getFields(doc, xpath);

          for (Field f : fields) {
            System.out.println(f);
          }

      } catch (Exception e) {
          e.printStackTrace();
      }
  }

  private static List<Field> getFields(Document doc, XPath xpath) {
      List<Field> list = new ArrayList<>();
      try {
          XPathExpression expr = xpath.compile("/Fields/*");

          NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
          for (int i = 0; i < nodes.getLength(); i++) {
              Node fieldNode = nodes.item(i);
              NodeList fieldNodeChildNodes = fieldNode.getChildNodes();

              Field field = new Field();
              // set name
              field.setName(fieldNode.getAttributes().getNamedItem("name").getNodeValue());

              for (int j = 0; j < fieldNodeChildNodes.getLength(); j++) {
                  // compare strings with equals(); == only compares references
                  if ("Value".equals(fieldNodeChildNodes.item(j).getNodeName())) {
                      // set value
                      field.setValue(fieldNodeChildNodes.item(j).getTextContent());
                      break;
                  }
              }
              list.add(field);
          }
      } catch (XPathExpressionException e) {
          e.printStackTrace();
      }
      return list;
  }
}

Output:

Field { name: foo 1, value: someVal 1 }
Field { name: foo 2, value: someVal 2 }
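
Since the question starts from a String of sibling Field elements rather than a file with a single root, here is a rough sketch of how the same XPath idea could be applied directly to that String. It assumes the fragments carry no XML declaration, wraps them in a made-up <Fields> root so they parse as one document, and uses the relative expressions `@name` and `Value` instead of looping over child nodes. The class name `FieldXPath` and the method `extract` are hypothetical:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FieldXPath {
    // Returns one {name, value} pair per Field element found in the fragments.
    public static List<String[]> extract(String fragments) {
        try {
            // Assumption: no XML declaration in the input, so wrapping in a root is safe.
            String xml = "<Fields>" + fragments + "</Fields>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList fields = (NodeList) xpath.evaluate("/Fields/Field", doc, XPathConstants.NODESET);

            List<String[]> pairs = new ArrayList<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Node field = fields.item(i);
                String name = xpath.evaluate("@name", field);   // attribute of this Field
                String value = xpath.evaluate("Value", field);  // text of the first Value child
                pairs.add(new String[] { name, value });
            }
            return pairs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the XML fragments", e);
        }
    }

    public static void main(String[] args) {
        String fragments = "<Field name=\"foo\" date=\"20170501\">\n"
                         + "   <Value type=\"foo\">someVal</Value>\n"
                         + "</Field>\n"
                         + "<Field name=\"foo\" date=\"20170501\">\n"
                         + "   <Value type=\"foo\">someVal</Value>\n"
                         + "</Field>";
        for (String[] pair : extract(fragments)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

A list of pairs is used instead of a map because the sample data repeats the same name, and a map keyed by name would silently drop entries.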

Answer 2

Using regex here is probably not the best idea. However, if you wish to, we can try adding optional capturing groups and collect the desired data:

<field name="(.+?)"(.+\s*)?<value.+?>(.+?)<\/value>(\s*)?<\/field>

We can use the i flag here, so that field and value also match Field and Value.

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = "<field name=\"(.+?)\"(.+\\s*)?<value.+?>(.+?)<\\/value>(\\s*)?<\\/field>";
final String string = "<Field name=\"foo\" date=\"20170501\">\n"
     + "   <Value type=\"foo\">someVal</Value>\n"
     + "</Field>\n"
     + "<Field name=\"foo\" date=\"20170501\">\n"
     + "   <Value type=\"foo\">someVal</Value>\n"
     + "</Field>\n"
     + "<Field name=\"foo\" date=\"20170501\"><Value type=\"foo\">someVal</Value></Field>\n";
// Java replacement strings reference groups with $n; "\\1" would insert a literal "1"
final String subst = "$1: $3";

final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
final Matcher matcher = pattern.matcher(string);

// The substituted value will be contained in the result variable
final String result = matcher.replaceAll(subst);

System.out.println("Substitution result: " + result);

Demo

This JavaScript snippet just illustrates how the capturing groups work:

const regex = /<field name="(.+?)"(.+\s*)?<value.+?>(.+?)<\/value>(\s*)?<\/field>/gmi;
const str = `<Field name="foo" date="20170501">
   <Value type="foo">someVal</Value>
</Field>
<Field name="foo" date="20170501">
   <Value type="foo">someVal</Value>
</Field>
<Field name="foo" date="20170501"><Value type="foo">someVal</Value></Field>
`;
const subst = `$1: $3`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);

RegEx

If this expression is not what you want, you can modify or change it at regex101.com.

RegEx Circuit

jex.im also helps to visualize expressions.

Answer 3

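When working with XML, regex is not the right tool for searching it; use XPath, which is designed for exactly this problem. Regex can be made to work for this purpose, but I would not recommend it.

You can learn XPath in a few hours; here is a link for learning it.

Good luck.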
