XML: get the text around a tag

Date: 2014-05-12 08:35:09

Tags: java xml dom xml-parsing

I have an XML document with the following structure, and I want to retrieve the text to the left and right of a tag (using Java + dom4j):

   <article>
    <article-meta></article-meta>
    <body>
     <p> 
     Extensible Markup Language (XML) is a markup language that defines a set of
     rules for encoding documents in a format that is both human-readable and machine-
     readable <ref id="1">1</ref>. It is defined in the XML 1.0 Specification produced
      by the W3C, and several other related specifications
      </p>
      <p>
       Many application programming interfaces (APIs) have been developed to aid 
      software developers with processing XML <ref id="2">2</ref>. data, and several schema 
       systems exist to aid in the definition of XML-based languages.
      </p>
    </body>
    </article>

I want to retrieve the text surrounding the tag. For example, for this XML, the result for

 <ref id="1">1</ref>

would be, Left: human-readable and machine-readable

Right: It is defined in the XML 1.0 Specification

1 answer:

Answer 0: (score: 0)

Try this:

import java.util.List;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;

public class TestDom4j {

    public static Document getDocument(final String xmlFileName) {
        Document document = null;
        SAXReader reader = new SAXReader();
        try {
            document = reader.read(xmlFileName);
        } catch (DocumentException e) {
            e.printStackTrace();
        }
        return document;
    }

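    /**
     * Prints the text to the left and right of the first ref element in each paragraph.
     *
     * @param args unused
     */
    public static void main(String[] args) {
        String xmlFileName = "data.xml";
        String xPath = "//article/body/p";
        Document document = getDocument(xmlFileName);
        if (document == null) {
            return; // parsing failed; getDocument already printed the stack trace
        }
        List<Node> nodes = document.selectNodes(xPath);
        for (Node node : nodes) {
            String nodeXml = node.asXML();
            int refStart = nodeXml.indexOf("<ref");
            int refEnd = nodeXml.indexOf("</ref>");
            if (refStart < 0 || refEnd < 0) {
                continue; // this paragraph contains no ref element
            }
            // Skip the leading "<p>" (3 characters) and the trailing "</p>" (4 characters);
            // "</ref>" itself is 6 characters long.
            System.out.println("Left  >> " + nodeXml.substring(3, refStart).trim());
            System.out.println("Right  >> " + nodeXml.substring(refEnd + 6, nodeXml.length() - 4).trim());
        }
    }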
}
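
The substring approach above only reports the first `<ref>` in each paragraph and relies on the exact serialized form of `<p>`. As an alternative, here is a rough sketch that reads the text nodes sitting directly before and after every `<ref>` element. It uses the JDK's built-in W3C DOM parser instead of dom4j, purely so that it runs without extra dependencies. The class name `RefContext` and the method `textAroundRefs` are made up for illustration. It assumes the XML is well-formed (attribute values quoted, e.g. `id="1"`), and it returns the whole neighbouring text node, not just the nearest sentence fragment.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of dom4j or the original answer.
final class RefContext {

    /** Returns one {left, right} pair per ref element, in document order. */
    static List<String[]> textAroundRefs(String xml) {
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        // Merge adjacent text nodes so each sibling holds the full run of text.
        doc.getDocumentElement().normalize();

        List<String[]> result = new ArrayList<>();
        NodeList refs = doc.getElementsByTagName("ref");
        for (int i = 0; i < refs.getLength(); i++) {
            Node ref = refs.item(i);
            result.add(new String[] {
                textOf(ref.getPreviousSibling()),
                textOf(ref.getNextSibling())
            });
        }
        return result;
    }

    /** Text of a sibling if it is a text node, with whitespace runs collapsed; otherwise "". */
    private static String textOf(Node sibling) {
        if (sibling == null || sibling.getNodeType() != Node.TEXT_NODE) {
            return "";
        }
        return sibling.getNodeValue().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String xml = "<p>both human-readable and machine-readable "
                + "<ref id=\"1\">1</ref>. It is defined in the XML 1.0 Specification</p>";
        for (String[] pair : textAroundRefs(xml)) {
            System.out.println("Left  >> " + pair[0]);
            System.out.println("Right >> " + pair[1]);
        }
    }
}
```

Because the siblings are read from the parsed tree, this should also cope with several `<ref>` elements in one paragraph. With dom4j the same idea would mean walking the parent element's content list, which is not shown here.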