Getting elements by tag name when parsing XML, excluding children of certain parents

Time: 2018-05-29 19:29:26

Tags: java xml

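I have an XML file that I am parsing. Some tag names occur more than once, but under different parent elements. I know which parent's children I want to ignore. How can I do that?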

 <sub-article id="S01" article-type="translation" xml:lang="pt">
  <front-stub>
     <article-categories>
        <subj-group subj-group-type="heading">
           <subject>Artigos Originais</subject>
        </subj-group>
     </article-categories>
     <title-group>
        <article-title>
           Prevalência de deficiência nutricional em pacientes com
            tuberculose pulmonar
           <xref ref-type="fn" rid="fn02">*</xref>
        </article-title>
     </title-group>
   </front-stub>
 </sub-article>        
    .....
    .....
 <article-meta>
     <article-id pub-id-type="pmid">24068270</article-id>
     <article-id pub-id-type="pmc">4075858</article-id>
     <article-id pub-id-type="publisher-id">S1806-37132013000400012</article-id>
     <article-id pub-id-type="doi">10.1590/S1806-37132013000400012</article-id>
     <article-categories>
        <subj-group subj-group-type="heading">
           <subject>Original Articles</subject>
        </subj-group>
     </article-categories>
     <title-group>
        <article-title>
           Prevalence of nutritional deficiency in patients with
           pulmonary tuberculosis
           <xref ref-type="fn" rid="fn01">*</xref>
        </article-title>
     </title-group>
 </article-meta>

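In this example, I don't want to process the children under the sub-article tag. So "article-title" should be processed only for "Prevalence of nutritional deficiency in patients with pulmonary tuberculosis", not for "Prevalência de deficiência nutricional em pacientes com tuberculose pulmonar".

I am currently using the following code, which returns every node named "title-group". How can I make it more specific, so that I don't get the nodes under a particular parent?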

NodeList titleNodeList = document.getElementsByTagName("title-group");

2 Answers:

Answer 0: (score: 1)

Just search for the "title-group" nodes under the "sub-article" nodes:

List<Node> allTitleGroupNodes = new ArrayList<>();
NodeList subArticleNodes = document.getElementsByTagName("sub-article");
for (int i = 0; i < subArticleNodes.getLength(); i++) {
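    // item(i) returns a Node; getElementsByTagName is declared on org.w3c.dom.Element, so cast first
    NodeList titleNodes = ((Element) subArticleNodes.item(i)).getElementsByTagName("title-group");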
    for (int j = 0; j < titleNodes.getLength(); j++) {
        allTitleGroupNodes.add(titleNodes.item(j));
    }
}
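
Note that the loop above gathers the "title-group" nodes that sit under "sub-article". If the goal is the inverse, as the question describes, one possible approach is to fetch every "title-group" and drop those that have a "sub-article" ancestor. The following is only a minimal sketch: the class name `AncestorFilterDemo`, the helper names, and the trimmed-down `SAMPLE` document (with an assumed `<article>` root and placeholder titles) are hypothetical and not part of the original answer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AncestorFilterDemo {
    // Hypothetical, well-formed stand-in for the XML in the question (assumed root: <article>).
    static final String SAMPLE =
        "<article>"
      + "<sub-article id=\"S01\"><front-stub><title-group>"
      + "<article-title>Portuguese title</article-title>"
      + "</title-group></front-stub></sub-article>"
      + "<article-meta><title-group>"
      + "<article-title>English title</article-title>"
      + "</title-group></article-meta>"
      + "</article>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns true if any ancestor element of the node carries the given tag name.
    static boolean hasAncestor(Node node, String tagName) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (p.getNodeType() == Node.ELEMENT_NODE && tagName.equals(p.getNodeName())) {
                return true;
            }
        }
        return false;
    }

    // Collects all elements named tagName that are NOT nested under excludedAncestor.
    static List<Element> elementsOutside(Document doc, String tagName, String excludedAncestor) {
        List<Element> result = new ArrayList<>();
        NodeList all = doc.getElementsByTagName(tagName);
        for (int i = 0; i < all.getLength(); i++) {
            Node n = all.item(i);
            if (!hasAncestor(n, excludedAncestor)) {
                result.add((Element) n);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        for (Element e : elementsOutside(doc, "title-group", "sub-article")) {
            System.out.println(e.getTextContent().trim()); // prints "English title"
        }
    }
}
```

Walking up the parent chain keeps the check independent of how deeply the excluded parent is nested, at the cost of one upward traversal per candidate node.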

(Aside: the awful interface of NodeList is one of the things I hate most about handling XML in standard Java.)

Answer 1: (score: 1)

There are two ways to achieve this with XPath:

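  1. Include: select by the target element name <article-meta>
  2. Exclude: filter out the unwanted element name <sub-article>

    I personally prefer the first one, because it is more explicit and holds up better across different XML files.

    Solution 1: include

    Use XPath to select only the elements under <article-meta>: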

    //article-meta//title-group
    

    Java:

    XPath xPath = XPathFactory.newInstance().newXPath();
    XPathExpression expr = xPath.compile("//article-meta//title-group");
    NodeList titleNodes = (NodeList) expr.evaluate(document, XPathConstants.NODESET);
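
For anyone who wants to try the inclusion expression end to end, here is a minimal, self-contained sketch. The class name `IncludeXPathDemo`, the `titles` helper, and the reduced `SAMPLE` document (placeholder titles, assumed `<article>` root) are hypothetical and only stand in for the real file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class IncludeXPathDemo {
    // Hypothetical, well-formed stand-in for the XML in the question.
    static final String SAMPLE =
        "<article>"
      + "<sub-article id=\"S01\"><front-stub><title-group>"
      + "<article-title>Portuguese title</article-title>"
      + "</title-group></front-stub></sub-article>"
      + "<article-meta><title-group>"
      + "<article-title>English title</article-title>"
      + "</title-group></article-meta>"
      + "</article>";

    // Evaluates the XPath expression and returns the trimmed text of each matched node.
    static List<String> titles(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Only the title-group under <article-meta> is selected.
        System.out.println(titles(SAMPLE, "//article-meta//title-group")); // prints [English title]
    }
}
```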
    

    Solution 2: exclude

    Use XPath to exclude the elements that sit under <sub-article>. I am assuming the XML root element is <article> (adjust the code if that is not the case):

    /article/*[not(self::sub-article)]//title-group
    

    Java:

    XPath xPath = XPathFactory.newInstance().newXPath();
    XPathExpression expr = xPath.compile("/article/*[not(self::sub-article)]//title-group");
    NodeList titleNodes = (NodeList) expr.evaluate(document, XPathConstants.NODESET);
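
The root-element assumption matters in practice. As a hedged illustration, the sketch below compares the root-anchored expression with an alternative that uses the standard XPath 1.0 `ancestor` axis, `//title-group[not(ancestor::sub-article)]`, which does not depend on the root name. This alternative is my own suggestion rather than part of the answer, and the class name `ExcludeXPathDemo`, the `<collection>` wrapper element, and the sample documents are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ExcludeXPathDemo {
    // Hypothetical sample whose root is <article>, as the answer assumes.
    static final String ROOTED =
        "<article>"
      + "<sub-article id=\"S01\"><front-stub><title-group>"
      + "<article-title>Portuguese title</article-title>"
      + "</title-group></front-stub></sub-article>"
      + "<article-meta><title-group>"
      + "<article-title>English title</article-title>"
      + "</title-group></article-meta>"
      + "</article>";

    // Same content wrapped in a hypothetical <collection> root, so <article> is no longer the root.
    static final String WRAPPED = "<collection>" + ROOTED + "</collection>";

    static final String ROOT_ANCHORED = "/article/*[not(self::sub-article)]//title-group";
    static final String ANCESTOR_BASED = "//title-group[not(ancestor::sub-article)]";

    // Evaluates the XPath expression and returns the trimmed text of each matched node.
    static List<String> titles(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titles(ROOTED, ROOT_ANCHORED));   // prints [English title]
        System.out.println(titles(WRAPPED, ROOT_ANCHORED));  // prints [] because the root is not <article>
        System.out.println(titles(WRAPPED, ANCESTOR_BASED)); // prints [English title]
    }
}
```

The ancestor-based form checks each candidate's own ancestry, so it keeps working when the document is wrapped in another element, whereas the root-anchored form has to be edited to match the actual root.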