Trying to extract specific tags/attributes from a huge xml document

Time: 2019-05-20 16:17:37

Tags: xml xmllint

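I'm trying to extract some specific data from a huge (> 1.5 million lines) xml document using xmllint on a Linux system, and I'm not comfortable with the xmllint syntax. I have been doing this very inefficiently with grep and awk, but I found that the system has the xmllint utility (which I've never used), and I figure that since the xml is well structured, there should be a way to access the data directly. I've included a snippet of the xml document, but after cutting it down, while it looks correct to me, it results in parser errors from xmllint. I assume that if you are proficient enough with xmllint to answer my question, you can easily figure out the parser error.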

Based on web searches, I have tried the following syntax:

echo 'cat //*/@index' | xmllint --shell stub.xml (which does return ALL of the "indexes")
and
test=$(xmllint --debug --xpath "//PTC/BPSETS/BPSET/BPS" stub.xml) (which does dump the entire BPS entry)
and
xmllint --xpath "string(//PTC/BPSETS/BPSET/@b95)" stub.xml (returns no values)

Here is the xml snippet as best as I can trim it down:

<?xml version="1.0" encoding="utf-8"?>
<PTC version="2.0" cls="2">
  <BPSETS>
    <BPSET define="b95">
      <BPS define="88lmax">
        <CRIT>
          <MNBS lmt="88" />
          <MXBS lmt="88" />
          <MXBT red="Y" />
        </CRIT>
        <PNS>
          <PN index="0" atv="1" bf="32203506">
            <AWD cpbt="390">
              <BUNS ptgp="bn38" bdx="38" fawd="1" />
              <BUNS ptgp="bn39" bdx="39" fawd="1" awby="38" />
            </AWD>
          </PN>
          <PN index="1" atv="1" bf="24237243">
            <AWD cpbt="390">
              <BUNS ptgp="bn38" bdx="38" fawd="1" />
              <BUNS ptgp="bn39" bdx="39" fawd="1" awby="38" />
            </AWD>
          </PN>
          <PN index="2" atv="1" bf="8136575">
            <AWD cpbt="390">
              <BUNS ptgp="bn38" bdx="38" fawd="1" />
              <BUNS ptgp="bn39" bdx="39" fawd="1" awby="38" />
            </AWD>
          </PN>
          <PN index="688" atv="1" bf="1183872">
            <AWD cpbt="50" />
          </PN>
        </PNS>
      </BPS>
      <BPS define="88l6">
        <CRIT>
          <MNBS lmt="88" />
          <MXBS lmt="88" />
          <MXBT lmt="6" />
          <MNBT lmt="6" />
        </CRIT>
        <PNS>
          <PN index="0" atv="1" bf="28073582">
            <AWD cpbt="150">
              <BUNS ptgp="bn38" bdx="38" fawd="1" />
              <BUNS ptgp="bn39" bdx="39" fawd="1" awby="38" />
            </AWD>
          </PN>
          <PN index="1" atv="1" bf="16686973">
            <AWD cpbt="150">
              <BUNS ptgp="bn38" bdx="38" fawd="1" />
              <BUNS ptgp="bn39" bdx="39" fawd="1" awby="38" />
            </AWD>
          </PN>
        </PNS>
      </BPS>
      <BPS define="88l4">
        <CRIT>
          <MNBS lmt="88" />
          <MXBS lmt="88" />
          <MXBT lmt="4" />
          <MNBT lmt="4" />
        </CRIT>
        <PNS>
          <PN index="0" atv="1" bf="31342257">
            <AWD cpbt="50">
              <BUNS ptgp="bn38" bdx="38" fawd="1" />
              <BUNS ptgp="bn39" bdx="39" fawd="1" awby="38" />
            </AWD>
          </PN>
          <PN index="1" atv="1" bf="13761775">
            <AWD cpbt="50">
              <BUNS ptgp="bn38" bdx="38" fawd="1" />
              <BUNS ptgp="bn39" bdx="39" fawd="1" awby="38" />
            </AWD>
          </PN>
        </PNS>
      </BPS>
      <BPS define="88l2">
        <CRIT>
          <MNBS lmt="88" />
          <MXBS lmt="88" />
          <MXBT lmt="2" />
          <MNBT lmt="2" />
        </CRIT>
        <PNS>
          <PN index="0" atv="1" bf="16291759">
            <AWD cpbt="10">
              <BUNS ptgp="bn38" bdx="38" fawd="1" />
              <BUNS ptgp="bn39" bdx="39" fawd="1" awby="38" />
            </AWD>
          </PN>
          <PN index="1" atv="1" bf="15032283">
            <AWD cpbt="10">
              <BUNS ptgp="bn38" bdx="38" fawd="1" />
              <BUNS ptgp="bn39" bdx="39" fawd="1" awby="38" />
            </AWD>
          </PN>
        </PNS>
      </BPS>
      <BPS define="88l1">
        <CRIT>
          <MNBS lmt="88" />
          <MXBS lmt="88" />
          <MXBT lmt="1" />
          <MNBT lmt="1" />
        </CRIT>
        <PNS>
          <PN index="0" atv="1" bf="33278739">
            <AWD>
              <BUNS ptgp="bn38" bdx="38" fawd="1" />
              <BUNS ptgp="bn39" bdx="39" fawd="1" awby="38" />
            </AWD>
          </PN>
          <PN index="1" atv="1" bf="7261567">
            <AWD>
              <BUNS ptgp="bn38" bdx="38" fawd="1" />
              <BUNS ptgp="bn39" bdx="39" fawd="1" awby="38" />
            </AWD>
          </PN>
          <PN index="896" atv="1" bf="101540">
            <AWD cpbt="10" />
          </PN>
          <PN index="897" atv="1" bf="3680792">
            <AWD cpbt="10" />
          </PN>
          <PN index="898" atv="1" bf="25776896">
            <AWD cpbt="10" />
          </PN>
        </PNS>
      </BPS>
    </BPSET>

    <BPSET define="b94" use="b95">
      <BPS define="88mx">
        <PNS>
          <PN index="422" atv="1" bf="11692089">
            <AWD cpbt="9000" />
          </PN>
          <PN index="424" atv="1" bf="12200338">
            <AWD cpbt="7200" />
          </PN>
          <PN index="427" atv="1" bf="24210225">
            <AWD cpbt="6000" />
          </PN>
       </PNS>
      <BPS>
    </BPSET>

  </BPSETS>
</PTC>





What I really need is a query that returns all the attributes contained in a specific element under a specific index, e.g.:


<!-- language: lang-xml -->

    <PTC version="2.0" cls="2">
      <PN index="0" atv="1" bf="32203506">
        <AWD cpbt="390">
          <BUNS ptgp="bn38" bdx="38" fawd="1" />
          <BUNS ptgp="bn39" bdx="39" fawd="1" awby="38" />
        </AWD>
      </PN>


That is, a query that, given a PN index value (e.g. 0), would return the values of bf and cpbt.

If it were an SQL query, the xmllint query I'm looking for would be something like:
```sql
select bf, cpbt from PTC.BPSETS.BPSET.BPS.PNS.PN
where BPSET = "b95" AND BPS = "88lmax" AND PN.index = 0
```

If you follow my drift. Any guidance here is appreciated. Thanks.

1 Answer:

Answer 0: (score: 0)

Further research and experimentation showed that this is the required syntax:

echo 'cat //PTC/BPSETS/BPSET[@define="b95"]/BPS[@define="88lmax"]/PNS/PN[@index="0"]/AWD/@cpbt' | xmllint --shell stub.xml

This produces the desired data.
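
Note that the command above returns only cpbt, which lives on the AWD child, while bf is an attribute of PN itself. As a cross-check outside xmllint, here is a minimal Python sketch of the same lookup that returns both values. It is only an illustration under some assumptions: the function name `lookup_pn` and the embedded `SAMPLE` are hypothetical, and `SAMPLE` is a trimmed, well-formed stand-in for stub.xml (the snippet as posted in the question is not well-formed, because the last BPS element ends with `<BPS>` instead of `</BPS>`).

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for stub.xml; only the elements needed here.
SAMPLE = """<PTC version="2.0" cls="2">
  <BPSETS>
    <BPSET define="b95">
      <BPS define="88lmax">
        <PNS>
          <PN index="0" atv="1" bf="32203506">
            <AWD cpbt="390">
              <BUNS ptgp="bn38" bdx="38" fawd="1" />
            </AWD>
          </PN>
          <PN index="688" atv="1" bf="1183872">
            <AWD cpbt="50" />
          </PN>
        </PNS>
      </BPS>
    </BPSET>
  </BPSETS>
</PTC>
"""


def lookup_pn(root, bpset, bps, index):
    """Return (bf, cpbt) for the first matching PN, or None if none matches.

    root is the PTC element; bpset, bps and index are compared as strings
    against the define/define/index attributes, as in the XPath above.
    """
    path = (
        f"./BPSETS/BPSET[@define='{bpset}']"
        f"/BPS[@define='{bps}']"
        f"/PNS/PN[@index='{index}']"
    )
    pn = root.find(path)
    if pn is None:
        return None
    # cpbt sits on the AWD child; some AWD elements in the data have no cpbt.
    awd = pn.find("AWD")
    cpbt = awd.get("cpbt") if awd is not None else None
    return pn.get("bf"), cpbt


root = ET.fromstring(SAMPLE)
# For a real file one would use: root = ET.parse("stub.xml").getroot()
print(lookup_pn(root, "b95", "88lmax", "0"))  # ('32203506', '390')
```

For a file of 1.5 million lines this loads the whole tree into memory, which may or may not be acceptable on the system in question.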