Question

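I'm working with several huge (> 2 GB) XML files, and their size is causing problems.

(For example, I use XMLReader in a PHP script to parse smaller ~500 MB files and that works fine, but 32-bit PHP can't open files this large.)

So my idea is to strip out the large chunks of each file that I know I don't need.

For example, if the file is structured like this: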

<record id="1">
    <a>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </a>
    <b>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </b>
    <c>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </c>
</record>
...
<record id="999999">
    <a>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </a>
    <b>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </b>
    <c>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </c>
</record>

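For my purposes, I only need the data in the parent node <a> of each record. If I could remove the parent nodes <b> and <c> from every record, the file would shrink drastically and become small enough to process normally.

What's the best way to do something like this (hopefully using sed or grep, or a free/cheap application)?

I've tried the trial version of Altova XML Spy, and it won't even open my XML file (I assume because the file is too large).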

Answer 1

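Since you mention sed and grep, I assume you're on Linux.

If you have the xsltproc utility available...

Given a corrected version of your test file (wrapped in a single <records> root element so that it is well-formed):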

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet href="project.xsl" type="text/xsl"?>

<records>
<record id="1">
    <a>
        <detail>hello</detail>
        bar
        <detail>world</detail>
    </a>
    <b>
        <detail>blah</detail>
        <detail>blah</detail>
    </b>
    <c>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </c>
</record>
<record id="999999">
    <a>
        <detail>blah</detail>
        foo
        <detail>blah blah</detail>
    </a>
    <b>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </b>
    <c>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </c>
</record>
</records>

and the corresponding XSL:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" />

  <!-- Rebuild the root element, keeping only what is needed from each record -->
  <xsl:template match="records">
    <xsl:element name="records">
      <xsl:for-each select="record">
        <!-- Recreate the record with its id attribute -->
        <xsl:element name="record">
          <xsl:attribute name="id"><xsl:value-of select="@id" /></xsl:attribute>
          <!-- Copy <a> and everything inside it; <b> and <c> are dropped -->
          <xsl:copy-of select="./a" />
        </xsl:element>
      </xsl:for-each>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>

the result of

xsltproc extract.xsl  record.xml

would be

<?xml version="1.0"?>
<records><record id="1"><a>
        <detail>hello</detail>
        bar
        <detail>world</detail>
    </a></record><record id="999999"><a>
        <detail>blah</detail>
        foo
        <detail>blah blah</detail>
    </a></record></records>

Is that close to what you expected?
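One caveat: as far as I know, xsltproc parses the whole input into an in-memory tree before transforming it, so it may struggle with a file over 2 GB. If that turns out to be a problem, the same filtering can be done in a streaming fashion. Below is a minimal sketch in Python (a different technique from the XSLT above, not a drop-in replacement) using `xml.etree.ElementTree.iterparse` from the standard library. The function name `keep_only_a` is made up for illustration, and the sketch assumes every `<record>` is a direct child of the root element and that the root is called `<records>`, as in the corrected test file.

```python
import io
import xml.etree.ElementTree as ET


def keep_only_a(source, sink, keep="a"):
    """Stream <record> elements from `source` (a path or binary file object)
    and write each one to the text file object `sink`, keeping only its
    <keep> children. Hypothetical helper; assumes records sit directly
    under a <records> root."""
    sink.write('<?xml version="1.0"?>\n<records>\n')
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "record":
            # Build a slimmed-down copy: same attributes, only the <a> children.
            slim = ET.Element("record", elem.attrib)
            slim.extend(elem.findall(keep))
            slim.tail = "\n"
            sink.write(ET.tostring(slim, encoding="unicode"))
            # Drop the records parsed so far so memory use stays roughly flat.
            root.clear()
    sink.write("</records>\n")
```

To run it on a real file, something like `keep_only_a(open("record.xml", "rb"), open("slim.xml", "w", encoding="utf-8"))` should do, where `slim.xml` is just a placeholder output name. Only one record is held in memory at a time, so the size of the input file should matter much less than with a tree-based tool.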
