Parsing updateinfo.xml

Date: 2016-04-25 22:07:45

Tags: python xml

I have been trying to parse Amazon's updateinfo.xml file with Python for a university project. A sample of the file follows:

<?xml version="1.0" ?>
<updates>
<update author="linux-security@amazon.com" from="linux-security@amazon.com" status="final" type="security" version="1.4">
<id>AL2012-2014-001</id>
<title>Amazon Linux 2012.03 - AL2012-2014-001: important priority package update for libxml2</title>
<issued date="2014-10-19 15:48" />
<updated date="2014-10-19 15:48" />
<severity>important</severity>
<description>Package updates are available for Amazon Linux that fix the following vulnerabilities:
CVE-2012-5134:
	A heap-based buffer underflow flaw was found in the way libxml2 decoded certain entities. A remote attacker could provide a specially-crafted XML file that, when opened in an application linked against libxml2, would cause the application to crash or, potentially, execute arbitrary code with the privileges of the user running the application.
</description>
<references>
<reference href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-5134" id="CVE-2012-5134" title="" type="cve" />
<reference href="https://rhn.redhat.com/errata/RHSA-2012:1512.html" id="RHSA-2012:1512" title="" type="redhat" />
</references>
<pkglist>
<collection short="amazon-linux">
<name>Amazon Linux</name>
<package arch="x86_64" epoch="0" name="libxml2-debuginfo" release="10.23.26.ec2" version="2.7.8">
<filename>Packages/libxml2-debuginfo-2.7.8-10.23.26.ec2.x86_64.rpm</filename>
</package>
<package arch="x86_64" epoch="0" name="libxml2-devel" release="10.23.26.ec2" version="2.7.8">
<filename>Packages/libxml2-devel-2.7.8-10.23.26.ec2.x86_64.rpm</filename>
</package>
<package arch="x86_64" epoch="0" name="libxml2" release="10.23.26.ec2" version="2.7.8">
<filename>Packages/libxml2-2.7.8-10.23.26.ec2.x86_64.rpm</filename>
</package>
<package arch="x86_64" epoch="0" name="libxml2-static" release="10.23.26.ec2" version="2.7.8">
<filename>Packages/libxml2-static-2.7.8-10.23.26.ec2.x86_64.rpm</filename>
</package>
<package arch="x86_64" epoch="0" name="libxml2-python" release="10.23.26.ec2" version="2.7.8">
<filename>Packages/libxml2-python-2.7.8-10.23.26.ec2.x86_64.rpm</filename>
</package>
</collection>
</pkglist>
</update>
<update author="linux-security@amazon.com" from="linux-security@amazon.com" status="final" type="security" version="1.4">
<id>AL2012-2015-088</id>
<title>Amazon Linux 2012.03 - AL2012-2015-088: medium priority package update for gnutls</title>
<issued date="2015-07-29 20:47" />
<updated date="2015-07-29 20:47" />
<severity>medium</severity>
<description>Package updates are available for Amazon Linux that fix the following vulnerabilities:
CVE-2015-0294:
	It was discovered that GnuTLS did not check if all sections of X.509 certificates indicate the same signature algorithm. This flaw, in combination with a different flaw, could possibly lead to a bypass of the certificate signature check.

CVE-2015-0282:
	It was found that GnuTLS did not verify whether a hashing algorithm listed in a signature matched the hashing algorithm listed in the certificate. An attacker could create a certificate that used a different hashing algorithm than it claimed, possibly causing GnuTLS to use an insecure, disallowed hashing algorithm during certificate verification.

CVE-2014-8155:
	It was found that GnuTLS did not check activation and expiration dates of CA certificates. This could cause an application using GnuTLS to incorrectly accept a certificate as valid when its issuing CA is already expired.
</description>
<references>
<reference href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-8155" id="CVE-2014-8155" title="" type="cve" />
<reference href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-0282" id="CVE-2015-0282" title="" type="cve" />
<reference href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-0294" id="CVE-2015-0294" title="" type="cve" />
<reference href="https://rhn.redhat.com/errata/RHSA-2015:1457.html" id="RHSA-2015:1457" title="" type="redhat" />
</references>
<pkglist>
<collection short="amazon-linux">
<name>Amazon Linux</name>
<package arch="x86_64" epoch="0" name="gnutls-debuginfo" release="18.14.al12" version="2.8.5">
<filename>Packages/gnutls-debuginfo-2.8.5-18.14.al12.x86_64.rpm</filename></package>
<package arch="x86_64" epoch="0" name="gnutls" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-2.8.5-18.14.al12.x86_64.rpm</filename></package>
<package arch="x86_64" epoch="0" name="gnutls-devel" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-devel-2.8.5-18.14.al12.x86_64.rpm</filename></package>
<package arch="x86_64" epoch="0" name="gnutls-utils" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-utils-2.8.5-18.14.al12.x86_64.rpm</filename></package>
<package arch="x86_64" epoch="0" name="gnutls-guile" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-guile-2.8.5-18.14.al12.x86_64.rpm</filename></package>
<package arch="i686" epoch="0" name="gnutls-debuginfo" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-debuginfo-2.8.5-18.14.al12.i686.rpm</filename></package>
<package arch="i686" epoch="0" name="gnutls-devel" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-devel-2.8.5-18.14.al12.i686.rpm</filename></package>
<package arch="i686" epoch="0" name="gnutls-guile" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-guile-2.8.5-18.14.al12.i686.rpm</filename></package>
<package arch="i686" epoch="0" name="gnutls" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-2.8.5-18.14.al12.i686.rpm</filename></package>
<package arch="i686" epoch="0" name="gnutls-utils" release="18.14.al12" version="2.8.5"><filename>Packages/gnutls-utils-2.8.5-18.14.al12.i686.rpm</filename></package>
</collection>
</pkglist>
</update>
</updates>

I am trying to extract details such as the arch type, name, release version, and filename (without the "Packages/" prefix).

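My question is: how can I do this efficiently for a file containing 300 entries like the ones above? With my limited knowledge of Python, I can manage it for a single entry. But with this many entries (700+, a 1.5G file), running it in a for loop consumes a lot of resources and the output contains garbled text. How should I go about it?

1 answer:

Answer 0: (score: 2)

Use the xml.etree module. In my experience, xml.etree performs well.

For example: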

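import xml.etree.ElementTree as ET

# Parse the whole file into memory and collect every <update> element.
tree = ET.parse('updateinfo.xml')
root = tree.getroot()
updates = root.findall('update')

for update in updates:
    # Each update lists its packages under pkglist/collection.
    packages = update.find('pkglist').find('collection').findall('package')
    for package in packages:
        # Print arch, name, release, and the filename without the 'Packages/' prefix.
        print(package.attrib['arch'], package.attrib['name'], package.attrib['release'],
              package.find('filename').text.replace('Packages/', ''))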

This produces the following output (run with python3):

x86_64 libxml2-debuginfo 10.23.26.ec2 libxml2-debuginfo-2.7.8-10.23.26.ec2.x86_64.rpm
x86_64 libxml2-devel 10.23.26.ec2 libxml2-devel-2.7.8-10.23.26.ec2.x86_64.rpm
x86_64 libxml2 10.23.26.ec2 libxml2-2.7.8-10.23.26.ec2.x86_64.rpm
x86_64 libxml2-static 10.23.26.ec2 libxml2-static-2.7.8-10.23.26.ec2.x86_64.rpm
x86_64 libxml2-python 10.23.26.ec2 libxml2-python-2.7.8-10.23.26.ec2.x86_64.rpm
x86_64 gnutls-debuginfo 18.14.al12 gnutls-debuginfo-2.8.5-18.14.al12.x86_64.rpm
x86_64 gnutls 18.14.al12 gnutls-2.8.5-18.14.al12.x86_64.rpm
x86_64 gnutls-devel 18.14.al12 gnutls-devel-2.8.5-18.14.al12.x86_64.rpm
x86_64 gnutls-utils 18.14.al12 gnutls-utils-2.8.5-18.14.al12.x86_64.rpm
x86_64 gnutls-guile 18.14.al12 gnutls-guile-2.8.5-18.14.al12.x86_64.rpm
i686 gnutls-debuginfo 18.14.al12 gnutls-debuginfo-2.8.5-18.14.al12.i686.rpm
i686 gnutls-devel 18.14.al12 gnutls-devel-2.8.5-18.14.al12.i686.rpm
i686 gnutls-guile 18.14.al12 gnutls-guile-2.8.5-18.14.al12.i686.rpm
i686 gnutls 18.14.al12 gnutls-2.8.5-18.14.al12.i686.rpm
i686 gnutls-utils 18.14.al12 gnutls-utils-2.8.5-18.14.al12.i686.rpm
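
ET.parse builds the entire tree in memory, which may be the source of the resource usage described in the question for a 1.5G file. A possible alternative, sketched here rather than taken from the answer, is to stream the file with ET.iterparse and discard each update element once it has been processed. The helper name iter_packages and the small in-memory sample are hypothetical and used only for illustration; the element and attribute names are those of the sample file above.

```python
import io
import xml.etree.ElementTree as ET


def iter_packages(source):
    """Yield (arch, name, release, filename) per <package>, one <update> at a time.

    Hypothetical helper: `source` is a file name or a binary file object.
    """
    # Request "start" events too, so the root element can be captured up front.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of <updates>
    for event, elem in context:
        if event != "end" or elem.tag != "update":
            continue
        # A path expression yields nothing for updates without a pkglist,
        # instead of failing on a None result from find().
        for package in elem.iterfind("pkglist/collection/package"):
            filename = package.findtext("filename", default="")
            if filename.startswith("Packages/"):
                filename = filename[len("Packages/"):]
            yield (package.get("arch"), package.get("name"),
                   package.get("release"), filename)
        # Drop the finished <update> from the root so memory use stays bounded.
        root.clear()


# Small in-memory sample (assumed data shaped like the file above).
sample = b"""<?xml version="1.0" ?>
<updates>
<update><id>AL2012-2014-001</id><pkglist><collection short="amazon-linux">
<package arch="x86_64" epoch="0" name="libxml2" release="10.23.26.ec2" version="2.7.8">
<filename>Packages/libxml2-2.7.8-10.23.26.ec2.x86_64.rpm</filename>
</package>
</collection></pkglist></update>
</updates>"""

for row in iter_packages(io.BytesIO(sample)):
    print(*row)
```

For the real file, passing the path, as in iter_packages('updateinfo.xml'), should work the same way, since iterparse accepts either a file name or a file object. Whether this is fast enough for the full file would need to be measured.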