How can I use a regular expression to get elements with a specific value in XML?

Time: 2019-03-21 14:03:32

Tags: python regex xml

I have this XML string.

<?xml version="1.0" encoding="UTF-8"?>
<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI" xmlns:libraries="http://www.ibm.com/websphere/appserver/schemas/5.0/libraries.xmi">
  <libraries:Library xmi:id="Library_1382473016602" name="sfi_lib" isolatedClassLoader="false">
    <classPath>${HOME_SFI_LIB}/sfi_com_sqw_java.jar</classPath>
  </libraries:Library>
  <libraries:Library xmi:id="Library_1528914932212" name="sfi_lib_server" isolatedClassLoader="false">
    <classPath>${HOME_SFI_LIB}/jasper/jasperreports-5.6.0.jar</classPath>
    <classPath>${HOME_SFI_LIB}/jasper/jasperreports-fonts-3.7.4.jar</classPath>
    <classPath>${HOME_SFI_LIB}/commons/commons-beanutils-1.8.2.jar</classPath>
    <classPath>${HOME_SFI_LIB}/commons/commons-collections-3.2.1.jar</classPath>
    <classPath>${HOME_SFI_LIB}/commons/commons-digester-2.1.jar</classPath>
    <classPath>${HOME_SFI_LIB}/commons/commons-discovery-0.2.jar</classPath>
    <classPath>${HOME_SFI_LIB}/commons/commons-logging-1.1.1.jar</classPath>
    <classPath>${HOME_SFI_LIB}/commons/xml-apis.jar</classPath>
    <classPath>${HOME_SFI_LIB}/commons/iText-2.1.7.jar</classPath>
    <classPath>${HOME_SFI_LIB}/jasper/barbecue-1.5-beta1.jar</classPath>
    <classPath>${HOME_SFI_LIB}/bouncycastle/bcprov-jdk15-1.45.jar</classPath>
    <classPath>${HOME_SFI_LIB}/bouncycastle/bcmail-jdk15-1.45.jar</classPath>
    <classPath>${HOME_SFI_LIB}/bouncycastle/bctsp-jdk14-1.45.jar</classPath>
    <classPath>${HOME_SFI}/sfi_arquivos/templates</classPath>
    <classPath>${HOME_SFI_LIB}/sfi_framework_java.jar</classPath>
    <classPath>${HOME_SFI_LIB}/sfi_adm_ama_java.jar</classPath>
    <classPath>${HOME_SFI_LIB}/sfi_adm_gce_java.jar</classPath>
    <classPath>${HOME_SFI_LIB}/sfi_adm_gdl_java.jar</classPath>
    <classPath>${HOME_SFI_LIB}/sfi_adm_prt_java.jar</classPath>
    <classPath>${HOME_SFI_LIB}/sfi_com_acg_java.jar</classPath>
    <classPath>${HOME_SFI_LIB}/sfi_com_sca_java.jar</classPath>
    <classPath>${HOME_SFI_LIB}/sfi_com_tge_java.jar</classPath>
    <classPath>${HOME_SFI_LIB}/sfi_com_utl_java.jar</classPath>
    <classPath>${HOME_SFI_LIB}/sfi_ext_sge_java.jar</classPath>
  </libraries:Library>
</xmi:XMI>

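What I want to do is get the values of the elements whose text starts with ${HOME_SFI_LIB}/sfi_. I am using Python's re module to do the job. My current code only filters by the classPath tag, which is not enough. The regular expression I currently use: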

re.findall('<classPath>(.*?)</classPath>', xml)

Can someone help me improve the regex so that it only matches elements starting with ${HOME_SFI_LIB}/sfi_, for example the node <classPath>${HOME_SFI_LIB}/sfi_adm_gce_java.jar</classPath>?

1 answer:

Answer 0: (score: 1)

As this post points out, it is better to use an XML parser such as lxml to navigate markup languages such as XML, HTML, and XHTML:

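from lxml import etree

# Open in binary mode so the parser can honor the file's encoding declaration
with open('your_file.xml', 'rb') as fh:
    tree = etree.parse(fh)

# Now you have an ElementTree instance that you can search for tags.
# An XPath selector returns a list of matching elements.
class_paths = tree.xpath('//classPath')

prefix = '${HOME_SFI_LIB}/sfi_'
for c in class_paths:
    # c.text is None for an empty element; startswith matches only a leading prefix
    if c.text and c.text.startswith(prefix):
        print(c.text)  # rest of your code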

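While you could argue that the regex approach can work for a simple XML document, in general a tree makes this process easier to scale to larger, more complex documents.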
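For completeness, here is a minimal sketch of the regex route the question asked about. It assumes the classPath elements carry no attributes and contain plain text only; the `PREFIX` name, the `<root>` wrapper, and the shortened sample are illustrative and not part of the original file.

```python
import re

# Shortened, hypothetical sample; the real document has many more entries.
xml = """<root>
  <classPath>${HOME_SFI_LIB}/sfi_com_sqw_java.jar</classPath>
  <classPath>${HOME_SFI_LIB}/commons/xml-apis.jar</classPath>
  <classPath>${HOME_SFI}/sfi_arquivos/templates</classPath>
  <classPath>${HOME_SFI_LIB}/sfi_framework_java.jar</classPath>
</root>"""

PREFIX = "${HOME_SFI_LIB}/sfi_"

# re.escape neutralizes '$', '{' and '}', which are regex metacharacters.
# [^<]* keeps the match inside a single element's text.
pattern = re.compile(r"<classPath>(" + re.escape(PREFIX) + r"[^<]*)</classPath>")

matches = pattern.findall(xml)
print(matches)
# ['${HOME_SFI_LIB}/sfi_com_sqw_java.jar', '${HOME_SFI_LIB}/sfi_framework_java.jar']
```

Escaping the prefix is the important part: an unescaped `$` would be read as an end-of-line anchor and the pattern would never match.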

Edit

If you can't pip install lxml, the xml package is built into the standard library and works in a very similar way:

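from xml.etree import ElementTree as ET

# Open in binary mode so the parser can honor the file's encoding declaration
with open('your_file.xml', 'rb') as fh:
    tree = ET.parse(fh)

prefix = '${HOME_SFI_LIB}/sfi_'
for element in tree.iterfind('.//classPath'):
    # element.text is None for an empty element; startswith matches only a leading prefix
    if element.text and element.text.startswith(prefix):
        print(element.text)  # rest of your code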
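
To try the standard-library variant without a file on disk, the sketch below parses a shortened copy of the question's XML from a string. The helper name `sfi_class_paths` and the trimmed sample are assumptions made for illustration. It relies on the classPath elements being un-namespaced, as they are in the question's document.

```python
from xml.etree import ElementTree as ET

# Shortened copy of the XML from the question (most classPath entries omitted).
xml = (
    '<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI" '
    'xmlns:libraries="http://www.ibm.com/websphere/appserver/schemas/5.0/libraries.xmi">'
    '<libraries:Library xmi:id="Library_1382473016602" name="sfi_lib" isolatedClassLoader="false">'
    '<classPath>${HOME_SFI_LIB}/sfi_com_sqw_java.jar</classPath>'
    '</libraries:Library>'
    '<libraries:Library xmi:id="Library_1528914932212" name="sfi_lib_server" isolatedClassLoader="false">'
    '<classPath>${HOME_SFI_LIB}/commons/xml-apis.jar</classPath>'
    '<classPath>${HOME_SFI}/sfi_arquivos/templates</classPath>'
    '<classPath>${HOME_SFI_LIB}/sfi_adm_gce_java.jar</classPath>'
    '</libraries:Library>'
    '</xmi:XMI>'
)

PREFIX = "${HOME_SFI_LIB}/sfi_"


def sfi_class_paths(xml_text, prefix=PREFIX):
    """Return the text of every classPath element that starts with prefix."""
    root = ET.fromstring(xml_text)
    return [
        el.text
        for el in root.iter("classPath")  # un-namespaced tag, matched at any depth
        if el.text and el.text.startswith(prefix)  # skip empty elements
    ]


paths = sfi_class_paths(xml)
print(paths)
```

Because the filter is ordinary Python (`str.startswith`), there is nothing to escape, which is one practical advantage over embedding `${...}` in a regex.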