Exact match of a substring within a string in Python

Time: 2020-03-31 11:47:27

Tags: python xml lxml

I know this is a common question, but the example below is more complicated than the title suggests.

Suppose I have the following "test.xml" file:

<?xml version="1.0" encoding="UTF-8"?>
<test:xml xmlns:test="http://com/whatever/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <parent xsi:type="parentType">
    <child xsi:type="childtype">
      <grandchild>
        <greatgrandchildone>greatgrandchildone</greatgrandchildone>
        <greatgrandchildtwo>greatgrandchildtwo</greatgrandchildtwo>
      </grandchild><!--random comment -->
    </child>
    <child xsi:type="childtype">
      <greatgrandchildthree>greatgrandchildthree</greatgrandchildthree>
      <greatgrandchildfour>greatgrandchildfour</greatgrandchildfour><!--another random comment -->
    </child>
    <child xsi:type="childtype">
      <greatgrandchildthree>greatgrandchildthree</greatgrandchildthree>
      <greatgrandchildfour>greatgrandchildfour</greatgrandchildfour><!--third random comment -->
    </child>
  </parent>
</test:xml>

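In the program below, I am doing two things:

  1. Find every node in the XML that carries a "type" attribute
  2. Walk through every node in the XML and determine whether it is a descendant of an element that carries a "type" attribute

Here is my code: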

from lxml import etree
import re

xmlDoc = etree.parse("test.xml")
root = xmlDoc.getroot()

nsmap = {
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
}

nodesWithType = []

def check_type_in_path(nodesWithType, path, root):
    typesInPath = []
    elementType = ""

    for node in nodesWithType:
        print("checking node: ", node, " and path: ", path)

        if re.search(r"\b{}\b".format(
            node), path, re.IGNORECASE) is not None:

            element = root.find('.//{0}'.format(node))
            elementType = element.attrib.get(f"{{{nsmap['xsi']}}}type")
            if elementType is not None:
                print("found an element for this path. adding to list")
                typesInPath.append(elementType)
        else:
            print("element: ", node, " not found in path: ", path)

    print("path ", path ," has types: ", elementType)
    print("-------------------")
    return typesInPath

def get_all_node_types(xmlDoc):
    nodesWithType = []
    root = xmlDoc.getroot()

    for node in xmlDoc.iter():

        path = "/".join(xmlDoc.getpath(node).strip("/").split('/')[1:])

        if "COMMENT" not in path.upper():
            element = root.find('.//{0}'.format(path))
            elementType = element.attrib.get(f"{{{nsmap['xsi']}}}type")
            if elementType is not None:
                nodesWithType.append(path)

    return nodesWithType

nodesWithType = get_all_node_types(xmlDoc)
print("nodesWithType: ", nodesWithType)

for node in xmlDoc.xpath('//*'):
    path = "/".join(xmlDoc.getpath(node).strip("/").split('/')[1:])
    typesInPath = check_type_in_path(nodesWithType, path, root)

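The code should return all the types that occur along a given path. For example, consider the path parent/child[3]/greatgrandchildfour. This node is a descendant (direct or distant) of two nodes that carry the "type" attribute: parent and parent/child[3]. I would therefore expect the typesInPath list for this node to contain both "parentType" and "childtype".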

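However, according to the output below, the list for this node contains only "parentType" and not "childtype". The core of the logic is to check whether the path of a typed node is contained in the path of the node being examined (hence the exact string match). That clearly does not work. I am not sure whether it is because the index notation (such as [3]) in the path keeps the condition from matching, or whether there is some other reason.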
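
The suspicion about the index notation can be checked in isolation. The following is a minimal sketch that uses only the re module and the two strings from the example. The helper name is_ancestor_or_self is hypothetical and is not part of the original program.

```python
import re

node = "parent/child[3]"
path = "parent/child[3]/greatgrandchildfour"

# Unescaped, "[3]" is a character class that matches the single character "3",
# so the pattern looks for "parent/child3", which does not occur in the path.
unescaped = re.search(r"\b{}\b".format(node), path, re.IGNORECASE)

# re.escape makes the brackets literal, but the trailing \b still fails:
# "]" and "/" are both non-word characters, so no word boundary lies between them.
escaped = re.search(r"\b{}\b".format(re.escape(node)), path, re.IGNORECASE)

# Hypothetical helper: a prefix test on whole path segments needs no regex at all.
def is_ancestor_or_self(node_path, path):
    return path == node_path or path.startswith(node_path + "/")

print(unescaped, escaped, is_ancestor_or_self(node, path))
```

Appending "/" before the prefix test keeps parent/child[1] from being treated as a prefix of a path such as parent/child[10]/x.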

For the path parent/child[3]/greatgrandchildfour, the program prints:

checking node:  parent  and path:  parent/child[3]/greatgrandchildfour
found an element for this path. adding to list
checking node:  parent/child[1]  and path:  parent/child[3]/greatgrandchildfour
element:  parent/child[1]  not found in path:  parent/child[3]/greatgrandchildfour
checking node:  parent/child[2]  and path:  parent/child[3]/greatgrandchildfour
element:  parent/child[2]  not found in path:  parent/child[3]/greatgrandchildfour
checking node:  parent/child[3]  and path:  parent/child[3]/greatgrandchildfour
element:  parent/child[3]  not found in path:  parent/child[3]/greatgrandchildfour
path  parent/child[3]/greatgrandchildfour  has types:  parentType
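
One way to sidestep string matching altogether is to carry the types down the tree while walking it. The sketch below is only an illustration under some assumptions: it swaps lxml for the standard library's xml.etree.ElementTree so that it runs without extra packages, it uses a shortened inline copy of the sample document, and the names collect_types, XSI_TYPE and SAMPLE are hypothetical. It builds the path strings itself instead of calling getpath.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Shortened version of test.xml, inlined so the sketch is self-contained.
SAMPLE = """<test:xml xmlns:test="http://com/whatever/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <parent xsi:type="parentType">
    <child xsi:type="childtype">
      <grandchild><greatgrandchildone>a</greatgrandchildone></grandchild>
    </child>
    <child xsi:type="childtype">
      <greatgrandchildfour>b</greatgrandchildfour>
    </child>
  </parent>
</test:xml>"""

def collect_types(element, inherited=(), path=""):
    """Yield (path, types), where types holds the xsi:type values of the
    element itself and of all its ancestors, outermost first."""
    own = element.get(XSI_TYPE)
    types = inherited + ((own,) if own is not None else ())
    if path:  # the document root has an empty path and is skipped
        yield path, list(types)
    totals = Counter(child.tag for child in element)
    seen = Counter()
    for child in element:
        seen[child.tag] += 1
        step = child.tag
        if totals[child.tag] > 1:  # add an index only for repeated sibling tags
            step += "[{}]".format(seen[child.tag])
        child_path = path + "/" + step if path else step
        yield from collect_types(child, types, child_path)

types_by_path = dict(collect_types(ET.fromstring(SAMPLE)))
print(types_by_path["parent/child[2]/greatgrandchildfour"])
```

Because each element receives the accumulated types of its ancestors directly, no path ever has to be compared with another path.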

0 answers:

No answers