Extracting subsets of a sequence using XPath

Date: 2016-06-06 16:42:32

Tags: python xml xpath

I am looking for an XPath expression that extracts 'sets' as separate sequences. It must be interpreted by Python's lxml (which is a wrapper around libxml2).

For example, given the following:

<root>
    <sub1>
        <sub2>
            <Container>
                <item>1 - My laptop has exploded again</item>
                <item>2 - This is an issue which needs to be fixed.</item>
            </Container>
        </sub2>
        <sub2>
            <Container>
                <item>3 - It's still not working</item>
                <item>4 - do we have a working IT department or what?</item>
            </Container>
        </sub2>
        <sub2>
            <Container>
                <item>5 - Never mind - I got my 8 year old niece to fix it</item>
            </Container>
        </sub2>
    </sub1>
</root>

I want to be able to isolate each group or sequence. For example, sequence 1 is:

1 - My laptop has exploded again
2 - This is an issue which needs to be fixed.

The second sequence:

3 - It's still not working
4 - do we have a working IT department or what?

The third sequence:

5 - Never mind - I got my 8 year old niece to fix it

where the 'sequences' would translate into pseudocode/Python as:

seq1 = ['1 - My laptop has exploded again', '2 - This is an issue which needs to be fixed.']
seq2 = ["3 - It's still not working", '4 - do we have a working IT department or what?']
seq3 = ['5 - Never mind - I got my 8 year old niece to fix it']

From some preliminary research it seems that sequences can't be nested, but I wonder whether there is some black magic that can be done with these operators.

2 answers:

Answer 0: (score: 1)

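  1. Evaluate this XPath expression:

    count(/*/*/*)

  This finds the number of <sub2> elements. An equivalent, more readable but longer expression is:

    count(/*/sub1/sub2)

  2. For each $n from 1 to count(/*/*/*), evaluate the following XPath expression:

    /*/*/*[$n]/*/item/text()

  Again, this is equivalent to the longer, more readable:

    /*/sub1/sub2[$n]/Container/item/text()

  Before evaluating the above expression, replace $n with its actual value (for example, using the format() method for strings).

  For the provided XML document the count is 3, so the actual XPath expressions evaluated are:

    /*/*/*[1]/*/item/text()

    /*/*/*[2]/*/item/text()

    /*/*/*[3]/*/item/text()

  They produce the following results, respectively, each as a collection (language-dependent: an array, sequence, collection, IEnumerable<string>, etc.):

    "1 - My laptop has exploded again", "2 - This is an issue which needs to be fixed."

    "3 - It's still not working", "4 - do we have a working IT department or what?"

    "5 - Never mind - I got my 8 year old niece to fix it"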
      
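The two steps can be sketched in Python with lxml roughly as follows. This is only an illustration under some assumptions: the XML from the question is embedded as a byte string rather than read from a file, and the helper name extract_sequences is made up for this example.

```python
from lxml import etree

# The sample document from the question, embedded so the sketch is self-contained.
XML = b"""<root>
    <sub1>
        <sub2>
            <Container>
                <item>1 - My laptop has exploded again</item>
                <item>2 - This is an issue which needs to be fixed.</item>
            </Container>
        </sub2>
        <sub2>
            <Container>
                <item>3 - It's still not working</item>
                <item>4 - do we have a working IT department or what?</item>
            </Container>
        </sub2>
        <sub2>
            <Container>
                <item>5 - Never mind - I got my 8 year old niece to fix it</item>
            </Container>
        </sub2>
    </sub1>
</root>"""


def extract_sequences(root):
    """Hypothetical helper: one list of item texts per third-level element."""
    # Step 1: count() yields an XPath number, which lxml returns as a float.
    n_groups = int(root.xpath("count(/*/*/*)"))
    sequences = []
    # Step 2: XPath positions are 1-based, so iterate 1..n_groups inclusive.
    for n in range(1, n_groups + 1):
        expr = "/*/*/*[{}]/*/item/text()".format(n)
        # Convert lxml's string results to plain str objects.
        sequences.append([str(text) for text in root.xpath(expr)])
    return sequences


root = etree.fromstring(XML)
sequences = extract_sequences(root)
for seq in sequences:
    print(seq)
```

Each XPath call still returns a flat list. The grouping comes from issuing one query per <sub2> element and collecting the results on the Python side.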

Answer 1: (score: 0)

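from lxml import etree

# Parse the document; "data.xml" is expected next to this script.
doc = etree.parse("data.xml")
containers = doc.findall('sub1/sub2/Container')
finalResult = []
for container in containers:
    # Collect the text of every <item> in this <Container> as one sequence.
    sequence = []
    for item in container.findall('item'):
        sequence.append(item.text)
    finalResult.append(sequence)
print(finalResult)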

This is the result:

[['1 - My laptop has exploded again', '2 - This is an issue which needs to be fixed.'], ["3 - It's still not working", '4 - do we have a working IT department or what?'], ['5 - Never mind - I got my 8 year old niece to fix it']]

Note

I assume the data is in a file named "data.xml" in the same directory as the script containing the code above.
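If no file is at hand, the same grouping can be tried on an in-memory string. The sketch below is a variant, not the code from the answer: it assumes the standard library's xml.etree.ElementTree is an acceptable stand-in for lxml here (it supports the simple paths used in this answer), and it swaps the nested loops for a list comprehension.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, embedded instead of read from data.xml.
XML = """<root>
    <sub1>
        <sub2>
            <Container>
                <item>1 - My laptop has exploded again</item>
                <item>2 - This is an issue which needs to be fixed.</item>
            </Container>
        </sub2>
        <sub2>
            <Container>
                <item>3 - It's still not working</item>
                <item>4 - do we have a working IT department or what?</item>
            </Container>
        </sub2>
        <sub2>
            <Container>
                <item>5 - Never mind - I got my 8 year old niece to fix it</item>
            </Container>
        </sub2>
    </sub1>
</root>"""

# fromstring() returns the <root> element, so the path is relative to it.
root = ET.fromstring(XML)

# One inner list per <Container>, holding the text of its <item> children.
sequences = [
    [item.text for item in container.findall("item")]
    for container in root.findall("sub1/sub2/Container")
]
print(sequences)
```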