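When parsing an XML file with lxml.etree, is it possible to skip the first entries or start iterating at a specific child?

Time: 2019-05-21 19:25:21

Tags: python xml parsing xpath lxml

I am currently using the .iter method from the lxml.etree package for Python to parse an XML file. Is it possible to use something like XPath to skip the first entries, or to start iterating at a specific child?

I have looked into the itertext and iterparse methods, but going by their definitions I am not sure they would do more than help narrow the iteration down to a specific tag, which I am already doing.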

import lxml.etree as et

parsedXML = et.parse(file_path)

for child in parsedXML.iter('{http://www.witsml.org/schemas/131}data'):
    ...  # process each row (child.text) here

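This code parses the XML file successfully, but I would like to save time by skipping rows that are empty or that lack a sufficient number of characters (the values are all comma-separated).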

<logData>
<data>63653079886,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079887,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079888,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079889,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079891,,29.3,155.7,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079892,,29.3,155.8,12.25,0.0,0,0,93.76,-87.65,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>

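Apart from the 11-digit value at the start of each row, the first rows are empty. In this case I want to skip those rows and start the iteration at the first row that has the 12.25 value (row 5 in the example).

1 answer:

Answer 0: (score: 0)

Since a data element that holds only the 11-digit number and the commas (no whitespace) is 34 characters long, you can test the string-length() in a predicate: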

data[string-length(translate(.,' ','')) > 34]

I used translate() to strip all spaces before checking the string length.

Example...

XML input (input.xml)

<logData>
    <data>63653079886,,,,,,,,,,,,,,,,,,,,,,,</data>
    <data>63653079887,,,,,,,,,,,,,,,,,,,,,,,</data>
    <data>63653079888,,,,,,,,,,,,,,,,,,,,,,,</data>
    <data>63653079889,,,,,,,,,,,,,,,,,,,,,,,</data>
    <data>63653079889, , , , , , , , , , , , , , , , , , , , , , ,</data>
    <data>63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
    <data>63653079891,,29.3,155.7,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
    <data>63653079892,,29.3,155.8,12.25,0.0,0,0,93.76,-87.65,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
</logData>

Python (I used XMLParser() to make the printed output prettier. It is not strictly necessary.)

from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)

tree = etree.parse("input.xml", parser=parser)

for data in tree.xpath("data[string-length(translate(.,' ','')) > 34]"):
    print(etree.tostring(data).decode())

Output (printed to console)

<data>63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079891,,29.3,155.7,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079892,,29.3,155.8,12.25,0.0,0,0,93.76,-87.65,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
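The file in the question puts its elements in the http://www.witsml.org/schemas/131 namespace, so the unprefixed data step used above would presumably match nothing there. Below is a minimal sketch of the same predicate with a namespace prefix. The prefix w, the inline stand-in document, and the use of // to search at any depth are assumptions made for illustration, not part of the original answer.

```python
from lxml import etree

# Namespace URI taken from the question; the prefix "w" is an arbitrary choice.
NS = {"w": "http://www.witsml.org/schemas/131"}

# Inline stand-in for the real file (assumed layout: a default namespace on the root).
EMPTY_ROW = "63653079886" + "," * 23  # 11 digits + 23 commas = 34 characters
FULL_ROW = (
    "63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,"
    "1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9"
)
xml = (
    '<logData xmlns="http://www.witsml.org/schemas/131">'
    f"<data>{EMPTY_ROW}</data>"
    f"<data>{FULL_ROW}</data>"
    "</logData>"
)
root = etree.fromstring(xml)

# Same length test as above, but with the element name bound to the namespace.
rows = root.xpath(
    "//w:data[string-length(translate(., ' ', '')) > 34]",
    namespaces=NS,
)
for data in rows:
    print(data.text)
```

XPath 1.0 has no notion of a default namespace, which is why the prefix has to be declared through the namespaces argument even though the document itself uses none.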

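If you really want to test for the 12.25 value, it gets a little messy in an XPath 1.0 predicate because the length of the string before the value is unknown. You can do it with a series of substring-after() calls nested inside a substring-before(). It is not pretty, though...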

tree.xpath("data[substring-before(substring-after(substring-after(substring-after(substring-after(translate(.,' ',''),','),','),','),','),',') = '12.25']")
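To see that expression at work, here is a minimal sketch run against a cut-down version of input.xml. The helper fifth_field_is and the Python-side comparison via str.split are not from the original answer; they are assumptions added to show an alternative that may be easier to read, along with itertools.dropwhile as one possible way to start iterating at the first matching row.

```python
from itertools import dropwhile
from lxml import etree

ROW_EMPTY = "63653079889" + "," * 23
ROW_890 = (
    "63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,"
    "1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9"
)
ROW_891 = ROW_890.replace("63653079890", "63653079891").replace("155.8", "155.7")

xml = (
    "<logData>"
    f"<data>{ROW_EMPTY}</data>"
    f"<data>{ROW_890}</data>"
    f"<data>{ROW_891}</data>"
    "</logData>"
)
root = etree.fromstring(xml)

# The nested XPath 1.0 expression: drop four fields, then read up to the next comma.
XPATH_12_25 = (
    "data[substring-before("
    "substring-after(substring-after(substring-after(substring-after("
    "translate(.,' ',''),','),','),','),','),',') = '12.25']"
)
via_xpath = root.xpath(XPATH_12_25)


def fifth_field_is(elem, value):
    """Hypothetical helper: compare the 5th comma-separated field to value."""
    fields = (elem.text or "").replace(" ", "").split(",")
    return len(fields) > 4 and fields[4] == value


# Same filter done in Python instead of XPath.
via_python = [d for d in root.iter("data") if fifth_field_is(d, "12.25")]

# Skip leading rows until the first one whose 5th field is 12.25, then keep everything.
from_first_match = list(
    dropwhile(lambda d: not fifth_field_is(d, "12.25"), root.iter("data"))
)

for data in from_first_match:
    print(data.text)
```

Note that the filter and the dropwhile variant differ in intent: the filter drops every non-matching row, while dropwhile only skips the leading run and then yields all remaining rows, matching or not.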