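Using Python

Time: 2017-04-21 04:18:01

Tags: python xml parsing xml-parsing

I have Python code that parses an XML file and extracts all the tags from it. Now I want to extract specific values associated with particular tags, but I am running into problems. A sample of my XML file is shown below: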

<Cell ss:StyleID="s65"><Data ss:Type="String">Variable Name</Data></Cell>
    <Cell ss:StyleID="s65"><Data ss:Type="String">Variable Label</Data></Cell>
    <Cell ss:StyleID="s79"><Data ss:Type="String">Minimum&#10;Value</Data></Cell>
    <Cell ss:StyleID="s79"><Data ss:Type="String">Maximum&#10;Value</Data></Cell>
    <Cell ss:StyleID="s80"><Data ss:Type="String">Mean&#10;Value</Data></Cell>

   <Row ss:AutoFitHeight="0" ss:Height="15">
    <Cell ss:StyleID="s73"><Data ss:Type="String">Marks</Data></Cell>
    <Cell ss:StyleID="s73"><Data ss:Type="String">Marks of Students</Data></Cell>
    <Cell ss:StyleID="s82"><Data ss:Type="Number">0</Data></Cell>
    <Cell ss:StyleID="s82"><Data ss:Type="Number">96</Data></Cell>
    <Cell ss:StyleID="s83"><Data ss:Type="Number">65.71</Data></Cell>
   </Row>

The above is only part of the full XML file I want to extract from. I wrote this code to print all the tags in the XML file:

import xml.etree.ElementTree
xmlTree = xml.etree.ElementTree.parse('sample_xml.xml').getroot()

elemList = []

for elem in xmlTree.iter():
    elemList.append(elem.tag)

# Just printing out the result

for element in elemList:
    print(element)

When I run this code, I see the following sample output repeated over and over:

{urn:schemas-microsoft-com:office:spreadsheet}Interior
{urn:schemas-microsoft-com:office:spreadsheet}NumberFormat
{urn:schemas-microsoft-com:office:spreadsheet}Protection
{urn:schemas-microsoft-com:office:spreadsheet}Worksheet
{urn:schemas-microsoft-com:office:spreadsheet}Table
{urn:schemas-microsoft-com:office:spreadsheet}Column
{urn:schemas-microsoft-com:office:spreadsheet}Column
{urn:schemas-microsoft-com:office:spreadsheet}Column
{urn:schemas-microsoft-com:office:spreadsheet}Column
{urn:schemas-microsoft-com:office:spreadsheet}Column
{urn:schemas-microsoft-com:office:spreadsheet}Row
{urn:schemas-microsoft-com:office:spreadsheet}Cell
{urn:schemas-microsoft-com:office:spreadsheet}Data
{urn:schemas-microsoft-com:office:spreadsheet}Row
{urn:schemas-microsoft-com:office:spreadsheet}Cell
{urn:schemas-microsoft-com:office:spreadsheet}Data
{urn:schemas-microsoft-com:office:spreadsheet}Row
{urn:schemas-microsoft-com:office:spreadsheet}Cell
{urn:schemas-microsoft-com:office:spreadsheet}Data
{urn:schemas-microsoft-com:office:spreadsheet}Row
{urn:schemas-microsoft-com:office:spreadsheet}Cell
{urn:schemas-microsoft-com:office:spreadsheet}Data
{urn:schemas-microsoft-com:office:spreadsheet}Row
{urn:schemas-microsoft-com:office:spreadsheet}Cell
{urn:schemas-microsoft-com:office:spreadsheet}Data
{urn:schemas-microsoft-com:office:spreadsheet}Row
{urn:schemas-microsoft-com:office:spreadsheet}Cell
{urn:schemas-microsoft-com:office:spreadsheet}Data
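
Each name in this output is in ElementTree's `{namespace-uri}localname` form, which is why every tag carries the same long prefix. As a minimal sketch, the helper below (`split_tag` is a hypothetical name, not part of ElementTree) separates the two parts so the local names are easier to read:

```python
def split_tag(tag):
    """Split an ElementTree tag like '{uri}Row' into (uri, local_name).

    Hypothetical helper for illustration; tags without a namespace
    are returned with an empty URI.
    """
    if tag.startswith("{"):
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return "", tag


uri, local = split_tag("{urn:schemas-microsoft-com:office:spreadsheet}Row")
print(uri)    # urn:schemas-microsoft-com:office:spreadsheet
print(local)  # Row
```

The namespace URI is still needed when searching the tree, so it is usually worth keeping it in a variable rather than discarding it.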

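I can't tell which Cell, Data, and Row elements hold the values I need (Marks, Marks of Students, minimum value, maximum value), as shown in the sample XML at the top. How can I do this?

Update: Following a suggestion, I can extract the text associated with the tags using this code: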

for elem in xmlTree.iter():
    if elem.text is not None:
        print(elem.text)

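The problem now is that my XML file contains a large amount of other text, but I only want the 4 texts that follow these 4 tag texts: Marks, Marks of Students, Minimum Marks, Maximum Marks. I tried using next() so that, when the text of the current tag matches Marks, the iterator moves on to the next tag and keeps matching the next 3 tags in order, but it does not produce the desired result. This is what I wrote:

[code block missing from the source]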
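
The next() idea described above can be sketched as follows. This is not the asker's original code, which is not available; it is a minimal illustration that assumes the wanted values are the non-empty texts immediately following the marker text. `texts_after` and the small `SAMPLE` document are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET

# Small stand-in document (an assumption, modelled on the sample rows above).
SAMPLE = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Row>
<Cell><Data ss:Type="String">Marks</Data></Cell>
<Cell><Data ss:Type="String">Marks of Students</Data></Cell>
<Cell><Data ss:Type="Number">0</Data></Cell>
<Cell><Data ss:Type="Number">96</Data></Cell>
<Cell><Data ss:Type="Number">65.71</Data></Cell>
</Row>
</Workbook>"""


def texts_after(root, marker, count):
    """Return up to `count` non-empty texts following the first text equal to `marker`."""
    # One shared generator, so next() resumes right after the matching element.
    texts = (e.text.strip() for e in root.iter() if e.text and e.text.strip())
    for text in texts:
        if text == marker:
            # next(..., None) avoids StopIteration if the document ends early.
            return [next(texts, None) for _ in range(count)]
    return []


root = ET.fromstring(SAMPLE)
print(texts_after(root, "Marks", 4))
```

The key point is that the for loop and the next() calls consume the same generator; calling next() on a fresh `root.iter()` would start again from the top of the document.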

1 answer:

Answer 0 (score: 0)

I could not reproduce the problem with the XML you posted here, but I suspect your XML file is in a format like this.

<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
<Interior/>
<NumberFormat/>
<Protection/>
<Worksheet ss:Name="Sheet1">
<Table ss:ExpandedColumnCount="6" ss:ExpandedRowCount="2685" x:FullColumns="1"
x:FullRows="1">
<Column ss:AutoFitWidth="0" ss:Width="26.25"/>
<Column ss:AutoFitWidth="0" ss:Width="117" ss:Span="3"/>
<Column ss:Index="6" ss:AutoFitWidth="0" ss:Width="29.25"/>
<Row ss:AutoFitHeight="0" ss:Height="60">
<Cell ss:StyleID="s22"/>
<Cell ss:StyleID="s23"><Data ss:Type="String">Name</Data></Cell>
<Cell ss:StyleID="s23"><Data ss:Type="String">UserName</Data></Cell>
<Cell ss:StyleID="s23"><Data ss:Type="String">Address</Data></Cell>
<Cell ss:StyleID="s23"><Data ss:Type="String">Telephone Number</Data></Cell>
<Cell ss:StyleID="s22"/>
</Row>
<Row ss:AutoFitHeight="0" ss:Height="30">
<Cell ss:StyleID="s22"/>
<Cell ss:StyleID="s24"><Data ss:Type="String">John Smith</Data></Cell>
<Cell ss:StyleID="s24"><Data ss:Type="String">JSmith</Data></Cell>
<Cell ss:StyleID="s24"><Data ss:Type="String">ABC</Data></Cell>
<Cell ss:StyleID="s24"><Data ss:Type="String">(999) 999-9999</Data></Cell>
<Cell ss:StyleID="s22"/>
</Row>
</Table>
</Worksheet>
</Workbook>

If it is in the same format, you can use the following code.

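import xml.etree.ElementTree as etree

# Open in binary mode so the parser handles the encoding itself.
with open('sample.xml', 'rb') as xml_file:
    # iterparse yields (event, element) pairs; by default only "end" events.
    for event, elem in etree.iterparse(xml_file):
        if elem.text is not None:
            print(elem.text)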

I used the following reference to understand and reproduce the code: Reading Excel xml to dictionary
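
Printing every text does not show which row a value belongs to. As a rough sketch of one way to get the specific values, the code below walks the tree row by row using the namespace URI from the tag output and pairs the header row with the Marks row. It assumes the header row starts with "Variable Name" and the data row starts with "Marks", as in the question's sample. `row_values`, `find_row`, and the trimmed `SAMPLE` document are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the tag output in the question.
SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}

# Trimmed-down stand-in for the real file (an assumption built from the sample rows).
SAMPLE = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Worksheet ss:Name="Sheet1"><Table>
<Row>
<Cell><Data ss:Type="String">Variable Name</Data></Cell>
<Cell><Data ss:Type="String">Variable Label</Data></Cell>
<Cell><Data ss:Type="String">Minimum&#10;Value</Data></Cell>
<Cell><Data ss:Type="String">Maximum&#10;Value</Data></Cell>
<Cell><Data ss:Type="String">Mean&#10;Value</Data></Cell>
</Row>
<Row ss:AutoFitHeight="0" ss:Height="15">
<Cell><Data ss:Type="String">Marks</Data></Cell>
<Cell><Data ss:Type="String">Marks of Students</Data></Cell>
<Cell><Data ss:Type="Number">0</Data></Cell>
<Cell><Data ss:Type="Number">96</Data></Cell>
<Cell><Data ss:Type="Number">65.71</Data></Cell>
</Row>
</Table></Worksheet>
</Workbook>"""


def row_values(row):
    """Collect the Data text of each Cell in a Row, collapsing internal whitespace."""
    values = []
    for cell in row.findall("ss:Cell", NS):
        data = cell.find("ss:Data", NS)
        if data is not None and data.text is not None:
            values.append(" ".join(data.text.split()))
    return values


def find_row(root, first_value):
    """Return the values of the first Row whose first non-empty cell equals first_value."""
    for row in root.iter("{%s}Row" % SS):
        values = row_values(row)
        if values and values[0] == first_value:
            return values
    return None


root = ET.fromstring(SAMPLE)
header = find_row(root, "Variable Name")
marks = find_row(root, "Marks")
record = dict(zip(header, marks))
print(record["Minimum Value"], record["Maximum Value"])
```

For a real file, `ET.parse('sample_xml.xml').getroot()` would replace `ET.fromstring(SAMPLE)`. Note that this sketch skips empty cells and ignores `ss:Index` attributes, so columns can shift if a row has gaps; a fuller version would need to track cell positions.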