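I have Python code that parses an XML file and extracts all of its tags. Now I want to extract the specific values associated with a tag, but I am running into problems. A sample of my XML file looks like this: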
<Cell ss:StyleID="s65"><Data ss:Type="String">Variable Name</Data></Cell>
<Cell ss:StyleID="s65"><Data ss:Type="String">Variable Label</Data></Cell>
<Cell ss:StyleID="s79"><Data ss:Type="String">Minimum Value</Data></Cell>
<Cell ss:StyleID="s79"><Data ss:Type="String">Maximum Value</Data></Cell>
<Cell ss:StyleID="s80"><Data ss:Type="String">Mean Value</Data></Cell>
<Row ss:AutoFitHeight="0" ss:Height="15">
<Cell ss:StyleID="s73"><Data ss:Type="String">Marks</Data></Cell>
<Cell ss:StyleID="s73"><Data ss:Type="String">Marks of Students</Data></Cell>
<Cell ss:StyleID="s82"><Data ss:Type="Number">0</Data></Cell>
<Cell ss:StyleID="s82"><Data ss:Type="Number">96</Data></Cell>
<Cell ss:StyleID="s83"><Data ss:Type="Number">65.71</Data></Cell>
</Row>
The above is only part of the whole XML file I want to extract from. I wrote this code to print all the tags in the XML file:
import xml.etree.ElementTree

xmlTree = xml.etree.ElementTree.parse('sample_xml.xml').getroot()

# Collect the tag of every element in document order
elemList = []
for elem in xmlTree.iter():
    elemList.append(elem.tag)

# Just printing out the result
for element in elemList:
    print(element)
When I run this code, the output is a series of repeats of the following sample:
{urn:schemas-microsoft-com:office:spreadsheet}Interior
{urn:schemas-microsoft-com:office:spreadsheet}NumberFormat
{urn:schemas-microsoft-com:office:spreadsheet}Protection
{urn:schemas-microsoft-com:office:spreadsheet}Worksheet
{urn:schemas-microsoft-com:office:spreadsheet}Table
{urn:schemas-microsoft-com:office:spreadsheet}Column
{urn:schemas-microsoft-com:office:spreadsheet}Column
{urn:schemas-microsoft-com:office:spreadsheet}Column
{urn:schemas-microsoft-com:office:spreadsheet}Column
{urn:schemas-microsoft-com:office:spreadsheet}Column
{urn:schemas-microsoft-com:office:spreadsheet}Row
{urn:schemas-microsoft-com:office:spreadsheet}Cell
{urn:schemas-microsoft-com:office:spreadsheet}Data
{urn:schemas-microsoft-com:office:spreadsheet}Row
{urn:schemas-microsoft-com:office:spreadsheet}Cell
{urn:schemas-microsoft-com:office:spreadsheet}Data
{urn:schemas-microsoft-com:office:spreadsheet}Row
{urn:schemas-microsoft-com:office:spreadsheet}Cell
{urn:schemas-microsoft-com:office:spreadsheet}Data
{urn:schemas-microsoft-com:office:spreadsheet}Row
{urn:schemas-microsoft-com:office:spreadsheet}Cell
{urn:schemas-microsoft-com:office:spreadsheet}Data
{urn:schemas-microsoft-com:office:spreadsheet}Row
{urn:schemas-microsoft-com:office:spreadsheet}Cell
{urn:schemas-microsoft-com:office:spreadsheet}Data
{urn:schemas-microsoft-com:office:spreadsheet}Row
{urn:schemas-microsoft-com:office:spreadsheet}Cell
{urn:schemas-microsoft-com:office:spreadsheet}Data
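I don't know which Cell, Data, or Row elements hold the values I need (Marks, Marks of Students, minimum value, maximum value), as shown in the sample XML at the beginning. How can I do this?
Update: Following the suggestions, I can extract the text associated with the tags using this code: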
for elem in xmlTree.iter():
    if elem.text is not None:
        print(elem.text)
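The problem now is that my XML file contains a large amount of different text, but I only want to extract the 4 texts that follow these 4 label texts: Marks, Marks of Students, Minimum Marks, Maximum Marks. I tried using next() so that, when the current tag matches Marks, the iterator moves to the next tag and continues matching the next 3 tags in sequence, but this does not produce the desired result. This is what I wrote:
[code missing from the source]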
Answer 0 (score: 0)
I could not reproduce the problem with the XML you posted here. I suspect your XML file is actually in this format:
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<Interior/>
<NumberFormat/>
<Protection/>
<Worksheet ss:Name="Sheet1">
<Table ss:ExpandedColumnCount="6" ss:ExpandedRowCount="2685" x:FullColumns="1"
x:FullRows="1">
<Column ss:AutoFitWidth="0" ss:Width="26.25"/>
<Column ss:AutoFitWidth="0" ss:Width="117" ss:Span="3"/>
<Column ss:Index="6" ss:AutoFitWidth="0" ss:Width="29.25"/>
<Row ss:AutoFitHeight="0" ss:Height="60">
<Cell ss:StyleID="s22"/>
<Cell ss:StyleID="s23"><Data ss:Type="String">Name</Data></Cell>
<Cell ss:StyleID="s23"><Data ss:Type="String">UserName</Data></Cell>
<Cell ss:StyleID="s23"><Data ss:Type="String">Address</Data></Cell>
<Cell ss:StyleID="s23"><Data ss:Type="String">Telephone Number</Data></Cell>
<Cell ss:StyleID="s22"/>
</Row>
<Row ss:AutoFitHeight="0" ss:Height="30">
<Cell ss:StyleID="s22"/>
<Cell ss:StyleID="s24"><Data ss:Type="String">John Smith</Data></Cell>
<Cell ss:StyleID="s24"><Data ss:Type="String">JSmith</Data></Cell>
<Cell ss:StyleID="s24"><Data ss:Type="String">ABC</Data></Cell>
<Cell ss:StyleID="s24"><Data ss:Type="String">(999) 999-9999</Data></Cell>
<Cell ss:StyleID="s22"/>
</Row>
</Table>
</Worksheet>
</Workbook>
If it is, you can use the following code.
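# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
import xml.etree.ElementTree as etree

# iterparse yields (event, element) pairs; by default only "end" events are reported
with open('sample.xml', 'rb') as xml_file:
    for event, elem in etree.iterparse(xml_file):
        if elem.text is not None:
            print(elem.text)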
I used the following reference to understand the problem and copied the code from it: Reading Excel xml to dictionary
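If the goal is to get only the values in the row labelled Marks, a namespace-aware lookup may be simpler than stepping through an iterator with next(). The following is a minimal sketch, not a definitive solution. It assumes the rows sit inside a Workbook/Worksheet/Table wrapper like the one suspected above, and that the label is the first non-empty cell of its row. The helper name row_values, the NS mapping, and the embedded SAMPLE string are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the printed tags; the "ss" key is only a local alias.
NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}

def row_values(root, label):
    """Return the Data texts of the first Row whose first Data text equals label.

    Hypothetical helper: cells without a Data child are skipped, so the
    comparison is against the first non-empty cell of each row.
    """
    for row in root.iter("{%s}Row" % NS["ss"]):
        texts = [data.text for data in row.findall("ss:Cell/ss:Data", NS)]
        if texts and texts[0] == label:
            return texts
    return None

# Assumed wrapper around the rows from the question (the real file may differ).
SAMPLE = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Worksheet ss:Name="Sheet1"><Table>
<Row>
<Cell ss:StyleID="s65"><Data ss:Type="String">Variable Name</Data></Cell>
<Cell ss:StyleID="s65"><Data ss:Type="String">Variable Label</Data></Cell>
<Cell ss:StyleID="s79"><Data ss:Type="String">Minimum Value</Data></Cell>
<Cell ss:StyleID="s79"><Data ss:Type="String">Maximum Value</Data></Cell>
<Cell ss:StyleID="s80"><Data ss:Type="String">Mean Value</Data></Cell>
</Row>
<Row ss:AutoFitHeight="0" ss:Height="15">
<Cell ss:StyleID="s73"><Data ss:Type="String">Marks</Data></Cell>
<Cell ss:StyleID="s73"><Data ss:Type="String">Marks of Students</Data></Cell>
<Cell ss:StyleID="s82"><Data ss:Type="Number">0</Data></Cell>
<Cell ss:StyleID="s82"><Data ss:Type="Number">96</Data></Cell>
<Cell ss:StyleID="s83"><Data ss:Type="Number">65.71</Data></Cell>
</Row>
</Table></Worksheet></Workbook>"""

root = ET.fromstring(SAMPLE)
print(row_values(root, "Marks"))
# prints ['Marks', 'Marks of Students', '0', '96', '65.71']
```

To run this against the real file, replace the ET.fromstring(SAMPLE) call with ET.parse('sample_xml.xml').getroot(). The values come back as strings, so convert the numeric ones with float() if needed.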