lxml parsing XML: "missing root" error

Time: 2018-05-13 07:33:00

Tags: python xml xml-parsing lxml utf-16

I am trying to parse an XML file so that I can work with the data it contains.

It is 9 million lines long, so I won't post it here.

Here is my code:

from lxml import etree

parser = etree.XMLParser(recover = True, encoding = 'utf-16')

tree = etree.parse('xml_parts.xml', parser)

ns = {'d': 'http://www.w3.org/2001/XMLSchema-instance'}

tree.find('d:database', ns)

Here is the first part of the XML file (it is UTF-16 encoded, but the encoding is not declared in the header):

<?xml version="1.0"?>
<mysqldump xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<database name="parts_bbdb">
  <table_structure name="parts">
    <field Field="part_id" Type="int(11)" Null="NO" Key="PRI" Extra="auto_increment" Comment="" />
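
Since the XML declaration does not name an encoding, one way to double-check the UTF-16 assumption is to look at the byte order mark at the start of the file. The following is a minimal sketch using only the standard library; `sniff_bom` and `sniff_file` are hypothetical helper names, not part of lxml, and a file without a BOM simply yields `None`.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(prefix):
    """Return an encoding name if `prefix` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name
    return None


def sniff_file(path):
    """Read only the first four bytes of `path` and inspect them for a BOM."""
    with open(path, "rb") as handle:
        return sniff_bom(handle.read(4))
```

Calling `sniff_file('xml_parts.xml')` would then show whether the file really starts with a UTF-16 BOM before `encoding='utf-16'` is passed to the parser.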

The error I get is:

Traceback (most recent call last):
File "./AtomParse", line 13, in <module>
tree.find('{http://www.w3.org/2001/XMLSchema-instance}database')
File "src/lxml/etree.pyx", line 2208, in lxml.etree._ElementTree.find (src/lxml/etree.c:68635)
File "src/lxml/etree.pyx", line 1876, in lxml.etree._ElementTree._assertHasRoot (src/lxml/etree.c:65215)
AssertionError: ElementTree not initialized, missing root 

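I have never parsed XML before, but from reading the lxml documentation I thought this should work.

I know the overall structure of the XML file, and once I can access the elements' attributes I will be fine, but something is going wrong.

If anyone could point me in the right direction, that would be great. Thanks!

Edit: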
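
A separate point about the lookup itself: `xmlns:xsi="..."` only binds the `xsi` prefix, so `mysqldump` and `database` are in no namespace, and a namespaced search such as `d:database` will not match them even after a successful parse. The sketch below illustrates this on a small in-memory sample; the `table_data` wrapper and the shortened `row` are assumptions made for illustration, and the fallback import is only there so the snippet runs without lxml installed.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: the standard library offers the same calls used below.
    import xml.etree.ElementTree as etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Shortened sample; the table_data wrapper is a hypothetical stand-in
# for whatever actually surrounds the rows in the real file.
SAMPLE = b"""<?xml version="1.0"?>
<mysqldump xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<database name="parts_bbdb">
  <table_data name="parts">
    <row>
      <field name="part_id">2557</field>
      <field name="description" xsi:nil="true" />
    </row>
  </table_data>
</database>
</mysqldump>"""

root = etree.fromstring(SAMPLE)

# Plain tag name: the element carries no namespace.
database = root.find("database")

# Namespaced lookup: finds nothing, because only attributes use xsi here.
namespaced = root.find("d:database", {"d": XSI})

# The xsi prefix matters for the attribute, written in {uri}name form.
nil_field = root.find(".//field[@name='description']")
is_nil = nil_field.get("{%s}nil" % XSI)

print(database.get("name"), namespaced, is_nil)
```

The "missing root" assertion in the traceback suggests the parse produced no root element at all, so this lookup issue would only surface once the file itself parses.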

<row>
    <field name="part_id">2557</field>
    <field name="ok">0</field>
    <field name="part_name">BBa_S01288</field>
    <field name="short_desc">Intermediate part from assembly 236</field>
    <field name="description" xsi:nil="true" />
    <field name="part_type">Intermediate</field>
    <field name="author">Randy Rettberg</field>
    <field name="owning_group_id">7</field>
    <field name="status">Deleted</field>
    <field name="dominant">0</field>
    <field name="informational">0</field>
    <field name="discontinued">1</field>
    <field name="part_status"></field>
    <field name="sample_status">Discontinued</field>
    <field name="p_status_cache"></field>
    <field name="s_status_cache"></field>
    <field name="creation_date">2003-12-03</field>
    <field name="m_datetime">2015-05-08 14:14:17</field>
    <field name="m_user_id">0</field>
    <field name="uses">0</field>
    <field name="doc_size">686</field>
    <field name="works"></field>
    <field name="favorite">0</field>
    <field name="specified_u_list">_149_156_603_145_193_147_161_603_145_</field>
    <field name="deep_u_list">_149_156_603_145_193_147_161_603_145_</field>
    <field name="deep_count">9</field>
    <field name="ps_string" xsi:nil="true" />
    <field name="scars"></field>
    <field name="default_scars"></field>
    <field name="owner_id">24</field>
    <field name="group_u_list">_1_</field>
    <field name="has_barcode">0</field>
    <field name="notes" xsi:nil="true" />
    <field name="source"></field>
    <field name="nickname"></field>
    <field name="categories">//classic/intermediate/uncategorized</field>
    <field name="sequence">tcacacaggaaa</field>
    <field name="sequence_sha1">÷?¾TŸ]°f ÜèÕ?]Mò</field>
    <field name="sequence_update">5</field>
    <field name="seq_edit_cache">&lt;script 

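In the snippet above, I think the encoding problem comes from the "sequence_sha1" line, third from the bottom. There are 400,000 blocks like this one, and each has a line like that.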
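
One way to test that suspicion is to ask Python where decoding first fails. This is a minimal sketch, assuming the bytes of interest fit in memory; `first_decode_error` is a hypothetical helper, and the two byte strings are made-up inputs rather than data from the real file.

```python
def first_decode_error(data, encoding):
    """Return the byte offset where decoding `data` first fails, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first undecodable byte in `data`.
        return exc.start
    return None


# Made-up inputs: one valid UTF-16 string, one UTF-8 string with a stray byte.
clean = '<field name="part_id">2557</field>'.encode("utf-16")
broken = b"<field>ok</field>\xff"

print(first_decode_error(clean, "utf-16"))
print(first_decode_error(broken, "utf-8"))
```

For a 9-million-line file it would probably make more sense to apply this to a slice around one suspect row than to the whole file at once.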

0 answers:

No answers yet