Parsing several XML files at once in Python

Time: 2014-06-10 17:56:29

Tags: python xml

Hi all!

I'm trying to parse XML files with the lxml package in Python. This is my function:

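import re
from lxml import etree

def read_xml(filename):
    """Read an XML file, strip the tags, and return the text as a string."""
    tree = etree.parse(filename)
    # encoding='unicode' makes tostring return text rather than bytes
    no_tags = etree.tostring(tree, encoding='unicode', method='text')
    # flags must be passed by keyword; the fourth positional argument is count
    return re.sub(r'[^a-zA-Z0-9]', ' ', no_tags, flags=re.UNICODE)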

This function works, but for some XML files I get this error:

Traceback (most recent call last):
  File "/Users/arashsaidi/Work/TextLab/Code/academic_dictionary/file_io/main.py", line 8, in <module>
    read_xml("../Corpus/artikler-xml/fn.xml")
  File "/Users/arashsaidi/Work/TextLab/Code/academic_dictionary/file_io/read_single_file.py", line 14, in read_xml
    tree = etree.parse(filename)
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69970)
  File "parser.pxi", line 1749, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102081)
  File "parser.pxi", line 1775, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:102345)
  File "parser.pxi", line 1679, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:101380)
  File "parser.pxi", line 1110, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:96832)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91772)
lxml.etree.XMLSyntaxError: XML declaration allowed only at the start of the document, line 2, column 6

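The file in question consists of several XML documents concatenated into a single file. Does anyone have suggestions for parsing a file like this?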
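One possible approach, offered only as a minimal sketch: split the raw bytes at each XML declaration and parse every chunk as its own document. It assumes that each embedded document begins with its own `<?xml ...?>` declaration and that the sequence `<?xml ` never occurs inside element content or a CDATA section. The helper names `split_documents` and `read_multi_xml_text` are hypothetical, and the standard library's `xml.etree.ElementTree` is used here in place of lxml so the example is self-contained; lxml's `etree.fromstring` should accept the same byte chunks.

```python
import re
import xml.etree.ElementTree as ET

# Matches the start of an XML declaration such as <?xml version="1.0"?>.
# The trailing \s keeps processing instructions like <?xml-stylesheet ...?>
# from being treated as document boundaries.
XML_DECL = re.compile(rb'<\?xml\s')


def split_documents(data):
    """Split bytes holding several concatenated XML documents into chunks.

    Hypothetical helper: anything before the first declaration is ignored.
    """
    starts = [m.start() for m in XML_DECL.finditer(data)]
    if not starts:
        # No declaration found: treat the input as a single document.
        return [data] if data.strip() else []
    bounds = starts + [len(data)]
    chunks = [data[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]


def read_multi_xml_text(data):
    """Parse each embedded document and return its text with tags removed."""
    texts = []
    for chunk in split_documents(data):
        root = ET.fromstring(chunk)  # bytes input, so the declaration is allowed
        texts.append(''.join(root.itertext()))
    return texts
```

To use it on a file, read the file in binary mode (`open(filename, 'rb').read()`) and pass the result to `read_multi_xml_text`; the regular expression from `read_xml` can then be applied to each returned string.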

0 answers:

No answers yet