Hi everyone!
OK, so I'm trying to use the lxml module (package?) in Python to parse XML files. This is my function:
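import re
from lxml import etree

def read_xml(filename):
    """Read an XML file, strip the tags, and return the text as one string."""
    tree = etree.parse(filename)
    # encoding='unicode' makes lxml return a text string rather than bytes
    no_tags = etree.tostring(tree, encoding='unicode', method='text')
    # flags must be a keyword: the fourth positional argument of re.sub is count
    no_tags = re.sub(r'[^a-zA-Z0-9]', ' ', no_tags, flags=re.UNICODE)
    return no_tags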
This function works, but for some XML files I get this error:
Traceback (most recent call last):
  File "/Users/arashsaidi/Work/TextLab/Code/academic_dictionary/file_io/main.py", line 8, in <module>
    read_xml("../Corpus/artikler-xml/fn.xml")
  File "/Users/arashsaidi/Work/TextLab/Code/academic_dictionary/file_io/read_single_file.py", line 14, in read_xml
    tree = etree.parse(filename)
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69970)
  File "parser.pxi", line 1749, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102081)
  File "parser.pxi", line 1775, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:102345)
  File "parser.pxi", line 1679, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:101380)
  File "parser.pxi", line 1110, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:96832)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91772)
lxml.etree.XMLSyntaxError: XML declaration allowed only at the start of the document, line 2, column 6
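The file that fails consists of several XML documents concatenated into a single file. Does anyone have suggestions for how to parse a file like this?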
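
For illustration, here is a minimal sketch of one possible approach, not a definitive fix: split the raw bytes at each XML declaration and parse every piece separately. It assumes Python 3, that each embedded document starts with its own `<?xml ...?>` declaration, and that no declaration-like text appears inside comments or CDATA sections. The names `split_concatenated_xml` and `extract_texts` are hypothetical helpers, and the standard-library fallback import is only there so the sketch also runs where lxml is not installed.

```python
import re

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library is an acceptable fallback for this sketch
    import xml.etree.ElementTree as etree

# Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
# but not other processing instructions such as <?xml-stylesheet ...?>
XML_DECL = re.compile(rb'<\?xml\s[^>]*\?>')


def split_concatenated_xml(data):
    """Yield one bytes chunk per embedded document, split at XML declarations."""
    starts = [match.start() for match in XML_DECL.finditer(data)]
    if not starts or starts[0] != 0:
        # Keep any content that precedes the first declaration
        starts.insert(0, 0)
    for begin, end in zip(starts, starts[1:] + [len(data)]):
        chunk = data[begin:end].strip()
        if chunk:
            yield chunk


def extract_texts(data):
    """Parse each embedded document and return its tag-free text."""
    texts = []
    for chunk in split_concatenated_xml(data):
        root = etree.fromstring(chunk)
        texts.append(etree.tostring(root, encoding='unicode', method='text'))
    return texts
```

To use it on a file, read the file in binary mode (`open(filename, 'rb').read()`) and pass the bytes to `extract_texts`. Working on bytes lets each chunk keep its own encoding declaration, so the parser can decode every document independently.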