Ternip unable to format string

Time: 2016-01-26 05:58:10

Tags: python xml

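I'm trying to use the Ternip library to add temporal tags to text.

To preprocess a document, I have to run it through one of the annotators; I'm currently using the TIMEX3 one.

The documentation states that it should accept an XML document, and I'm not entirely sure how to pass one in. If I try to use a string as input: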

TT = Timex3XmlDocument(sampledoc)

I get the following error:

    221         parser = self.getParser()
    222         try:
--> 223             parser.Parse(string, True)
    224             self._setup_subset(string)
    225         except ParseEscape:

ExpatError: syntax error: line 1, column 0

Any ideas on how to pass the document in correctly so that it can be annotated?

1 answer:

Answer 0 (score: 1)

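According to the source code of XmlDocument (the base class of Timex3XmlDocument), the argument should be either an instance of xml.dom.minidom.Document or a string containing a well-formed XML document, so that it can be parsed into a Document object (the relevant part of the source is included here for easy reference):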

class XmlDocument(object):
    def __init__(self, file, nodename=None, has_S=False, has_LEX=False, pos_attr=False):
        if isinstance(file, xml.dom.minidom.Document):
            self._xml_doc = file
        else:
            self._xml_doc = xml.dom.minidom.parseString(file)
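
The else branch above hands the string straight to xml.dom.minidom.parseString, so the failure in the question can probably be reproduced without Ternip at all. The following is a minimal sketch using only the standard library; the `plain_text` sample is made up for illustration and is not the asker's actual document.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Hypothetical input: plain text with no XML markup at all.
plain_text = "The alliance expanded on Friday."

try:
    xml.dom.minidom.parseString(plain_text)
    error_message = None
except ExpatError as exc:
    # The parser gives up at the very start because there is no root element.
    error_message = str(exc)

print(error_message)
```

If this prints an Expat syntax error pointing at the start of line 1, the input is simply not XML, and the problem lies with the string rather than with Timex3XmlDocument.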

So, in your particular case, just make sure the sampledoc variable refers to a well-formed XML string. For example, the following works fine for me:

>>> from ternip.formats.timex3 import Timex3XmlDocument
>>> raw = '''<root>
... INDEPENDENCE, Mo. _ The North Atlantic Treaty Organization embraced three of its former rivals, the Czech Republic, Hungary and Poland on <TIMEX3 tid="t3" type="DATE" value="1999-03-12">Friday</TIMEX3>, formally ending the Soviet domination of those nations that began after World War II and opening a new path for the military alliance
... </root>'''
>>> doc = Timex3XmlDocument(raw)
>>> print doc
<?xml version="1.0" ?><root>
INDEPENDENCE, Mo. _ The North Atlantic Treaty Organization embraced three of its former rivals, the Czech Republic, Hungary and Poland on <TIMEX3 tid="t3" type="DATE" value="1999-03-12">Friday</TIMEX3>, formally ending the Soviet domination of those nations that began after World War II and opening a new path for the military alliance
</root>
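
If the input is untagged plain text rather than XML, one possible approach is to wrap it in a single root element before handing it over. The sketch below assumes that case; `wrap_as_xml` is a hypothetical helper (not part of Ternip), and the sample sentence is invented. It escapes `&`, `<` and `>`, so it should not be used on text that already contains TIMEX3 or other tags, since those would be escaped too.

```python
import xml.dom.minidom
from xml.sax.saxutils import escape


def wrap_as_xml(text, root_tag="root"):
    """Wrap plain text in one root element, escaping &, < and >."""
    return "<{0}>{1}</{0}>".format(root_tag, escape(text))


# Hypothetical plain-text input containing characters that are special in XML.
sample = "Profits at AT&T rose < 5% on Friday."
wrapped = wrap_as_xml(sample)

# The wrapped string is well-formed, so minidom parses it without an ExpatError.
doc = xml.dom.minidom.parseString(wrapped)
print(doc.documentElement.tagName)
```

Based on the constructor shown in the answer, a string like `wrapped` should then be accepted by Timex3XmlDocument, although that last step is not exercised here.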