Question

I'm trying to use the Ternip library to add temporal tags to text.

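To preprocess a document, I have to run it through one of the annotators; I'm currently using the TIMEX3 one.

It states that it should take an XML document, but I'm not sure how to pass one in. If I try with an input string: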

TT = Timex3XmlDocument(sampledoc)

I get the following error:

    221         parser = self.getParser()
    222         try:
--> 223             parser.Parse(string, True)
    224             self._setup_subset(string)
    225         except ParseEscape:

ExpatError: syntax error: line 1, column 0

Any ideas on how to pass the document in correctly so that it gets annotated properly?

Answer 1

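According to the source code of XmlDocument (the base class of Timex3XmlDocument), the argument should be either an instance of xml.dom.minidom.Document or a string containing a well-formed XML document, so that it can be parsed into a Document object (the relevant part of the source is included here for reference):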

class XmlDocument(object):
    def __init__(self, file, nodename=None, has_S=False, has_LEX=False, pos_attr=False):
        if isinstance(file, xml.dom.minidom.Document):
            self._xml_doc = file
        else:
            self._xml_doc = xml.dom.minidom.parseString(file)
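
Since the constructor hands any non-Document argument to xml.dom.minidom.parseString, the error from the question can be reproduced without Ternip at all. The following is a minimal sketch using only the standard library; the helper name is_well_formed and the sample sentences are made up for illustration and are not part of Ternip:

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if `text` parses with the same call XmlDocument uses."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        # Raised for plain text, mismatched tags, and other malformed input
        return False


# Hypothetical sample inputs, not taken from the question
plain = "The alliance expanded on Friday."
wrapped = "<root>The alliance expanded on Friday.</root>"

print(is_well_formed(plain))    # plain text has no root element
print(is_well_formed(wrapped))  # a single root element makes it parseable
```

Running a check like this on sampledoc before constructing the Timex3XmlDocument shows whether the problem lies in the input string rather than in Ternip itself.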

So in your case, just make sure that the sampledoc variable refers to a well-formed XML string. For example, the following works fine for me:

>>> from ternip.formats.timex3 import Timex3XmlDocument
>>> raw = '''<root>
... INDEPENDENCE, Mo. _ The North Atlantic Treaty Organization embraced three of its former rivals, the Czech Republic, Hungary and Poland on <TIMEX3 tid="t3" type="DATE" value="1999-03-12">Friday</TIMEX3>, formally ending the Soviet domination of those nations that began after World War II and opening a new path for the military alliance
... </root>'''
>>> doc = Timex3XmlDocument(raw)
>>> print doc
<?xml version="1.0" ?><root>
INDEPENDENCE, Mo. _ The North Atlantic Treaty Organization embraced three of its former rivals, the Czech Republic, Hungary and Poland on <TIMEX3 tid="t3" type="DATE" value="1999-03-12">Friday</TIMEX3>, formally ending the Soviet domination of those nations that began after World War II and opening a new path for the military alliance
</root>
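
If the input is plain text rather than XML, one way to get a well-formed string is to escape the text and wrap it in a single root element. This is only a sketch: wrap_as_xml is a hypothetical helper, the element name root is borrowed from the example above, and the sample sentence is invented. It should not be used on text that already contains TIMEX3 tags, because escaping would turn those tags into literal text.

```python
import xml.dom.minidom
from xml.sax.saxutils import escape


def wrap_as_xml(text, root="root"):
    """Escape &, < and > in plain text and wrap it in one root element.

    Hypothetical helper for illustration; not part of Ternip.
    """
    return "<{0}>{1}</{0}>".format(root, escape(text))


# Invented sample containing characters that are special in XML
sample = "Profits at AT&T rose <5% on Friday."
xml_string = wrap_as_xml(sample)

# Parse with the same call XmlDocument uses to confirm it is well-formed
doc = xml.dom.minidom.parseString(xml_string)
print(xml_string)
```

The resulting xml_string is the kind of argument that could then be passed to Timex3XmlDocument in place of the raw text.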
