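I'm trying to use the Ternip library to add temporal tags to text.
To preprocess a document, I have to run it through one of the annotators; I'm currently using the TIMEX3 one.
Now, it states that it should take an XML document, and I'm not entirely sure how to feed one in. If I try using an input string: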
TT = Timex3XmlDocument(sampledoc)
I get the following error:
221 parser = self.getParser()
222 try:
--> 223 parser.Parse(string, True)
224 self._setup_subset(string)
225 except ParseEscape:
ExpatError: syntax error: line 1, column 0
Any ideas on how to input the document properly so that it gets annotated correctly?
Answer 0 (score: 1)
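According to the source code of XmlDocument (the base class of Timex3XmlDocument), the argument should be either an instance of xml.dom.minidom.Document or a string containing a well-formed XML document, so that it can be parsed into a Document object (the relevant part of the source code is included here for easy reference):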
class XmlDocument(object):
    def __init__(self, file, nodename=None, has_S=False, has_LEX=False, pos_attr=False):
        if isinstance(file, xml.dom.minidom.Document):
            self._xml_doc = file
        else:
            self._xml_doc = xml.dom.minidom.parseString(file)
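The else branch is where a plain-text string fails: xml.dom.minidom.parseString raises ExpatError when the input is not a well-formed XML document. Here is a minimal sketch of a pre-check that uses only the standard library; the helper name is_well_formed and the sample strings are made up for illustration and are not part of Ternip:

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if minidom can parse `text` as an XML document.

    Hypothetical helper, not part of Ternip. It calls the same
    parseString function that XmlDocument.__init__ uses for strings.
    """
    try:
        xml.dom.minidom.parseString(text)
    except ExpatError:
        return False
    return True


# Illustrative inputs: bare text versus the same text inside a root element.
plain = "The treaty was signed on Friday."
wrapped = "<root>The treaty was signed on Friday.</root>"

print(is_well_formed(plain))
print(is_well_formed(wrapped))
```

Running a check like this on the input first should show whether the string itself is the problem before Ternip is involved at all.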
So, in your particular case, just make sure that the sampledoc variable refers to a well-formed XML string. For example, the following works fine for me:
>>> from ternip.formats.timex3 import Timex3XmlDocument
>>> raw = '''<root>
... INDEPENDENCE, Mo. _ The North Atlantic Treaty Organizationembraced three of its former rivals, the Czech Republic,Hungary and Poland on <TIMEX3 tid="t3" type="DATE" value="1999-03-12">Friday</TIMEX3>, formally ending the Sovietdomination of those nations that began after World War IIand opening a new path for the military alliance
... </root>'''
...
>>> doc = Timex3XmlDocument(raw)
>>> print doc
<?xml version="1.0" ?><root>
INDEPENDENCE, Mo. _ The North Atlantic Treaty Organizationembraced three of its former rivals, the Czech Republic,Hungary and Poland on <TIMEX3 tid="t3" type="DATE" value="1999-03-12">Friday</TIMEX3>, formally ending the Sovietdomination of those nations that began after World War IIand opening a new path for the military alliance
</root>
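If the input is plain text rather than XML, one possible way to get a well-formed string is to escape the special characters and wrap the text in a single root element. The sketch below uses only the standard library; the helper wrap_plain_text, the root tag name and the sample sentence are assumptions made for illustration, and I have not checked whether Ternip's later tagging steps expect a particular root element:

```python
import xml.dom.minidom
from xml.sax.saxutils import escape


def wrap_plain_text(text, root_tag="root"):
    """Escape &, < and > in `text` and wrap it in one root element.

    Hypothetical helper, not part of Ternip.
    """
    return "<{0}>{1}</{0}>".format(root_tag, escape(text))


# Illustrative input containing characters that are not allowed raw in XML.
sampledoc = "Profits at AT&T rose <5% on Friday."
xml_string = wrap_plain_text(sampledoc)

# Same parsing call that XmlDocument.__init__ makes for string input.
doc = xml.dom.minidom.parseString(xml_string)

# The parsed text node holds the original, unescaped text.
print(doc.documentElement.firstChild.data)
```

Going by the constructor shown above, the resulting xml_string should be usable as Timex3XmlDocument(xml_string).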