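I'm generating some XML with Python. The code counts how many times each word occurs in a corpus and pairs the word with that number (taken from an NLTK FreqDist over the corpus).
Here is a sample of the XML I want: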
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<Durapipe type="int">1</Durapipe>
<EXPLAIN type="int">2</EXPLAIN>
<woods type="int">2</woods>
<hanging type="int">3</hanging>
<hastily type="int">2</hastily>
<key type="int" name="27p">1</key>
<localized type="int">1</localized>
<Schuster type="int">5</Schuster>
<regularize type="int">1</regularize>
....
</root>
This is the Python I'm using to generate it:
from __future__ import unicode_literals
import nltk.corpus
from nltk import FreqDist
from dicttoxml import dicttoxml, xml_escape
#corpus
words = [w.decode('utf-8', errors='replace') for w in nltk.corpus.reuters.words()]
fd = FreqDist(words)
afd = {xml_escape(k):v for k,v in fd.items()}
# special key for sum
afd['__sum__']=fd.N()
xml = dicttoxml(afd)
f=open('frequencies.xml', 'w')
f.write(xml)
f.close()
Unfortunately, Python doesn't like this. I get the following error:
UnicodeEncodeError Traceback (most recent call last)
C:canopylibrarylocation\site-packages\IPython\utils\py3compat.pyc in execfile(fname, glob, loc)
195 else:
196 filename = fname
--> 197 exec compile(scripttext, filename, 'exec') in glob, loc
198 else:
199 def execfile(fname, *where):
C:\locationoffile\freq2xml.py in <module>()
6
7 #corpus
----> 8 words = [w.decode('utf-8', errors='replace') for w in nltk.corpus.reuters.words()]
9 fd = FreqDist(words)
10 afd = {xml_escape(k):v for k,v in fd.items()}
C:\Users\David Naber\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.4.1.1975.win-x86_64\lib\encodings\utf_8.pyc in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)
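For reference, here is a minimal sketch of what appears to trigger this message. It assumes the corpus reader already returns unicode objects, which the UnicodeEncodeError (rather than a UnicodeDecodeError) suggests. In Python 2, calling .decode() on a unicode object first encodes it with the default ASCII codec, and that implicit step is the one that fails. The sketch is written for Python 3, so the implicit step is spelled out; the helper decode_like_python2 and the sample token are hypothetical and only illustrate the mechanism.

```python
# Python 3 sketch of the implicit step Python 2 performs when
# .decode() is called on a string that is already unicode.

word = "\xfcber"  # hypothetical token starting with u'\xfc', as in the traceback


def decode_like_python2(text):
    """Mimic Python 2's unicode.decode('utf-8'): encode to ASCII first, then decode."""
    raw = text.encode("ascii")  # implicit step; raises on non-ASCII characters
    return raw.decode("utf-8", errors="replace")


try:
    decode_like_python2(word)
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)
```

If that assumption holds, the words would not need decoding at all, and the .decode() call on line 8 could probably be dropped.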
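My code for actually generating the XML works, but I ran into a problem: nltk.corpus.reuters.words() returns a list of words, and not all of them are valid XML names, since some are periods, commas, or slashes. I tried modifying my code to work around that, and now I get the error above.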
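To illustrate the XML-name problem, here is a rough sketch of one possible workaround: keep each token in an attribute value instead of using it as the element name, so punctuation tokens no longer have to be valid names. It uses the standard library's xml.etree.ElementTree rather than dicttoxml, and the element name "word", the attribute names "text" and "count", the function counts_to_xml, and the sample counts are all assumptions made up for the example, not the layout shown above.

```python
import xml.etree.ElementTree as ET


def counts_to_xml(counts):
    """Build <root><word text="..." count="..."/></root> from a word -> count mapping."""
    root = ET.Element("root")
    for word, count in sorted(counts.items()):
        # The token is stored as an attribute value, so "." or "/" is fine;
        # ElementTree escapes &, < and quotes when serializing.
        ET.SubElement(root, "word", {"text": word, "count": str(count)})
    return root


# Hypothetical sample standing in for FreqDist(nltk.corpus.reuters.words())
sample = {"woods": 2, ".": 5, "/": 1, "\xfcber": 3, "AT&T": 4}
xml_bytes = ET.tostring(counts_to_xml(sample), encoding="utf-8")

# Round-trip check: the output parses and every token survives unchanged
parsed = ET.fromstring(xml_bytes)
recovered = {el.get("text"): int(el.get("count")) for el in parsed}
print(recovered == sample)
```

This changes the document shape compared with the sample I posted, so it only fits if whatever reads the file can look up words by attribute.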
Any feedback you can offer would be much appreciated. Thanks in advance!