Question

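I am using Python to generate some XML. The code counts how often each word occurs in a corpus and pairs the word with that number (taken from a frequency distribution built over an NLTK corpus).

Here is a sample of the XML I want: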

<?xml version="1.0" encoding="UTF-8" ?>
<root>
    <Durapipe type="int">1</Durapipe>
    <EXPLAIN type="int">2</EXPLAIN>
    <woods type="int">2</woods>
    <hanging type="int">3</hanging>
    <hastily type="int">2</hastily>
    <key type="int" name="27p">1</key>
    <localized type="int">1</localized>
    <Schuster type="int">5</Schuster>
    <regularize type="int">1</regularize>
    ....
</root>

This is the Python I am using to generate it:

from __future__ import unicode_literals

import nltk.corpus
from nltk import FreqDist
from dicttoxml import dicttoxml, xml_escape

#corpus
words = [w.decode('utf-8', errors='replace') for w in nltk.corpus.reuters.words()]
fd = FreqDist(words)
afd = {xml_escape(k):v for k,v in fd.items()}

# special key for sum
afd['__sum__']=fd.N()

xml = dicttoxml(afd)

f=open('frequencies.xml', 'w')
f.write(xml)
f.close()

Unfortunately, Python does not like this. I get the following error:

UnicodeEncodeError                        Traceback (most recent call last)
C:canopylibrarylocation\site-packages\IPython\utils\py3compat.pyc in execfile(fname, glob, loc)
    195             else:
    196                 filename = fname
--> 197             exec compile(scripttext, filename, 'exec') in glob, loc
    198     else:
    199         def execfile(fname, *where):

C:\locationoffile\freq2xml.py in <module>()
      6 
      7 #corpus
----> 8 words = [w.decode('utf-8', errors='replace') for w in nltk.corpus.reuters.words()]
      9 fd = FreqDist(words)
     10 afd = {xml_escape(k):v for k,v in fd.items()}

C:\Users\David Naber\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.4.1.1975.win-x86_64\lib\encodings\utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)

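My code did work for actually generating the XML, but I ran into the problem that nltk.corpus.reuters.words() returns a list of words, not all of which are valid XML names, because some of them are periods, commas, or slashes. I tried modifying my code to deal with that, and now I get the error above.

Any feedback you can provide would be greatly appreciated. Thanks in advance!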
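
For reference, the traceback suggests that the corpus words are already unicode strings. In Python 2, calling .decode() on a unicode string first encodes it implicitly with the ASCII codec, which is what raises UnicodeEncodeError on a character such as u'\xfc'. Below is a minimal sketch of one way to avoid both that call and the invalid-element-name problem. It is an assumption-laden illustration, not a confirmed fix: it targets Python 3, uses the standard library's xml.etree.ElementTree instead of dicttoxml, stores each word in a hypothetical "text" attribute of a fixed "word" element, and uses a small hand-written word list with collections.Counter as a stand-in for the Reuters corpus and FreqDist.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def frequencies_to_xml(counts):
    """Serialize a word -> count mapping as XML bytes.

    Each word is stored in an attribute of a fixed <word> element, so
    tokens such as "," or "/" never have to be valid XML element names.
    The element and attribute names are illustrative choices.
    """
    root = ET.Element("root")
    for word, count in sorted(counts.items()):
        entry = ET.SubElement(root, "word", {"text": word, "type": "int"})
        entry.text = str(count)
    # Total token count, playing the role of the '__sum__' key above.
    root.set("sum", str(sum(counts.values())))
    # ElementTree escapes special characters in attributes and text itself.
    return ET.tostring(root, encoding="utf-8")


# Hypothetical stand-in for nltk.corpus.reuters.words(); no decode() call
# is applied because the items are already text strings.
sample_words = ["woods", ",", "für", "woods", "/", "<"]
xml_bytes = frequencies_to_xml(Counter(sample_words))

with open("frequencies.xml", "wb") as f:
    f.write(xml_bytes)
```

With the real corpus, the Counter would presumably be replaced by FreqDist(nltk.corpus.reuters.words()), which exposes the same items() interface. The output format differs from the sample above, since the word moves from the element name into an attribute.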

Error in Python code that generates XML using nltk

0 answers: