Error in Python code that generates XML using NLTK

Asked: 2014-08-12 17:52:34

Tags: python xml nltk

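I am using Python to generate some XML. The code counts how many times each word occurs in a corpus and pairs each word with that count, taken from a frequency distribution (NLTK's FreqDist) built over an NLTK corpus.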

Here is a sample of the XML I want:

<?xml version="1.0" encoding="UTF-8" ?>
<root>
    <Durapipe type="int">1</Durapipe>
    <EXPLAIN type="int">2</EXPLAIN>
    <woods type="int">2</woods>
    <hanging type="int">3</hanging>
    <hastily type="int">2</hastily>
    <key type="int" name="27p">1</key>
    <localized type="int">1</localized>
    <Schuster type="int">5</Schuster>
    <regularize type="int">1</regularize>
    ....
</root>

This is the Python I am using to generate it:

from __future__ import unicode_literals

import nltk.corpus
from nltk import FreqDist
from dicttoxml import dicttoxml, xml_escape

#corpus
words = [w.decode('utf-8', errors='replace') for w in nltk.corpus.reuters.words()]
fd = FreqDist(words)
afd = {xml_escape(k):v for k,v in fd.items()}

# special key for sum
afd['__sum__']=fd.N()

xml = dicttoxml(afd)

f=open('frequencies.xml', 'w')
f.write(xml)
f.close()

Unfortunately, Python does not like this. I get the following error:

UnicodeEncodeError                        Traceback (most recent call last)
C:canopylibrarylocation\site-packages\IPython\utils\py3compat.pyc in execfile(fname, glob, loc)
    195             else:
    196                 filename = fname
--> 197             exec compile(scripttext, filename, 'exec') in glob, loc
    198     else:
    199         def execfile(fname, *where):

C:\locationoffile\freq2xml.py in <module>()
      6 
      7 #corpus
----> 8 words = [w.decode('utf-8', errors='replace') for w in nltk.corpus.reuters.words()]
      9 fd = FreqDist(words)
     10 afd = {xml_escape(k):v for k,v in fd.items()}

C:\Users\David Naber\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.4.1.1975.win-x86_64\lib\encodings\utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128) 
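
One plausible reading of this traceback, assuming the script runs under Python 2 and that nltk.corpus.reuters.words() already yields unicode objects rather than byte strings: calling .decode() on a unicode object makes Python 2 first encode it implicitly with the ASCII codec, and that hidden step is what fails on u'\xfc'. Below is a minimal sketch of a guard that decodes only byte strings. The helper name to_text and the sample tokens are hypothetical and not part of NLTK.

```python
def to_text(token, encoding="utf-8"):
    """Return token as text, decoding only if it is a byte string."""
    if isinstance(token, bytes):
        # Raw bytes: decode, replacing any malformed sequences.
        return token.decode(encoding, "replace")
    # Already text (unicode in Python 2, str in Python 3): leave untouched.
    return token


# Stand-in tokens (not real corpus output): one byte string, two text strings.
sample = [b"M\xc3\xbcller", u"\xfcber", u","]
words = [to_text(w) for w in sample]
```

In the script above, this would correspond to replacing the w.decode(...) call with to_text(w), or to dropping the decoding step entirely if the tokens turn out to be unicode already.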

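My code used to actually generate the XML, but I ran into the problem that nltk.corpus.reuters.words() returns a list of words, not all of which are valid XML names, because some are periods, commas, or slashes. I tried modifying my code to work around this, and now I get the error above.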
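
The invalid-name problem described here can also be sidestepped by never using the token as an element name. The sketch below uses only the standard library's xml.etree.ElementTree rather than dicttoxml, and stores each token in a name attribute of a fixed <word> element, much like the <key name="27p"> entry in the sample output. The function name counts_to_xml, the element name word, and the sample counts are assumptions for illustration, and the resulting layout differs from the sample XML.

```python
import xml.etree.ElementTree as ET


def counts_to_xml(counts):
    """Serialize a {token: count} mapping to XML bytes."""
    root = ET.Element("root")
    for token, count in sorted(counts.items()):
        # The token goes into an attribute, so it need not be a valid XML name.
        elem = ET.SubElement(root, "word", {"type": "int", "name": token})
        elem.text = str(count)
    return ET.tostring(root, encoding="utf-8")


# Made-up counts, including tokens that are not valid XML element names.
sample_counts = {u"woods": 2, u",": 7, u"/": 1, u"27p": 1, u"\xfcber": 3}
xml_bytes = counts_to_xml(sample_counts)
```

Because the result is a byte string, it would be written to disk with the file opened in binary mode, for example open('frequencies.xml', 'wb').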

Any feedback you can offer would be greatly appreciated. Thanks in advance!

0 answers:

No answers