Question

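The code below is supposed to convert a bz2 dump to plain text. However, I am getting a Unicode error. Since I am using utf-8, I am wondering what the cause of the error could be.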

from __future__ import print_function

import logging
import os.path
import six
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments

    inp =  "trwiki-latest-pages-articles.xml.bz2"
    outp = "wiki_text_dump.txt"
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(' '.join(text).encode().decode('unicode_escape') + '\n')
        #   ###another method###
        #    output.write(
        #            space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
        else:
            output.write(space.join(text) + "\n")
            #output.write(text)
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

Error:

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-42-9404745af31b> in <module>()
     32     for text in wiki.get_texts():
     33         if six.PY3:
---> 34             output.write(' '.join(text).encode().decode('unicode_escape') + '\n')
     35         #   ###another method###
     36         #    output.write(

c:\users\m\appdata\local\programs\python\python37\lib\encodings\cp1254.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\x9f' in position 47: character maps to <undefined>

I also replaced 'unicode_escape' with 'utf-8', and then got this error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 87-92: character maps to <undefined>

Answer 1

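As the traceback shows, the error occurs during an encode step (the failing frame is encode in cp1254.py), not during the call to .decode. You therefore cannot fix it by changing the codec passed to .decode.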
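The failing step can be reproduced in isolation. This is a minimal sketch, assuming the character from the traceback ('\x9f') and the cp1254 codec named in it; the helper name try_encode is hypothetical and only for illustration.

```python
def try_encode(text, codec="cp1254"):
    """Return the encoded bytes, or the exception if the codec rejects the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return exc


# '\x9f' is the character reported in the traceback; cp1254 has no mapping for it.
result = try_encode("\x9f")

# The same character encodes without trouble as UTF-8.
utf8_result = try_encode("\x9f", codec="utf-8")

print(type(result).__name__)
print(utf8_result)
```

No .decode call is involved here, which is consistent with the point that swapping the .decode codec cannot make the error go away.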

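Since the code is running on Python 3.x (six.PY3 is true; but why worry about 2.x compatibility in new code written today?), and since ' '.join(text) worked, we can conclude that text is a string or a list of strings (not bytes or a list of bytes), and that ' '.join(text) is a string. Indeed, the documentation tells us that WikiCorpus already provides strings.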
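The type argument can be checked directly. A small sketch, assuming a hypothetical token list that stands in for one item yielded by get_texts(); the sample words are made up for illustration.

```python
# Hypothetical sample standing in for one item from wiki.get_texts().
tokens = ["türkçe", "vikipedi", "şehir"]

# Joining str tokens with a str separator yields a str.
line = " ".join(tokens)

# Joining bytes tokens with a str separator fails in Python 3,
# so a successful join rules out a list of bytes.
try:
    " ".join([b"bytes", b"tokens"])
    bytes_join_ok = True
except TypeError:
    bytes_join_ok = False

print(type(line).__name__, bytes_join_ok)
```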

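This string contains some characters that the cp1254 codec (the Windows code page intended for Turkish text) cannot encode. I am not clear on what you hope to accomplish by encoding and then decoding again. Just use the string. In fact, text should already be a string that needs no .join at all (unless, for some reason, you want a space after every letter). You should verify this yourself by debugging.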
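One way to "just use the string" is sketched below. It rests on two assumptions not stated in the thread: that the cp1254 frame in the traceback comes from the output file being opened with the platform default encoding, and that passing encoding="utf-8" to open() is therefore the relevant change. The sample text and the temporary path are hypothetical.

```python
import os
import tempfile

# Hypothetical Turkish sample; real input would come from wiki.get_texts().
sample = "şehir ığdır"

# Encoding to UTF-8 and decoding with 'unicode_escape' reads each byte as
# Latin-1, which turns 'ş' (bytes C5 9F) into 'Å' followed by '\x9f'.
mangled = sample.encode().decode("unicode_escape")

# Writing the untouched string to a file opened explicitly as UTF-8
# avoids depending on the platform default code page.
path = os.path.join(tempfile.mkdtemp(), "wiki_text_dump.txt")
with open(path, "w", encoding="utf-8") as output:
    output.write(sample + "\n")

# Read the file back to confirm the text survived unchanged.
with open(path, encoding="utf-8") as check:
    round_tripped = check.read()

print(repr(mangled))
print(repr(round_tripped))
```

The mangled value contains '\x9f', the same character the traceback complains about, which suggests the encode/decode round trip itself introduces it.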
