Question

I have some code that converts docx files to plain text:

import docx
import glob

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

for file in glob.glob('*.docx'):
    outfile = open(file.replace('.docx', '-out.txt'), 'w', encoding='utf8')


for line in open(file):
    print(getText(filename), end='', file=outfile)
outfile.close()

However, when I run it, I get the following error:

Traceback (most recent call last):
  File "C:\Users\User\Desktop\add spaces docx\converting docx to pure text.py", line 16, in <module>
    for line in open(file):
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 591: character maps to <undefined>

I am using Python 3.5.2.

Can anyone help me solve this problem?

Thanks in advance.

Answer 1

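I don't know the docx module very well, but I think I can find a solution.

According to fileformat, Unicode character U+008F is a control character. The corresponding byte, 0x8f, is the one the charmap codec cannot decode, which is what raises the UnicodeDecodeError.

When reading files (which seems to be what the docx module is doing), you should watch out for control characters, because sometimes Python cannot decode them.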

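The solution is to drop the docx module, learn how .docx files work and how they are formatted, and, when you read a docx file, use open(filename, "rb") so that Python reads the raw bytes instead of trying to decode them.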
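To illustrate the difference between text mode and binary mode, here is a minimal sketch. It assumes a tiny stand-in file rather than a real document: the bytes in `sample` are hypothetical and contain only the ZIP signature (a .docx file is a ZIP archive) followed by the 0x8f byte from the traceback. The helper names `read_raw` and `read_as_cp1252` are made up for this example.

```python
import os
import tempfile


def read_raw(path):
    """Return the file contents as bytes; binary mode performs no decoding."""
    with open(path, "rb") as fh:
        return fh.read()


def read_as_cp1252(path):
    """Read the file as text, decoding with cp1252 as in the traceback."""
    with open(path, encoding="cp1252") as fh:
        return fh.read()


# Hypothetical stand-in for a .docx file: ZIP signature plus the byte 0x8f.
sample = b"PK\x03\x04\x8f"
with tempfile.NamedTemporaryFile(delete=False, suffix=".docx") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# Binary mode hands back the bytes untouched.
raw = read_raw(sample_path)

# Text mode tries to decode every byte and fails on 0x8f.
try:
    read_as_cp1252(sample_path)
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True

os.remove(sample_path)
```

Binary mode only avoids the decoding step. Getting readable text out of the bytes still requires parsing the archive and the XML inside it.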

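However, that may not be the problem. As you can see from the encodings directory in the traceback, Python is using cp1252 as the (default) encoding instead of utf-8. Try changing it to utf_8.py (for me, it shows up as utf_8.pyc).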
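As a hedged alternative to touching anything in the encodings directory, the encoding can be passed explicitly to open(). The sketch below assumes a plain text file. The sample string and the temporary file are hypothetical and only there to make the example self-contained.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given (cp1252 on many
# Windows installations, which matches the traceback).
default_encoding = locale.getpreferredencoding(False)

# Hypothetical sample text containing non-ASCII characters.
text = "naïve café"

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# Passing encoding= explicitly overrides the platform default for this file.
with open(path, "w", encoding="utf-8") as fh:
    fh.write(text)

with open(path, encoding="utf-8") as fh:
    round_trip = fh.read()

os.remove(path)
```

This only applies to files that really contain text. A binary file such as a .docx will not decode cleanly under any text encoding.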

Note: Sorry for the lack of links. My reputation is not above 10 (I am new to Stack Overflow).
