I'm trying to convert a docx file to text, but I keep getting an error. I'm using Python 2.7.
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
Traceback:
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 764: character maps to <undefined>
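Answer 0 (score: 3)
It looks like it doesn't like \u2019, and possibly \u2018 as well. These are the right and left single quotation marks. I encode the unicode data to ASCII and ignore anything that can't be converted, which removes them: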
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        txt = para.text.encode('ascii', 'ignore')
        fullText.append(txt)
    return '\n'.join(fullText)
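One side effect of 'ignore' is that the quote characters are dropped entirely, so a word like don\u2019t comes out as dont. If that matters, a rough sketch of an alternative is to map the typographic quotes to their ASCII equivalents first and only then encode. The names SMART_PUNCTUATION and to_ascii below are made up for illustration and are not part of python-docx; the function takes a plain unicode string such as para.text.

```python
# Hypothetical helper (not part of python-docx): swap typographic quotes
# for ASCII ones before encoding, so apostrophes are kept instead of dropped.
SMART_PUNCTUATION = {
    u"\u2018": u"'",   # left single quotation mark
    u"\u2019": u"'",   # right single quotation mark
    u"\u201c": u'"',   # left double quotation mark
    u"\u201d": u'"',   # right double quotation mark
}

def to_ascii(text):
    """Return text encoded as ASCII, with smart quotes replaced first."""
    for smart, plain in SMART_PUNCTUATION.items():
        text = text.replace(smart, plain)
    # Anything else that is not ASCII is still dropped by 'ignore'.
    return text.encode("ascii", "ignore")

sample = u"don\u2019t"
dropped = sample.encode("ascii", "ignore")  # apostrophe is lost
kept = to_ascii(sample)                     # apostrophe is kept
```

Inside getText this would be used as fullText.append(to_ascii(para.text)).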
Answer 1 (score: 0)
It looks like the problem is the right single quotation mark. Could you do something like this:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        # Document has no replace() method, so do the replacement
        # on each paragraph's text instead.
        fullText.append(para.text.replace(u"\u2019", u"'"))
    return '\n'.join(fullText)
Replying from my phone, so I can't test this.
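If the goal is to keep the original character rather than replace it, another option is to choose the output encoding yourself. The 'charmap' codec in the traceback suggests the failure happens when the text is encoded for output with a code page that has no slot for U+2019; that reading is an assumption, since the traceback is truncated. A minimal sketch, assuming the text ends up in a file: io.open accepts an encoding argument on both Python 2.7 and Python 3. The helper names save_text and load_text and the temporary path are made up for the example.

```python
import io
import os
import tempfile

def save_text(text, path):
    """Write unicode text to path as UTF-8 (hypothetical helper)."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

def load_text(path):
    """Read UTF-8 text back as unicode (hypothetical helper)."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()

sample = u"It\u2019s a test"

# Stand-in for wherever the extracted text should go.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
save_text(sample, path)
round_trip = load_text(path)

# For comparison: a legacy code page such as cp437 cannot encode U+2019.
try:
    sample.encode("cp437")
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False
```

With this approach the quote survives the round trip unchanged, at the cost of the output file being UTF-8 rather than ASCII.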